Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Rankfile related problems
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-02-27 13:35:42


I can't seem to replicate this first problem - it runs fine for me even if the rankfile contains only one entry.

What you could do is build 1.4.1 with --enable-debug and then run with --debug-devel -mca rmaps_base_verbose 5 to get more info on what is happening.
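
Something along these lines should do it (the install prefix and the rankfile/app names below are just placeholders):

  # reconfigure and rebuild 1.4.1 with debugging enabled
  ./configure --prefix=$HOME/ompi-1.4.1-dbg --enable-debug
  make all install

  # re-run the failing case with developer debug output and verbose
  # output from the rank-mapping (rmaps) framework
  $HOME/ompi-1.4.1-dbg/bin/mpirun --debug-devel -mca rmaps_base_verbose 5 \
      -np 4 -rf ./rankfile ./your_app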

On Feb 27, 2010, at 11:26 AM, Ralph Castain wrote:

> I'm looking at the first problem - will get back to you on it.
>
> As to the second issue: it was r21728, and no - it does not appear to have been moved to the 1.4 series (the rankfile code is not tested on a regular basis). I will do so now.
>
> Thanks!
> Ralph
>
> On Feb 15, 2010, at 10:39 AM, Bogdan Costescu wrote:
>
>> Hi!
>>
>> With version 1.4.1 I get a rather strange crash in mpirun whenever I
>> try to run a job using a rankfile which (I think) doesn't contain the
>> specified number of ranks. E.g., I ask for 4 ranks ('-np 4'), but the
>> rankfile contains only one entry:
>>
>> rank 0=mbm-01-24 slot=1:*
>>
>> and the following comes out:
>>
>> [mbm-01-24:20985] *** Process received signal ***
>> [mbm-01-24:20985] Signal: Segmentation fault (11)
>> [mbm-01-24:20985] Signal code: Address not mapped (1)
>> [mbm-01-24:20985] Failing at address: 0x50
>> [mbm-01-24:20985] [ 0] /lib64/libpthread.so.0 [0x2b9de894f7c0]
>> [mbm-01-24:20985] [ 1]
>> /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbb)
>> [0x2b9de79b9f7b]
>> [mbm-01-24:20985] [ 2]
>> /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d0)
>> [0x2b9de79d49c0]
>> [mbm-01-24:20985] [ 3]
>> /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xbc)
>> [0x2b9de79e1fcc]
>> [mbm-01-24:20985] [ 4]
>> /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b9de79e6251]
>> [mbm-01-24:20985] [ 5] mpirun [0x403782]
>> [mbm-01-24:20985] [ 6] mpirun [0x402cb4]
>> [mbm-01-24:20985] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b9de8b79994]
>> [mbm-01-24:20985] [ 8] mpirun [0x402bd9]
>> [mbm-01-24:20985] *** End of error message ***
>> Segmentation fault
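>>
>> (For the record, the launch was of the usual form, something like:
>>
>> mpirun -np 4 -rf ./rankfile ./my_app
>>
>> where ./rankfile holds the single line above and ./my_app stands in
>> for the actual application.)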
>>
>> However if the rankfile contains a second entry, like:
>>
>> rank 0=mbm-01-24 slot=1:*
>> rank 1=mbm-01-24 slot=1:*
>>
>> I get an error, but no segmentation fault. I guess that the
>> segmentation fault is unintended... Is this known? If not, how could
>> I debug this?
>>
>> Now to the second problem: the exact same error keeps coming even if I
>> specify 4 ranks; the messages are:
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered an error:
>>
>> Error name: Error
>> Node: mbm-01-24
>>
>> when attempting to start process rank 0.
>> --------------------------------------------------------------------------
>> [mbm-01-24:21011] Rank 0: PAFFINITY cannot get physical core id for
>> logical core 4 in physical socket 1 (1)
>> --------------------------------------------------------------------------
>> We were unable to successfully process/set the requested processor
>> affinity settings:
>>
>> Specified slot list: 1:*
>> Error: Error
>>
>> This could mean that a non-existent processor was specified, or
>> that the specification had improper syntax.
>> --------------------------------------------------------------------------
>>
>> The node has 2 sockets, each with 4 cores, so what I'm trying to achieve
>> is using the 4 cores of the second socket. When searching the archives,
>> I stumbled on an e-mail from not too long ago which seemingly dealt
>> with the same error:
>>
>> http://www.open-mpi.org/community/lists/devel/2009/07/6513.php
>>
>> which suggests that a fix was found, but no commit was specified, so I
>> can't track down whether it was also applied to the stable series.
>> Could someone more knowledgeable in this area shed some light?
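>>
>> For completeness, the 4-rank case uses a rankfile along these lines,
>> i.e. every rank pinned to the second socket with the same slot list:
>>
>> rank 0=mbm-01-24 slot=1:*
>> rank 1=mbm-01-24 slot=1:*
>> rank 2=mbm-01-24 slot=1:*
>> rank 3=mbm-01-24 slot=1:*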
>>
>> Thanks in advance!
>> Bogdan
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>