Open MPI Development Mailing List Archives

Subject: [OMPI devel] Rankfile related problems
From: Bogdan Costescu (bcostescu_at_[hidden])
Date: 2010-02-15 12:39:52


With version 1.4.1 I get a rather strange crash in mpirun whenever I
try to run a job using a rankfile which (I think) doesn't contain
entries for all of the requested ranks. For example, I ask for 4 ranks
('-np 4'), but the rankfile contains only one entry:

rank 0=mbm-01-24 slot=1:*

and the following comes out:

[mbm-01-24:20985] *** Process received signal ***
[mbm-01-24:20985] Signal: Segmentation fault (11)
[mbm-01-24:20985] Signal code: Address not mapped (1)
[mbm-01-24:20985] Failing at address: 0x50
[mbm-01-24:20985] [ 0] /lib64/ [0x2b9de894f7c0]
[mbm-01-24:20985] [ 1]
[mbm-01-24:20985] [ 2]
[mbm-01-24:20985] [ 3]
[mbm-01-24:20985] [ 4] /sw/openmpi/1.4.1/gcc/4.4.3/lib/ [0x2b9de79e6251]
[mbm-01-24:20985] [ 5] mpirun [0x403782]
[mbm-01-24:20985] [ 6] mpirun [0x402cb4]
[mbm-01-24:20985] [ 7] /lib64/ [0x2b9de8b79994]
[mbm-01-24:20985] [ 8] mpirun [0x402bd9]
[mbm-01-24:20985] *** End of error message ***
Segmentation fault
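
For reference, the invocation is essentially the following (the
rankfile name and the application name are just placeholders for the
actual ones):

mpirun -np 4 --rankfile ./rankfile ./hello_mpi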

However, if the rankfile contains a second entry, like:

rank 0=mbm-01-24 slot=1:*
rank 1=mbm-01-24 slot=1:*

I get an error, but no segmentation fault. I guess the segmentation
fault is unintended... Is this known? If not, how could I debug this?
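
In case a backtrace would help: I assume the way to get one is to run
mpirun itself under gdb, roughly like this (provided the 1.4.1 build
was compiled with debugging symbols):

gdb --args mpirun -np 4 --rankfile ./rankfile ./hello_mpi
(gdb) run
(gdb) bt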

Now to the second problem: the exact same error keeps coming even if I
specify 4 ranks; the messages are:

mpirun was unable to start the specified application as it encountered an error:

Error name: Error
Node: mbm-01-24

when attempting to start process rank 0.
[mbm-01-24:21011] Rank 0: PAFFINITY cannot get physical core id for
logical core 4 in physical socket 1 (1)
We were unable to successfully process/set the requested processor
affinity settings:

Specified slot list: 1:*
Error: Error

This could mean that a non-existent processor was specified, or
that the specification had improper syntax.
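
(For the 4-rank case the rankfile simply repeats the same pattern for
each rank, i.e. something like:

rank 0=mbm-01-24 slot=1:*
rank 1=mbm-01-24 slot=1:*
rank 2=mbm-01-24 slot=1:*
rank 3=mbm-01-24 slot=1:*

so every rank should be allowed to run on any core of socket 1.)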

The node has 2 sockets, each with 4 cores, so what I'm trying to
achieve is using the 4 cores of the second socket (a sketch of how the
layout can be checked is at the end of this mail). When searching the
archives,
I stumbled on an e-mail from not too long ago which seemingly dealt
with the same error:

which suggests that a fix was found, but no commit was specified, so I
can't track down whether it was actually also applied to the stable
series. Could someone more knowledgeable in this area shed some light
on this?
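
For reference, the 2-socket/4-core layout mentioned above can be read
straight from /proc/cpuinfo, e.g.:

grep -E 'processor|physical id|core id' /proc/cpuinfo

which on this node should show physical ids 0 and 1, each with four
distinct core ids.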

Thanks in advance!