Circling some off-list comments back to the list... While we could and
should error out more gracefully, this really isn't a supportable operation.
What the command

mpirun -n 2 -slot-list 1,3 foo

appears to do is cause us to launch a 2-process job consisting of
vpid=1 and vpid=3, as opposed to the normal vpid=0 and vpid=1.
Not only is ORTE not prepared to handle this scenario, but I believe it
will also cause problems in some areas within OMPI.
I can try to make it fail nicer - someone with more knowledge of the
intended slot-list behavior would have to make it do what they
actually intended, or at least explain what is supposed to happen.
On Sep 24, 2009, at 7:03 PM, Eugene Loh wrote:
> mpirun -V
> mpirun (Open MPI) 1.4a1-1
> Ralph Castain wrote:
>> Sigh - you really need to remember to tell us what version you're
>> talking about.
>> On Sep 24, 2009, at 5:39 PM, Eugene Loh wrote:
>>> I assume this is a bug?
>>> % mpirun -np 2 -slot-list 1,3 hostname
>>> [saem9:10337] [[455,0],0] ORTE_ERROR_LOG: Not found in file base/
>>> odls_base_default_fns.c at line 875
>>> [saem9:10337] *** Process received signal ***
>>> [saem9:10337] Signal: Segmentation fault (11)
>>> [saem9:10337] Signal code: Address not mapped (1)
>>> [saem9:10337] Failing at address: 0x4c
>>> [saem9:10337] [ 0] [0xffffe600]
>>> [saem9:10337] [ 1] /home/eugene/CTperf/test-CT821/paff_bug2/src/
>>> [saem9:10337] [ 2] /home/eugene/CTperf/test-CT821/paff_bug2/src/myopt/lib/openmpi/mca_plm_rsh.so [0xf7d13564]
>>> [saem9:10337] [ 3] mpirun [0x804b49d]
>>> [saem9:10337] [ 4] mpirun [0x804a456]
>>> [saem9:10337] [ 5] /lib/libc.so.6(__libc_start_main+0xdc)
>>> [saem9:10337] [ 6] mpirun(orte_daemon_recv+0x201) [0x804a3b1]
>>> [saem9:10337] *** End of error message ***
>>> Segmentation fault
> devel mailing list