Subject: Re: [OMPI devel] Rankfile related problems
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-01 15:15:17


Tracking this down has reminded me of all the reasons why I despise the rankfile mapper... :-/

I have created a fix for this mess and will submit it for inclusion in 1.4.

Thanks - and pardon the comments; none of this is your fault. I've just had my fill of this particular code, since its creators no longer support it.
Ralph

On Mar 1, 2010, at 9:15 AM, Bogdan Costescu wrote:

> On Sat, Feb 27, 2010 at 7:35 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> I can't seem to replicate this first problem - it runs fine for me even if the rankfile contains only one entry.
>
> First of all, thanks for taking a look at this!
>
> For me it's repeatable. Please note that I do specify '-np 4' even
> when there is only one entry in the ranks file; I've just checked that
> this also happens with random values given to -np. The only time I
> don't get a segv is with '-np 1', in which case I get the 'PAFFINITY
> cannot get physical core id...' error message instead. However, with
> other combinations, like 2 entries in the ranks file and '-np 4', the
> segv doesn't appear, only the error message. Anyway, for the original
> case (one entry in the ranks file and '-np 4'), the output obtained
> with the suggested debug is:
>
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [load_balance]
> [mbm-01-24:24102] mca:base:select:(rmaps) Skipping component [load_balance]. Query failed to return a module
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [rank_file]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [rank_file] set priority to 100
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [round_robin]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [round_robin] set priority to 70
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [seq]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [seq] set priority to 0
> [mbm-01-24:24102] mca:base:select:(rmaps) Selected component [rank_file]
> [mbm-01-24:24102] procdir: /tmp/openmpi-sessions-bq_bcostescu_at_mbm-01-24_0/36756/0/0
> [mbm-01-24:24102] jobdir: /tmp/openmpi-sessions-bq_bcostescu_at_mbm-01-24_0/36756/0
> [mbm-01-24:24102] top: openmpi-sessions-bq_bcostescu_at_mbm-01-24_0
> [mbm-01-24:24102] tmp: /tmp
> [mbm-01-24:24102] mpirun: reset PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin:/usr/local/bin:/bin:/usr/bin:/home/bq_bcostescu/bin
> [mbm-01-24:24102] mpirun: reset LD_LIBRARY_PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib
> [mbm-01-24:24102] [[36756,0],0] hostfile: checking hostfile hosts for nodes
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: adding node mbm-01-24 to map
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:compute_usage
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons existing daemon [[36756,0],0] already launched
> [mbm-01-24:24102] *** Process received signal ***
> [mbm-01-24:24102] Signal: Segmentation fault (11)
> [mbm-01-24:24102] Signal code: Address not mapped (1)
> [mbm-01-24:24102] Failing at address: 0x70
> [mbm-01-24:24102] [ 0] /lib64/libpthread.so.0 [0x2b04e8c727c0]
> [mbm-01-24:24102] [ 1] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0x140) [0x2b04e7c5b312]
> [mbm-01-24:24102] [ 2] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0xb31) [0x2b04e7c89557]
> [mbm-01-24:24102] [ 3] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x1f6) [0x2b04e7ca9210]
> [mbm-01-24:24102] [ 4] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b04e7cb3f2f]
> [mbm-01-24:24102] [ 5] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x403d3b]
> [mbm-01-24:24102] [ 6] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402ee4]
> [mbm-01-24:24102] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b04e8e9c994]
> [mbm-01-24:24102] [ 8] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402e09]
> [mbm-01-24:24102] *** End of error message ***
> Segmentation fault
>
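> For concreteness, the failing case corresponds to a setup along these
> lines (the slot value and the './hello' test executable are
> illustrative, not my exact ones):
>
>   # rankfile with a single entry (slot value illustrative):
>   $ cat rankfile
>   rank 0=mbm-01-24 slot=0
>   # './hello' stands in for any test executable:
>   $ mpirun -np 4 --hostfile hosts --rankfile rankfile ./hello
>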
> After applying r21728 by hand to the original 1.4.1, I can start the
> job properly as expected and the 'PAFFINITY cannot get physical core
> id...' error message no longer appears, so I'd like to ask for it to
> be applied to the 1.4 series. With this patch, I've tested the
> following combinations:
>
> entries in ranks file   -np   result
> ---------------------   ---   ------
>           1              1    OK
>           1              2    segv
>           1              4    segv
>           2              1    OK
>           2              2    OK
>           2              4    OK
>           4              4    OK
>
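> (For reference, a loop along these lines sweeps the combinations
> above; the generated rankfile contents and the test executable are
> again illustrative.)
>
>   #!/bin/sh
>   # Sketch: try each (rankfile entries, -np) combination.
>   for entries in 1 2 4; do
>     rm -f rankfile
>     i=0
>     while [ $i -lt $entries ]; do
>       # one 'rank N=host slot=N' line per entry (host/slots illustrative)
>       echo "rank $i=mbm-01-24 slot=$i" >> rankfile
>       i=$((i+1))
>     done
>     for np in 1 2 4; do
>       echo "=== $entries rankfile entries, -np $np ==="
>       mpirun -np $np --hostfile hosts --rankfile rankfile ./hello
>     done
>   done
>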
> So the segvs really only appear when there is a single entry in the
> ranks file; if I'm the only one able to reproduce them, I'd be happy
> to look into it myself, given some guidance about the relevant source
> code location...
>
> Cheers,
> Bogdan
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel