Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-07-14 12:29:23


No, it's not working as I expect, unless I'm expecting something wrong.
(Sorry for the long PATH; I needed to provide it.)

$LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
-np 2 -H witch1,witch2 hostname
witch1
witch2

$LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
-np 2 -H witch1,witch2 -app appfile
dellix7
dellix7
$cat appfile
-np 1 hostname
-np 1 hostname

On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Run it without the appfile, just putting the apps on the cmd line - does it
> work right then?
>
> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>
> additional info
> I am running mpirun on hostA, and providing hostlist with hostB and hostC.
> I expect that each application would run on hostB and hostC, but I get all
> of them running on hostA.
> dellix7$cat appfile
> -np 1 hostname
> -np 1 hostname
> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
> dellix7
> dellix7
> Thanks
> Lenny.
>
> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Strange - let me have a look at it later today. Probably something simple
>> that another pair of eyes might spot.
>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>
>> Seems like a related problem:
>> I can't use a rankfile with -app, even after all those fixes (working with
>> trunk 1.4a1r21657).
>> This is my case:
>>
>> $cat rankfile
>> rank 0=+n1 slot=0
>> rank 1=+n0 slot=0
>> $cat appfile
>> -np 1 hostname
>> -np 1 hostname
>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>> --------------------------------------------------------------------------
>> Rankfile claimed host +n1 by index that is bigger than number of allocated
>> hosts.
>> --------------------------------------------------------------------------
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>>
>> The problem is that the rankfile mapper looks for the requested host in
>> the partial (not the full) host list.
>>
>> Any suggestions on how to fix it?
>>
>> Thanks
>> Lenny.
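The failure described above can be pictured with a small sketch. This is not Open MPI code: `resolve_relative_host` is a hypothetical helper, and the contents of the "partial" host list are an assumption made only to reproduce the "index that is bigger than number of allocated hosts" condition from the error message.

```python
import re


def resolve_relative_host(spec, hosts):
    """Resolve a '+n<K>' rankfile host entry against a list of host names.

    Plain host names are returned unchanged. A relative entry is looked up
    by index and rejected when the index is past the end of the list.
    """
    match = re.fullmatch(r"\+n(\d+)", spec)
    if match is None:
        return spec  # not a relative entry, treat it as a host name
    index = int(match.group(1))
    if index >= len(hosts):
        raise ValueError(
            f"host {spec} is indexed past the {len(hosts)} allocated host(s)"
        )
    return hosts[index]


# Full list as given with -H witch1,witch2: both entries resolve.
full_list = ["witch1", "witch2"]
print(resolve_relative_host("+n1", full_list))  # witch2
print(resolve_relative_host("+n0", full_list))  # witch1

# Hypothetical partial list holding a single host: +n1 cannot resolve.
partial_list = ["witch1"]
try:
    resolve_relative_host("+n1", partial_list)
except ValueError as err:
    print("rejected:", err)
```

If the lookup were done against the full list, both rankfile entries in the example would resolve.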
>>
>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Okay, I fixed this today too....r21219
>>>
>>>
>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>>
>>> Now there is another problem :)
>>>>
>>>> You can try to oversubscribe a node, at least by 1 task.
>>>> If your hostfile and rankfile limit you to N procs, you can ask mpirun
>>>> for N+1 and it will not be rejected,
>>>> although in reality there will be only N tasks.
>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5"
>>>> both work, but in both cases there are only 4 tasks. It isn't crucial,
>>>> because there is no real oversubscription, but there is still a bug
>>>> that could affect something in the future.
>>>>
>>>> --
>>>> Anton Starikov.
>>>>
>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>>
>>>> This is fixed as of r21208.
>>>>>
>>>>> Thanks for reporting it!
>>>>> Ralph
>>>>>
>>>>>
>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>>
>>>>> Although removing this check solves the problem of having more slots in
>>>>>> the rankfile than necessary, there is another problem.
>>>>>>
>>>>>> If I set rmaps_base_no_oversubscribe=1, then with, for example:
>>>>>>
>>>>>>
>>>>>> hostfile:
>>>>>>
>>>>>> node01
>>>>>> node01
>>>>>> node02
>>>>>> node02
>>>>>>
>>>>>> rankfile:
>>>>>>
>>>>>> rank 0=node01 slot=1
>>>>>> rank 1=node01 slot=0
>>>>>> rank 2=node02 slot=1
>>>>>> rank 3=node02 slot=0
>>>>>>
>>>>>> mpirun -np 4 ./something
>>>>>>
>>>>>> complains with:
>>>>>>
>>>>>> "There are not enough slots available in the system to satisfy the 4
>>>>>> slots
>>>>>> that were requested by the application"
>>>>>>
>>>>>> but "mpirun -np 3 ./something" works. It works when you
>>>>>> ask for 1 CPU less, and the behavior is the same in every case (shared nodes,
>>>>>> non-shared nodes, multi-node).
>>>>>>
>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all
>>>>>> affinities are set as requested in the rankfile; there is no oversubscription.
>>>>>>
>>>>>>
>>>>>> Anton.
>>>>>>
>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>>>
>>>>>> Ah - thx for catching that, I'll remove that check. It no longer is
>>>>>>> required.
>>>>>>>
>>>>>>> Thx!
>>>>>>>
>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <
>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
>>>>>>> According to the code, it does care.
>>>>>>>
>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>>>>>>
>>>>>>> ival = orte_rmaps_rank_file_value.ival;
>>>>>>> if ( ival > (np-1) ) {
>>>>>>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
>>>>>>>                    ival, rankfile);
>>>>>>>     rc = ORTE_ERR_BAD_PARAM;
>>>>>>>     goto unlock;
>>>>>>> }
>>>>>>>
>>>>>>> If I remember correctly, I used an array to map ranks, and since the
>>>>>>> length of the array is NP, the maximum index must be less than NP; so if a
>>>>>>> rank number is >= NP, there is no place to put it in the array.
>>>>>>>
>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we map
>>>>>>> the additional procs either byslot (default) or bynode (if you specify that
>>>>>>> option). So the rankfile doesn't need to contain an entry for every proc."
>>>>>>> - Correct point.
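A rough Python rendering of the quoted check may help show which rankfile entries it rejects. This is only a sketch: `ranks_rejected_by_check` and the line-matching regular expression are hypothetical, and it assumes (as the error in this thread suggests) that the check saw np=1 when each app context was started with -n 1.

```python
import re

# Matches lines such as "rank 1=node2 slot=*" (illustrative pattern only).
RANK_LINE = re.compile(r"rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)")


def ranks_rejected_by_check(rankfile_text, np):
    """Return the rank numbers for which 'ival > (np-1)' is true."""
    rejected = []
    for line in rankfile_text.splitlines():
        match = RANK_LINE.fullmatch(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        ival = int(match.group(1))
        if ival > np - 1:
            rejected.append(ival)
    return rejected


rankf = """rank 0=node1 slot=*
rank 1=node2 slot=*"""

print(ranks_rejected_by_check(rankf, 1))  # [1]
print(ranks_rejected_by_check(rankf, 2))  # []
```

With np=2 the same rankfile passes, which matches the case where both processes are launched with a single -n 2.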
>>>>>>>
>>>>>>>
>>>>>>> Lenny.
>>>>>>>
>>>>>>>
>>>>>>> On 5/5/09, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>> Sorry Lenny, but
>>>>>>> that isn't correct. The rankfile mapper doesn't care if the rankfile
>>>>>>> contains additional info - it only maps up to the number of processes, and
>>>>>>> ignores anything beyond that number. So there is no need to remove the
>>>>>>> additional info.
>>>>>>>
>>>>>>> Likewise, if you have more procs than the rankfile specifies, we map
>>>>>>> the additional procs either byslot (default) or bynode (if you specify that
>>>>>>> option). So the rankfile doesn't need to contain an entry for every proc.
>>>>>>>
>>>>>>> Just don't want to confuse folks.
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
>>>>>>> Hi,
>>>>>>> The maximum rank number must be less than np.
>>>>>>> If np=1, then there is only rank 0 in the system, so rank 1 is
>>>>>>> invalid.
>>>>>>> Please remove "rank 1=node2 slot=*" from the rankfile.
>>>>>>> Best regards,
>>>>>>> Lenny.
>>>>>>>
>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <
>>>>>>> geopignot_at_[hidden]> wrote:
>>>>>>> Hi ,
>>>>>>>
>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>>>>>>> command doesn't work
>>>>>>>
>>>>>>> cat rankf:
>>>>>>> rank 0=node1 slot=*
>>>>>>> rank 1=node2 slot=*
>>>>>>>
>>>>>>> cat hostf:
>>>>>>> node1 slots=2
>>>>>>> node2 slots=2
>>>>>>>
>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname
>>>>>>> : --host node2 -n 1 hostname
>>>>>>>
>>>>>>> Error, invalid rank (1) in the rankfile (rankf)
>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>> rmaps_rank_file.c at line 403
>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>> base/rmaps_base_map_job.c at line 86
>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>> base/plm_base_launch_support.c at line 86
>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>> plm_rsh_module.c at line 1016
>>>>>>>
>>>>>>>
>>>>>>> Ralph, could you tell me whether my command syntax is correct? If
>>>>>>> not, could you give me the expected one?
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Geoffroy
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2009/4/30 Geoffroy Pignot <geopignot_at_[hidden]>
>>>>>>>
>>>>>>> Immediately Sir !!! :)
>>>>>>>
>>>>>>> Thanks again Ralph
>>>>>>>
>>>>>>> Geoffroy
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> Message: 2
>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>> Message-ID:
>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f7e6_at_[hidden]>
>>>>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>>>>
>>>>>>> I believe this is fixed now in our development trunk - you can
>>>>>>> download any
>>>>>>> tarball starting from last night and give it a try, if you like. Any
>>>>>>> feedback would be appreciated.
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Ah now, I didn't say it -worked-, did I? :-)
>>>>>>>
>>>>>>> Clearly a bug exists in the program. I'll try to take a look at it
>>>>>>> (if Lenny
>>>>>>> doesn't get to it first), but it won't be until later in the week.
>>>>>>>
>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>>>>>
>>>>>>> I agree with you, Ralph, and that's what I expect from Open MPI, but
>>>>>>> my second example shows that it's not working.
>>>>>>>
>>>>>>> cat hostfile.0
>>>>>>> r011n002 slots=4
>>>>>>> r011n003 slots=4
>>>>>>>
>>>>>>> cat rankfile.0
>>>>>>> rank 0=r011n002 slot=0
>>>>>>> rank 1=r011n003 slot=1
>>>>>>>
>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>>>>>> hostname
>>>>>>> ### CRASHED
>>>>>>>
>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > rmaps_rank_file.c at line 404
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > base/rmaps_base_map_job.c at line 87
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > base/plm_base_launch_support.c at line 77
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > plm_rsh_module.c at line 985
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>>> > attempting to
>>>>>>> > > launch so we are aborting.
>>>>>>> > >
>>>>>>> > > There may be more information reported by the environment (see
>>>>>>> > above).
>>>>>>> > >
>>>>>>> > > This may be because the daemon was unable to find all the needed
>>>>>>> > shared
>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>>>>> > have the
>>>>>>> > > location of the shared libraries on the remote nodes and this
>>>>>>> will
>>>>>>> > > automatically be forwarded to the remote nodes.
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>>>>> > process
>>>>>>> > > that caused that situation.
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > orterun: clean termination accomplished
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Message: 4
>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>> Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>>>>> DelSp="yes"
>>>>>>>
>>>>>>> The rankfile cuts across the entire job - it isn't applied on an
>>>>>>> app_context basis. So the ranks in your rankfile must correspond to
>>>>>>> the eventual rank of each process in the cmd line.
>>>>>>>
>>>>>>> Unfortunately, that means you have to count ranks. In your case, you
>>>>>>> only have four, so that makes life easier. Your rankfile would look
>>>>>>> something like this:
>>>>>>>
>>>>>>> rank 0=r001n001 slot=0
>>>>>>> rank 1=r001n002 slot=1
>>>>>>> rank 2=r001n001 slot=1
>>>>>>> rank 3=r001n002 slot=2
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
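The rank counting described above can be sketched as follows. This is an illustration, not an Open MPI tool: `build_rankfile` is a hypothetical helper, it assumes ranks are numbered consecutively in the order the app contexts appear on the command line, and the slot numbers are simply the ones from the example rankfile.

```python
def build_rankfile(app_contexts):
    """Build rankfile text from app contexts given in command-line order.

    Each app context is (num_procs, host, slots), with one slot entry per
    process. Ranks run across the whole job, not per app context.
    """
    lines = []
    rank = 0
    for num_procs, host, slots in app_contexts:
        if len(slots) != num_procs:
            raise ValueError("need exactly one slot per process")
        for slot in slots:
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return "\n".join(lines)


# The four app contexts from the command line quoted in this thread.
contexts = [
    (1, "r001n001", [0]),  # master.x options1
    (1, "r001n002", [1]),  # master.x options2
    (1, "r001n001", [1]),  # slave.x options3
    (1, "r001n002", [2]),  # slave.x options4
]
print(build_rankfile(contexts))
```

The printed lines reproduce the four-entry rankfile shown above.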
>>>>>>>
>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>>>>>>
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I agree that my examples are not very clear. What I want to do is to
>>>>>>> > launch a multi-executable application (masters-slaves) and benefit from the
>>>>>>> > processor affinity.
>>>>>>> > Could you show me how to convert this command to use the -rf option
>>>>>>> > (whatever the affinity is)?
>>>>>>> >
>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>>>>>>> > host r001n002 slave.x options4
>>>>>>> >
>>>>>>> > Thanks for your help
>>>>>>> >
>>>>>>> > Geoffroy
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > Message: 2
>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>>>>>> > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>> > To: Open MPI Users <users_at_[hidden]>
>>>>>>> > Message-ID:
>>>>>>> > <
>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]>
>>>>>>> > Content-Type: text/plain; charset="iso-8859-1"
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
>>>>>>> > while n=1, which means only rank 0 is present and can be allocated.
>>>>>>> >
>>>>>>> > NP must be greater than the largest rank in the rankfile.
>>>>>>> >
>>>>>>> > What exactly are you trying to do ?
>>>>>>> >
>>>>>>> > I tried to recreate your segv, but all I got was
>>>>>>> >
>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>>>>>>> > hostfile.0
>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>>>>>> > [witch19:30798] mca: base: component_find: paffinity
>>>>>>> > "mca_paffinity_linux"
>>>>>>> > uses an MCA interface that is not recognized (component MCA v1.0.0
>>>>>>> !=
>>>>>>> > supported MCA v2.0.0) -- ignored
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > It looks like opal_init failed for some reason; your parallel
>>>>>>> > process is
>>>>>>> > likely to abort. There are many reasons that a parallel process can
>>>>>>> > fail during opal_init; some of which are due to configuration or
>>>>>>> > environment problems. This failure appears to be an internal
>>>>>>> failure;
>>>>>>> > here's some additional information (which may only be relevant to
>>>>>>> an
>>>>>>> > Open MPI developer):
>>>>>>> >
>>>>>>> > opal_carto_base_select failed
>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>>>>>>> file
>>>>>>> > ../../orte/runtime/orte_init.c at line 78
>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>>>>>>> file
>>>>>>> > ../../orte/orted/orted_main.c at line 344
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while
>>>>>>> > attempting
>>>>>>> > to launch so we are aborting.
>>>>>>> >
>>>>>>> > There may be more information reported by the environment (see
>>>>>>> above).
>>>>>>> >
>>>>>>> > This may be because the daemon was unable to find all the needed
>>>>>>> > shared
>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>>>>> > have the
>>>>>>> > location of the shared libraries on the remote nodes and this will
>>>>>>> > automatically be forwarded to the remote nodes.
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > mpirun noticed that the job aborted, but has no info as to the
>>>>>>> process
>>>>>>> > that caused that situation.
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > mpirun: clean termination accomplished
>>>>>>> >
>>>>>>> >
>>>>>>> > Lenny.
>>>>>>> >
>>>>>>> >
>>>>>>> > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>>>>>>> > >
>>>>>>> > > Hi ,
>>>>>>> > >
>>>>>>> > > I am currently testing the process affinity capabilities of Open MPI, and I
>>>>>>> > > would like to know whether the rankfile behaviour I describe below is
>>>>>>> > > normal.
>>>>>>> > >
>>>>>>> > > cat hostfile.0
>>>>>>> > > r011n002 slots=4
>>>>>>> > > r011n003 slots=4
>>>>>>> > >
>>>>>>> > > cat rankfile.0
>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>> > > rank 1=r011n003 slot=1
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>> ##################################################################################
>>>>>>> > >
>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
>>>>>>> > > r011n002
>>>>>>> > > r011n003
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>> ##################################################################################
>>>>>>> > > but
>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>>>>>> > hostname
>>>>>>> > > ### CRASHED
>>>>>>> > > *
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > rmaps_rank_file.c at line 404
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > base/rmaps_base_map_job.c at line 87
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > base/plm_base_launch_support.c at line 77
>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>> file
>>>>>>> > > plm_rsh_module.c at line 985
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>>> > attempting to
>>>>>>> > > launch so we are aborting.
>>>>>>> > >
>>>>>>> > > There may be more information reported by the environment (see
>>>>>>> > above).
>>>>>>> > >
>>>>>>> > > This may be because the daemon was unable to find all the needed
>>>>>>> > shared
>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>>>>> > have the
>>>>>>> > > location of the shared libraries on the remote nodes and this
>>>>>>> will
>>>>>>> > > automatically be forwarded to the remote nodes.
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>>>>> > process
>>>>>>> > > that caused that situation.
>>>>>>> > >
>>>>>>> >
>>>>>>> --------------------------------------------------------------------------
>>>>>>> > > orterun: clean termination accomplished
>>>>>>> > > *
>>>>>>> > > It seems that the rankfile option is not propagated to the second command
>>>>>>> > > line; there is no global understanding of the ranking inside an mpirun
>>>>>>> > > command.
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>> ##################################################################################
>>>>>>> > >
>>>>>>> > > Assuming that, I tried to provide a rankfile to each command line:
>>>>>>> > >
>>>>>>> > > cat rankfile.0
>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>> > >
>>>>>>> > > cat rankfile.1
>>>>>>> > > rank 0=r011n003 slot=1
>>>>>>> > >
>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf
>>>>>>> > rankfile.1
>>>>>>> > > -n 1 hostname ### CRASHED
>>>>>>> > > *[r011n002:28778] *** Process received signal ***
>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11)
>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1)
>>>>>>> > > [r011n002:28778] Failing at address: 0x34
>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600]
>>>>>>> > > [r011n002:28778] [ 1]
>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>>>>>>> > > [0x5557decd]
>>>>>>> > > [r011n002:28778] [ 2]
>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>>>>> > 0(orte_plm_base_launch_apps+0x117)
>>>>>>> > > [0x555842a7]
>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>>>>>>> > mca_plm_rsh.so
>>>>>>> > > [0x556098c0]
>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>> > [0x804aa27]
>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>> > [0x804a022]
>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>>>>>>> > [0x9f1dec]
>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>> > [0x8049f71]
>>>>>>> > > [r011n002:28778] *** End of error message ***
>>>>>>> > > Segmentation fault (core dumped)*
>>>>>>> > >
>>>>>>> > >
>>>>>>> > >
>>>>>>> > > I hope that I've found a bug, because it would be very important for me to
>>>>>>> > > have this kind of capability:
>>>>>>> > > launching a multi-exe mpirun command line and being able to bind my exes and
>>>>>>> > > sockets together.
>>>>>>> > >
>>>>>>> > > Thanks in advance for your help
>>>>>>> > >
>>>>>>> > > Geoffroy
>>>>>>> > _______________________________________________
>>>>>>> > users mailing list
>>>>>>> > users_at_[hidden]
>>>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users