Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-15 07:02:28


Try your "not working" example without the -H on the mpirun cmd line -
i.e.,, just use "mpirun -np 2 -rf rankfile -app appfile". Does that
work?
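
In other words, keep the appfile from your "not working" example as-is
and just drop the -H from the mpirun line:

$cat appfile
-np 1 -H witch1 ./hello_world
-np 1 -H witch2 ./hello_world
$mpirun -np 2 -rf rankfile -app appfile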

Sorry to have to keep asking you to try things - I don't have a setup
here where I can test this as everything is RM managed.

On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:

>
> Thanks Ralph, after playing with prefixes it worked.
>
> I still have a problem running an appfile with a rankfile when I provide
> the full hostlist on the mpirun command line rather than in the appfile.
> Is this the planned behaviour, or can it be fixed?
>
> See Working example:
>
> $cat rankfile
> rank 0=+n1 slot=0
> rank 1=+n0 slot=0
> $cat appfile
> -np 1 -H witch1,witch2 ./hello_world
> -np 1 -H witch1,witch2 ./hello_world
>
> $mpirun -rf rankfile -app appfile
> Hello world! I'm 1 of 2 on witch1
> Hello world! I'm 0 of 2 on witch2
>
> See NOT working example:
>
> $cat appfile
> -np 1 -H witch1 ./hello_world
> -np 1 -H witch2 ./hello_world
> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
> --------------------------------------------------------------------------
> Rankfile claimed host +n1 by index that is bigger than number of
> allocated hosts.
> --------------------------------------------------------------------------
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in
> file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at
> line 422
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in
> file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in
> file ../../../../orte/mca/plm/base/plm_base_launch_support.c at line
> 103
> [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in
> file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>
>
>
> On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <rhc_at_[hidden]>
> wrote:
> Took a deeper look into this, and I think that your first guess was
> correct.
>
> When we changed hostfile and -host to be per-app-context options, it
> became necessary for you to put that info in the appfile itself. So
> try adding it there. What you would need in your appfile is the
> following:
>
> -np 1 -H witch1 hostname
> -np 1 -H witch2 hostname
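>
> and then launch with the appfile alone - i.e., without any -H on the
> mpirun cmd line, something like:
>
> mpirun -app appfile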
>
> That should get you what you want.
> Ralph
>
> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>
>> No, it's not working as I expect, unless I'm expecting something wrong.
>> (Sorry for the long PATH, I needed to provide it.)
>>
>> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/
>> build_x86-64/install/lib/ /hpc/home/USERS/lennyb/work/svn/ompi/
>> trunk/build_x86-64/install/bin/mpirun -np 2 -H witch1,witch2 hostname
>> witch1
>> witch2
>>
>> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/
>> build_x86-64/install/lib/ /hpc/home/USERS/lennyb/work/svn/ompi/
>> trunk/build_x86-64/install/bin/mpirun -np 2 -H witch1,witch2 -app
>> appfile
>> dellix7
>> dellix7
>> $cat appfile
>> -np 1 hostname
>> -np 1 hostname
>>
>>
>> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <rhc_at_[hidden]>
>> wrote:
>> Run it without the appfile, just putting the apps on the cmd line -
>> does it work right then?
>>
>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>>
>>> Additional info:
>>> I am running mpirun on hostA and providing a hostlist with hostB
>>> and hostC.
>>> I expect each application to run on hostB and hostC, but I get all
>>> of them running on hostA.
>>> dellix7$cat appfile
>>> -np 1 hostname
>>> -np 1 hostname
>>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
>>> dellix7
>>> dellix7
>>> Thanks
>>> Lenny.
>>>
>>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <rhc_at_[hidden]>
>>> wrote:
>>> Strange - let me have a look at it later today. Probably something
>>> simple that another pair of eyes might spot.
>>>
>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>>
>>>> Seems like a related problem:
>>>> I can't use a rankfile with an appfile, even after all those fixes
>>>> (working with trunk 1.4a1r21657).
>>>> This is my case :
>>>>
>>>> $cat rankfile
>>>> rank 0=+n1 slot=0
>>>> rank 1=+n0 slot=0
>>>> $cat appfile
>>>> -np 1 hostname
>>>> -np 1 hostname
>>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>>>> --------------------------------------------------------------------------
>>>> Rankfile claimed host +n1 by index that is bigger than number of
>>>> allocated hosts.
>>>> --------------------------------------------------------------------------
>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at
>>>> line 422
>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line
>>>> 85
>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file ../../../../orte/mca/plm/base/plm_base_launch_support.c at
>>>> line 103
>>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>>>
>>>>
>>>> The problem is that the rankfile mapper tries to find an appropriate
>>>> host in the partial (and not the full) hostlist.
>>>>
>>>> Any suggestions how to fix it?
>>>>
>>>> Thanks
>>>> Lenny.
>>>>
>>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <rhc_at_[hidden]>
>>>> wrote:
>>>> Okay, I fixed this today too....r21219
>>>>
>>>>
>>>>
>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>>>
>>>> Now there is another problem :)
>>>>
>>>> You can oversubscribe a node, at least by 1 task.
>>>> If your hostfile and rankfile limit you to N procs, you can ask
>>>> mpirun for N+1 and it will not be rejected,
>>>> although in reality there will be only N tasks.
>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -
>>>> np 5" both work, but in both cases there are only 4 tasks. It
>>>> isn't crucial, because there is no real oversubscription, but
>>>> there is still some bug which could affect something in the future.
>>>>
>>>> --
>>>> Anton Starikov.
>>>>
>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>>
>>>> This is fixed as of r21208.
>>>>
>>>> Thanks for reporting it!
>>>> Ralph
>>>>
>>>>
>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>
>>>> Although removing this check solves the problem of having more slots
>>>> in the rankfile than necessary, there is another problem.
>>>>
>>>> If I set rmaps_base_no_oversubscribe=1, then, for example, with:
>>>>
>>>>
>>>> hostfile:
>>>>
>>>> node01
>>>> node01
>>>> node02
>>>> node02
>>>>
>>>> rankfile:
>>>>
>>>> rank 0=node01 slot=1
>>>> rank 1=node01 slot=0
>>>> rank 2=node02 slot=1
>>>> rank 3=node02 slot=0
>>>>
>>>> mpirun -np 4 ./something
>>>>
>>>> complains with:
>>>>
>>>> "There are not enough slots available in the system to satisfy
>>>> the 4 slots
>>>> that were requested by the application"
>>>>
>>>> but "mpirun -np 3 ./something" will work though. It works, when
>>>> you ask for 1 CPU less. And the same behavior in any case (shared
>>>> nodes, non-shared nodes, multi-node)
>>>>
>>>> If you switch off rmaps_base_no_oversubscribe, then it works and
>>>> all affinities set as it requested in rankfile, there is no
>>>> oversubscription.
>>>>
>>>>
>>>> Anton.
>>>>
>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>
>>>> Ah - thx for catching that, I'll remove that check. It is no longer
>>>> required.
>>>>
>>>> Thx!
>>>>
>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]
>>>> > wrote:
>>>> According to the code, it does care.
>>>>
>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>>>
>>>> ival = orte_rmaps_rank_file_value.ival;
>>>> if ( ival > (np-1) ) {
>>>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
>>>>                    ival, rankfile);
>>>>     rc = ORTE_ERR_BAD_PARAM;
>>>>     goto unlock;
>>>> }
>>>>
>>>> If I remember correctly, I used an array to map ranks, and since
>>>> the length of the array is NP, the maximum index must be less than
>>>> NP; so if you have a rank number greater than NP-1, there is no
>>>> place to put it in the array.
>>>>
>>>> "Likewise, if you have more procs than the rankfile specifies, we
>>>> map the additional procs either byslot (default) or bynode (if
>>>> you specify that option). So the rankfile doesn't need to contain
>>>> an entry for every proc." - Correct point.
>>>>
>>>>
>>>> Lenny.
>>>>
>>>>
>>>> On 5/5/09, Ralph Castain <rhc_at_[hidden]> wrote: Sorry Lenny,
>>>> but that isn't correct. The rankfile mapper doesn't care if the
>>>> rankfile contains additional info - it only maps up to the number
>>>> of processes, and ignores anything beyond that number. So there
>>>> is no need to remove the additional info.
>>>>
>>>> Likewise, if you have more procs than the rankfile specifies, we
>>>> map the additional procs either byslot (default) or bynode (if
>>>> you specify that option). So the rankfile doesn't need to contain
>>>> an entry for every proc.
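>>>>
>>>> For example (a sketch with illustrative file and host names, assuming
>>>> a hostfile that lists node01 slots=2 and node02 slots=2):
>>>>
>>>> $cat rankfile
>>>> rank 0=node01 slot=0
>>>> rank 1=node02 slot=0
>>>> $mpirun --hostfile hostfile -rf rankfile -np 4 ./app
>>>>
>>>> Ranks 0 and 1 would be pinned as the rankfile says; ranks 2 and 3
>>>> would be mapped byslot by default (or bynode if you specify that
>>>> option).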
>>>>
>>>> Just don't want to confuse folks.
>>>> Ralph
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]
>>>> > wrote:
>>>> Hi,
>>>> The maximum rank number must be less than np.
>>>> If np=1 then there is only rank 0 in the system, so rank 1 is
>>>> invalid.
>>>> Please remove "rank 1=node2 slot=*" from the rankfile.
>>>> Best regards,
>>>> Lenny.
>>>>
>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <geopignot_at_[hidden]
>>>> > wrote:
>>>> Hi ,
>>>>
>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately
>>>> my command doesn't work
>>>>
>>>> cat rankf:
>>>> rank 0=node1 slot=*
>>>> rank 1=node2 slot=*
>>>>
>>>> cat hostf:
>>>> node1 slots=2
>>>> node2 slots=2
>>>>
>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1
>>>> hostname : --host node2 -n 1 hostname
>>>>
>>>> Error, invalid rank (1) in the rankfile (rankf)
>>>>
>>>> --------------------------------------------------------------------------
>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file rmaps_rank_file.c at line 403
>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file base/rmaps_base_map_job.c at line 86
>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file base/plm_base_launch_support.c at line 86
>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>> file plm_rsh_module.c at line 1016
>>>>
>>>>
>>>> Ralph, could you tell me if my command syntax is correct or not?
>>>> If not, could you give me the expected one?
>>>>
>>>> Regards
>>>>
>>>> Geoffroy
>>>>
>>>>
>>>>
>>>>
>>>> 2009/4/30 Geoffroy Pignot <geopignot_at_[hidden]>
>>>>
>>>> Immediately Sir !!! :)
>>>>
>>>> Thanks again Ralph
>>>>
>>>> Geoffroy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 2
>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>> To: Open MPI Users <users_at_[hidden]>
>>>> Message-ID:
>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f7e6_at_[hidden]>
>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>
>>>> I believe this is fixed now in our development trunk - you can
>>>> download any
>>>> tarball starting from last night and give it a try, if you like.
>>>> Any
>>>> feedback would be appreciated.
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>
>>>> Ah now, I didn't say it -worked-, did I? :-)
>>>>
>>>> Clearly a bug exists in the program. I'll try to take a look at
>>>> it (if Lenny
>>>> doesn't get to it first), but it won't be until later in the week.
>>>>
>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>>
>>>> I agree with you Ralph, and that's what I expect from openmpi,
>>>> but my second example shows that it's not working:
>>>>
>>>> cat hostfile.0
>>>> r011n002 slots=4
>>>> r011n003 slots=4
>>>>
>>>> cat rankfile.0
>>>> rank 0=r011n002 slot=0
>>>> rank 1=r011n003 slot=1
>>>>
>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>>> hostname
>>>> ### CRASHED
>>>>
>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > rmaps_rank_file.c at line 404
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > base/rmaps_base_map_job.c at line 87
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > base/plm_base_launch_support.c at line 77
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > plm_rsh_module.c at line 985
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>> > attempting to
>>>> > > launch so we are aborting.
>>>> > >
>>>> > > There may be more information reported by the environment (see
>>>> > above).
>>>> > >
>>>> > > This may be because the daemon was unable to find all the
>>>> needed
>>>> > shared
>>>> > > libraries on the remote node. You may set your
>>>> LD_LIBRARY_PATH to
>>>> > have the
>>>> > > location of the shared libraries on the remote nodes and this
>>>> will
>>>> > > automatically be forwarded to the remote nodes.
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>> > process
>>>> > > that caused that situation.
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > orterun: clean termination accomplished
>>>>
>>>>
>>>>
>>>> Message: 4
>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>> To: Open MPI Users <users_at_[hidden]>
>>>> Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>> DelSp="yes"
>>>>
>>>> The rankfile cuts across the entire job - it isn't applied on an
>>>> app_context basis. So the ranks in your rankfile must correspond to
>>>> the eventual rank of each process in the cmd line.
>>>>
>>>> Unfortunately, that means you have to count ranks. In your case,
>>>> you
>>>> only have four, so that makes life easier. Your rankfile would look
>>>> something like this:
>>>>
>>>> rank 0=r001n001 slot=0
>>>> rank 1=r001n002 slot=1
>>>> rank 2=r001n001 slot=1
>>>> rank 3=r001n002 slot=2
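>>>>
>>>> For your cmd line, the ranks would then pair up roughly like this
>>>> (counting the app contexts from left to right):
>>>>
>>>> rank 0 -> -n 1 -host r001n001 master.x options1
>>>> rank 1 -> -n 1 -host r001n002 master.x options2
>>>> rank 2 -> -n 1 -host r001n001 slave.x options3
>>>> rank 3 -> -n 1 -host r001n002 slave.x options4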
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > I agree that my examples are not very clear. What I want to do
>>>> > is to launch a multi-exe application (masters-slaves) and benefit
>>>> > from the processor affinity.
>>>> > Could you show me how to convert this command, using the -rf
>>>> > option (whatever the affinity is)?
>>>> >
>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host
>>>> r001n002
>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>>>> > host r001n002 slave.x options4
>>>> >
>>>> > Thanks for your help
>>>> >
>>>> > Geoffroy
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Message: 2
>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>>> > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>> > To: Open MPI Users <users_at_[hidden]>
>>>> > Message-ID:
>>>> > <453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]
>>>> >
>>>> > Content-Type: text/plain; charset="iso-8859-1"
>>>> >
>>>> > Hi,
>>>> >
>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1
>>>> > defined,
>>>> > while n=1, which means only rank 0 is present and can be
>>>> allocated.
>>>> >
>>>> > NP must be >= the largest rank in rankfile.
>>>> >
>>>> > What exactly are you trying to do?
>>>> >
>>>> > I tried to recreate your segv, but all I got was:
>>>> >
>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>>>> > hostfile.0
>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>>> > [witch19:30798] mca: base: component_find: paffinity
>>>> > "mca_paffinity_linux"
>>>> > uses an MCA interface that is not recognized (component MCA
>>>> v1.0.0 !=
>>>> > supported MCA v2.0.0) -- ignored
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > It looks like opal_init failed for some reason; your parallel
>>>> > process is
>>>> > likely to abort. There are many reasons that a parallel process
>>>> can
>>>> > fail during opal_init; some of which are due to configuration or
>>>> > environment problems. This failure appears to be an internal
>>>> failure;
>>>> > here's some additional information (which may only be relevant
>>>> to an
>>>> > Open MPI developer):
>>>> >
>>>> > opal_carto_base_select failed
>>>> > --> Returned value -13 instead of OPAL_SUCCESS
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found
>>>> in file
>>>> > ../../orte/runtime/orte_init.c at line 78
>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found
>>>> in file
>>>> > ../../orte/orted/orted_main.c at line 344
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > A daemon (pid 11629) died unexpectedly with status 243 while
>>>> > attempting
>>>> > to launch so we are aborting.
>>>> >
>>>> > There may be more information reported by the environment (see
>>>> above).
>>>> >
>>>> > This may be because the daemon was unable to find all the needed
>>>> > shared
>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>> > have the
>>>> > location of the shared libraries on the remote nodes and this
>>>> will
>>>> > automatically be forwarded to the remote nodes.
>>>> >
>>>> --------------------------------------------------------------------------
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > mpirun noticed that the job aborted, but has no info as to the
>>>> process
>>>> > that caused that situation.
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > mpirun: clean termination accomplished
>>>> >
>>>> >
>>>> > Lenny.
>>>> >
>>>> >
>>>> > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>>>> > >
>>>> > > Hi ,
>>>> > >
>>>> > > I am currently testing the process affinity capabilities of
>>>> > > openmpi, and I would like to know whether the rankfile behaviour
>>>> > > I describe below is normal or not.
>>>> > >
>>>> > > cat hostfile.0
>>>> > > r011n002 slots=4
>>>> > > r011n003 slots=4
>>>> > >
>>>> > > cat rankfile.0
>>>> > > rank 0=r011n002 slot=0
>>>> > > rank 1=r011n003 slot=1
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>> ##################################################################################
>>>> > >
>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname
>>>> ### OK
>>>> > > r011n002
>>>> > > r011n003
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>> ##################################################################################
>>>> > > but
>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -
>>>> n 1
>>>> > hostname
>>>> > > ### CRASHED
>>>> > > *
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > rmaps_rank_file.c at line 404
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > base/rmaps_base_map_job.c at line 87
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > base/plm_base_launch_support.c at line 77
>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter
>>>> in file
>>>> > > plm_rsh_module.c at line 985
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>> > attempting to
>>>> > > launch so we are aborting.
>>>> > >
>>>> > > There may be more information reported by the environment (see
>>>> > above).
>>>> > >
>>>> > > This may be because the daemon was unable to find all the
>>>> needed
>>>> > shared
>>>> > > libraries on the remote node. You may set your
>>>> LD_LIBRARY_PATH to
>>>> > have the
>>>> > > location of the shared libraries on the remote nodes and this
>>>> will
>>>> > > automatically be forwarded to the remote nodes.
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>> > process
>>>> > > that caused that situation.
>>>> > >
>>>> >
>>>> --------------------------------------------------------------------------
>>>> > > orterun: clean termination accomplished
>>>> > > *
>>>> > > It seems that the rankfile option is not propagated to the
>>>> > > second command line; there is no global understanding of the
>>>> > > ranking inside an mpirun command.
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>> ##################################################################################
>>>> > >
>>>> > > Assuming that, I tried to provide a rankfile to each command
>>>> > > line:
>>>> > >
>>>> > > cat rankfile.0
>>>> > > rank 0=r011n002 slot=0
>>>> > >
>>>> > > cat rankfile.1
>>>> > > rank 0=r011n003 slot=1
>>>> > >
>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf
>>>> > rankfile.1
>>>> > > -n 1 hostname ### CRASHED
>>>> > > *[r011n002:28778] *** Process received signal ***
>>>> > > [r011n002:28778] Signal: Segmentation fault (11)
>>>> > > [r011n002:28778] Signal code: Address not mapped (1)
>>>> > > [r011n002:28778] Failing at address: 0x34
>>>> > > [r011n002:28778] [ 0] [0xffffe600]
>>>> > > [r011n002:28778] [ 1]
>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>>>> > > [0x5557decd]
>>>> > > [r011n002:28778] [ 2]
>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>> > 0(orte_plm_base_launch_apps+0x117)
>>>> > > [0x555842a7]
>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>>>> > mca_plm_rsh.so
>>>> > > [0x556098c0]
>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>> > [0x804aa27]
>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>> > [0x804a022]
>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>>>> > [0x9f1dec]
>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>> > [0x8049f71]
>>>> > > [r011n002:28778] *** End of error message ***
>>>> > > Segmentation fault (core dumped)*
>>>> > >
>>>> > >
>>>> > >
>>>> > > I hope that I've found a bug, because it would be very important
>>>> > > for me to have this kind of capability:
>>>> > > launching a multi-exe mpirun command line and being able to bind
>>>> > > my exes and sockets together.
>>>> > >
>>>> > > Thanks in advance for your help
>>>> > >
>>>> > > Geoffroy