Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-07-15 02:09:23


Thanks Ralph, after playing with prefixes it worked.

I still have a problem running an appfile with a rankfile when I provide the
full hostlist on the mpirun command line rather than in the appfile.
Is this planned behaviour, or can it be fixed?

A working example:

$cat rankfile
rank 0=+n1 slot=0
rank 1=+n0 slot=0
$cat appfile
-np 1 -H witch1,witch2 ./hello_world
-np 1 -H witch1,witch2 ./hello_world

$mpirun -rf rankfile -app appfile
Hello world! I'm 1 of 2 on witch1
Hello world! I'm 0 of 2 on witch2

A NOT-working example:

$cat appfile
-np 1 -H witch1 ./hello_world
-np 1 -H witch2 ./hello_world
$mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated
hosts.
--------------------------------------------------------------------------
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
[dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001

On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Took a deeper look into this, and I think that your first guess was
> correct.
> When we changed hostfile and -host to be per-app-context options, it became
> necessary for you to put that info in the appfile itself. So try adding it
> there. What you would need in your appfile is the following:
>
> -np 1 -H witch1 hostname
> -np 1 -H witch2 hostname
>
> That should get you what you want.
> Ralph
>
> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
>
> No, it's not working as I expect, unless I expect something wrong.
> (Sorry for the long PATH, I needed to provide it.)
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 hostname
> witch1
> witch2
>
> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> -np 2 -H witch1,witch2 -app appfile
> dellix7
> dellix7
> $cat appfile
> -np 1 hostname
> -np 1 hostname
>
>
> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Run it without the appfile, just putting the apps on the cmd line - does
>> it work right then?
>>
>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
>>
>> Additional info:
>> I am running mpirun on hostA and providing a hostlist with hostB and hostC.
>> I expect each application to run on hostB and hostC, but I get all of them
>> running on hostA.
>> dellix7$cat appfile
>> -np 1 hostname
>> -np 1 hostname
>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
>> dellix7
>> dellix7
>> Thanks
>> Lenny.
>>
>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Strange - let me have a look at it later today. Probably something simple
>>> that another pair of eyes might spot.
>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
>>>
>>> This seems like a related problem:
>>> I can't use a rankfile with an appfile, even after all those fixes
>>> (working with trunk 1.4a1r21657).
>>> This is my case:
>>>
>>> $cat rankfile
>>> rank 0=+n1 slot=0
>>> rank 1=+n0 slot=0
>>> $cat appfile
>>> -np 1 hostname
>>> -np 1 hostname
>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host +n1 by index that is bigger than number of
>>> allocated hosts.
>>>
>>> --------------------------------------------------------------------------
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>>
>>>
>>> The problem is that the rankfile mapper tries to find an appropriate host
>>> in the partial (and not the full) hostlist.
>>>
>>> Any suggestions how to fix it?
>>>
>>> Thanks
>>> Lenny.
>>>
>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> Okay, I fixed this today too....r21219
>>>>
>>>>
>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
>>>>
>>>> Now there is another problem :)
>>>>>
>>>>> You can oversubscribe a node, at least by one task.
>>>>> If your hostfile and rankfile limit you to N procs, you can ask mpirun
>>>>> for N+1 and it will not be rejected, although in reality there will
>>>>> only be N tasks.
>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np 5"
>>>>> both work, but in both cases there are only 4 tasks. It isn't crucial,
>>>>> because there is no real oversubscription, but there is still a bug
>>>>> which could affect something in the future.
>>>>>
>>>>> --
>>>>> Anton Starikov.
>>>>>
>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
>>>>>
>>>>> This is fixed as of r21208.
>>>>>>
>>>>>> Thanks for reporting it!
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
>>>>>>
>>>>>>> Although removing this check solves the problem of having more slots
>>>>>>> in the rankfile than necessary, there is another problem.
>>>>>>>
>>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
>>>>>>>
>>>>>>>
>>>>>>> hostfile:
>>>>>>>
>>>>>>> node01
>>>>>>> node01
>>>>>>> node02
>>>>>>> node02
>>>>>>>
>>>>>>> rankfile:
>>>>>>>
>>>>>>> rank 0=node01 slot=1
>>>>>>> rank 1=node01 slot=0
>>>>>>> rank 2=node02 slot=1
>>>>>>> rank 3=node02 slot=0
>>>>>>>
>>>>>>> mpirun -np 4 ./something
>>>>>>>
>>>>>>> complains with:
>>>>>>>
>>>>>>> "There are not enough slots available in the system to satisfy the 4
>>>>>>> slots
>>>>>>> that were requested by the application"
>>>>>>>
>>>>>>> but "mpirun -np 3 ./something" will work; it works when you ask for
>>>>>>> one CPU less. The behavior is the same in every case (shared nodes,
>>>>>>> non-shared nodes, multi-node).
>>>>>>>
>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all
>>>>>>> affinities are set as requested in the rankfile; there is no
>>>>>>> oversubscription.
>>>>>>>
>>>>>>>
>>>>>>> Anton.
>>>>>>>
>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> Ah - thx for catching that, I'll remove that check. It is no longer
>>>>>>>> required.
>>>>>>>>
>>>>>>>> Thx!
>>>>>>>>
>>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <
>>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
>>>>>>>> According to the code, it does care.
>>>>>>>>
>>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>>>>>>>
>>>>>>>> ival = orte_rmaps_rank_file_value.ival;
>>>>>>>> if ( ival > (np-1) ) {
>>>>>>>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
>>>>>>>>                    ival, rankfile);
>>>>>>>>     rc = ORTE_ERR_BAD_PARAM;
>>>>>>>>     goto unlock;
>>>>>>>> }
>>>>>>>>
>>>>>>>> If I remember correctly, I used an array to map ranks, and since the
>>>>>>>> length of the array is NP, the maximum index must be less than NP; so
>>>>>>>> if you have a rank number >= NP, there is no place to put it inside
>>>>>>>> the array.
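[For readers of the archive, the bounds check quoted above can be sketched in isolation. This is a minimal, hypothetical illustration, not the actual ORTE source; the function name `check_rank_index` is invented. The idea is simply that a rank index must be a valid index into an array of length np:]

```c
/* Hypothetical sketch of the bounds check discussed above (not the
 * actual ORTE code). Ranks are mapped through an array of length np,
 * so a rank index ival is valid only when 0 <= ival <= np - 1.
 * Returns 0 when the rank fits, -1 (standing in for
 * ORTE_ERR_BAD_PARAM) when it does not. */
int check_rank_index(int ival, int np) {
    if (ival < 0 || ival > np - 1) {
        return -1; /* no slot in the length-np array for this rank */
    }
    return 0;
}
```

[With np = 1 only rank 0 passes the check, which matches the advice later in the thread to drop the "rank 1" entry when running with -n 1.]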
>>>>>>>>
>>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we
>>>>>>>> map the additional procs either byslot (default) or bynode (if you specify
>>>>>>>> that option). So the rankfile doesn't need to contain an entry for every
>>>>>>>> proc." - Correct point.
>>>>>>>>
>>>>>>>>
>>>>>>>> Lenny.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/5/09, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care
>>>>>>>> if the rankfile contains additional info - it only maps up to the
>>>>>>>> number of processes, and ignores anything beyond that number. So there
>>>>>>>> is no need to remove the additional info.
>>>>>>>>
>>>>>>>> Likewise, if you have more procs than the rankfile specifies, we map
>>>>>>>> the additional procs either byslot (default) or bynode (if you specify that
>>>>>>>> option). So the rankfile doesn't need to contain an entry for every proc.
>>>>>>>>
>>>>>>>> Just don't want to confuse folks.
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
>>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
>>>>>>>> Hi,
>>>>>>>> the maximum rank number must be less than np.
>>>>>>>> If np=1 then there is only rank 0 in the system, so rank 1 is
>>>>>>>> invalid.
>>>>>>>> Please remove "rank 1=node2 slot=*" from the rankfile.
>>>>>>>> Best regards,
>>>>>>>> Lenny.
>>>>>>>>
>>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <
>>>>>>>> geopignot_at_[hidden]> wrote:
>>>>>>>> Hi ,
>>>>>>>>
>>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>>>>>>>> command doesn't work:
>>>>>>>>
>>>>>>>> cat rankf:
>>>>>>>> rank 0=node1 slot=*
>>>>>>>> rank 1=node2 slot=*
>>>>>>>>
>>>>>>>> cat hostf:
>>>>>>>> node1 slots=2
>>>>>>>> node2 slots=2
>>>>>>>>
>>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1
>>>>>>>> hostname : --host node2 -n 1 hostname
>>>>>>>>
>>>>>>>> Error, invalid rank (1) in the rankfile (rankf)
>>>>>>>>
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>>> rmaps_rank_file.c at line 403
>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>>> base/rmaps_base_map_job.c at line 86
>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>>> base/plm_base_launch_support.c at line 86
>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>>>>>>> plm_rsh_module.c at line 1016
>>>>>>>>
>>>>>>>>
>>>>>>>> Ralph, could you tell me if my command syntax is correct or not ? if
>>>>>>>> not, give me the expected one ?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Geoffroy
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2009/4/30 Geoffroy Pignot <geopignot_at_[hidden]>
>>>>>>>>
>>>>>>>> Immediately Sir !!! :)
>>>>>>>>
>>>>>>>> Thanks again Ralph
>>>>>>>>
>>>>>>>> Geoffroy
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>> Message: 2
>>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>> Message-ID:
>>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f7e6_at_[hidden]>
>>>>>>>> Content-Type: text/plain; charset="iso-8859-1"
>>>>>>>>
>>>>>>>> I believe this is fixed now in our development trunk - you can
>>>>>>>> download any
>>>>>>>> tarball starting from last night and give it a try, if you like. Any
>>>>>>>> feedback would be appreciated.
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>> Ah now, I didn't say it -worked-, did I? :-)
>>>>>>>>
>>>>>>>> Clearly a bug exists in the program. I'll try to take a look at it
>>>>>>>> (if Lenny
>>>>>>>> doesn't get to it first), but it won't be until later in the week.
>>>>>>>>
>>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>>>>>>
>>>>>>>> I agree with you Ralph, and that's what I expect from Open MPI, but
>>>>>>>> my second example shows that it's not working:
>>>>>>>>
>>>>>>>> cat hostfile.0
>>>>>>>> r011n002 slots=4
>>>>>>>> r011n003 slots=4
>>>>>>>>
>>>>>>>> cat rankfile.0
>>>>>>>> rank 0=r011n002 slot=0
>>>>>>>> rank 1=r011n003 slot=1
>>>>>>>>
>>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>>>>>>> hostname
>>>>>>>> ### CRASHED
>>>>>>>>
>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > rmaps_rank_file.c at line 404
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > base/rmaps_base_map_job.c at line 87
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > base/plm_base_launch_support.c at line 77
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > plm_rsh_module.c at line 985
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>>>> > attempting to
>>>>>>>> > > launch so we are aborting.
>>>>>>>> > >
>>>>>>>> > > There may be more information reported by the environment (see
>>>>>>>> > above).
>>>>>>>> > >
>>>>>>>> > > This may be because the daemon was unable to find all the needed
>>>>>>>> > shared
>>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>>>>>> to
>>>>>>>> > have the
>>>>>>>> > > location of the shared libraries on the remote nodes and this
>>>>>>>> will
>>>>>>>> > > automatically be forwarded to the remote nodes.
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>>>>>> > process
>>>>>>>> > > that caused that situation.
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > orterun: clean termination accomplished
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Message: 4
>>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>> Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
>>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>>>>>> DelSp="yes"
>>>>>>>>
>>>>>>>> The rankfile cuts across the entire job - it isn't applied on an
>>>>>>>> app_context basis. So the ranks in your rankfile must correspond to
>>>>>>>> the eventual rank of each process in the cmd line.
>>>>>>>>
>>>>>>>> Unfortunately, that means you have to count ranks. In your case, you
>>>>>>>> only have four, so that makes life easier. Your rankfile would look
>>>>>>>> something like this:
>>>>>>>>
>>>>>>>> rank 0=r001n001 slot=0
>>>>>>>> rank 1=r001n002 slot=1
>>>>>>>> rank 2=r001n001 slot=1
>>>>>>>> rank 3=r001n002 slot=2
>>>>>>>>
>>>>>>>> HTH
>>>>>>>> Ralph
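[To make the rank counting concrete, Ralph's four-rank rankfile above lines up with Geoffroy's four-app-context command from the quoted message below roughly as follows. This is a sketch assembled from the thread, not a verified invocation; whether -rf combines cleanly with per-context -host was exactly the open question at the time:]

```shell
# Ranks are numbered across ALL app contexts, in command-line order:
# rank 0 -> first context, rank 1 -> second, and so on.
cat > rankfile <<'EOF'
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2
EOF

mpirun -rf rankfile \
    -n 1 -host r001n001 master.x options1 : \
    -n 1 -host r001n002 master.x options2 : \
    -n 1 -host r001n001 slave.x options3 : \
    -n 1 -host r001n002 slave.x options4
```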
>>>>>>>>
>>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>>>>>>>
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I agree that my examples are not very clear. What I want to do is
>>>>>>>> > launch a multi-executable application (masters-slaves) and benefit
>>>>>>>> > from processor affinity.
>>>>>>>> > Could you show me how to convert this command, using the -rf option
>>>>>>>> > (whatever the affinity is)?
>>>>>>>> >
>>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host
>>>>>>>> r001n002
>>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>>>>>>>> > host r001n002 slave.x options4
>>>>>>>> >
>>>>>>>> > Thanks for your help
>>>>>>>> >
>>>>>>>> > Geoffroy
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Message: 2
>>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>>>>>>> > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>>>>>> > To: Open MPI Users <users_at_[hidden]>
>>>>>>>> > Message-ID:
>>>>>>>> > <
>>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]>
>>>>>>>> > Content-Type: text/plain; charset="iso-8859-1"
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > The first "crash" is expected, since your rankfile has ranks 0 and
>>>>>>>> > 1 defined, while n=1, which means only rank 0 is present and can be
>>>>>>>> > allocated.
>>>>>>>> >
>>>>>>>> > NP must be greater than the largest rank in the rankfile.
>>>>>>>> >
>>>>>>>> > What exactly are you trying to do ?
>>>>>>>> >
>>>>>>>> > I tried to recreate your segv, but all I got was:
>>>>>>>> >
>>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>>>>>>>> > hostfile.0
>>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>>>>>>> > [witch19:30798] mca: base: component_find: paffinity
>>>>>>>> > "mca_paffinity_linux"
>>>>>>>> > uses an MCA interface that is not recognized (component MCA v1.0.0
>>>>>>>> !=
>>>>>>>> > supported MCA v2.0.0) -- ignored
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > It looks like opal_init failed for some reason; your parallel
>>>>>>>> > process is
>>>>>>>> > likely to abort. There are many reasons that a parallel process
>>>>>>>> can
>>>>>>>> > fail during opal_init; some of which are due to configuration or
>>>>>>>> > environment problems. This failure appears to be an internal
>>>>>>>> failure;
>>>>>>>> > here's some additional information (which may only be relevant to
>>>>>>>> an
>>>>>>>> > Open MPI developer):
>>>>>>>> >
>>>>>>>> > opal_carto_base_select failed
>>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>>>>>>>> file
>>>>>>>> > ../../orte/runtime/orte_init.c at line 78
>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>>>>>>>> file
>>>>>>>> > ../../orte/orted/orted_main.c at line 344
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while
>>>>>>>> > attempting
>>>>>>>> > to launch so we are aborting.
>>>>>>>> >
>>>>>>>> > There may be more information reported by the environment (see
>>>>>>>> above).
>>>>>>>> >
>>>>>>>> > This may be because the daemon was unable to find all the needed
>>>>>>>> > shared
>>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>>>>>>> > have the
>>>>>>>> > location of the shared libraries on the remote nodes and this will
>>>>>>>> > automatically be forwarded to the remote nodes.
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > mpirun noticed that the job aborted, but has no info as to the
>>>>>>>> process
>>>>>>>> > that caused that situation.
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > mpirun: clean termination accomplished
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Lenny.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>>>>>>>> > >
>>>>>>>> > > Hi ,
>>>>>>>> > >
>>>>>>>> > > I am currently testing the process affinity capabilities of
>>>>>>>> > > Open MPI, and I would like to know whether the rankfile behaviour
>>>>>>>> > > I describe below is normal or not.
>>>>>>>> > >
>>>>>>>> > > cat hostfile.0
>>>>>>>> > > r011n002 slots=4
>>>>>>>> > > r011n003 slots=4
>>>>>>>> > >
>>>>>>>> > > cat rankfile.0
>>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>>> > > rank 1=r011n003 slot=1
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>> ##################################################################################
>>>>>>>> > >
>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ###
>>>>>>>> OK
>>>>>>>> > > r011n002
>>>>>>>> > > r011n003
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>> ##################################################################################
>>>>>>>> > > but
>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>>>>>>> > hostname
>>>>>>>> > > ### CRASHED
>>>>>>>> > > *
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > rmaps_rank_file.c at line 404
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > base/rmaps_base_map_job.c at line 87
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > base/plm_base_launch_support.c at line 77
>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>>>>>>> file
>>>>>>>> > > plm_rsh_module.c at line 985
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>>>>>>> > attempting to
>>>>>>>> > > launch so we are aborting.
>>>>>>>> > >
>>>>>>>> > > There may be more information reported by the environment (see
>>>>>>>> > above).
>>>>>>>> > >
>>>>>>>> > > This may be because the daemon was unable to find all the needed
>>>>>>>> > shared
>>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>>>>>>>> to
>>>>>>>> > have the
>>>>>>>> > > location of the shared libraries on the remote nodes and this
>>>>>>>> will
>>>>>>>> > > automatically be forwarded to the remote nodes.
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the
>>>>>>>> > process
>>>>>>>> > > that caused that situation.
>>>>>>>> > >
>>>>>>>> >
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> > > orterun: clean termination accomplished
>>>>>>>> > > *
>>>>>>>> > > It seems that the rankfile option is not propagated to the second
>>>>>>>> > > command line; there is no global understanding of the ranking
>>>>>>>> > > inside an mpirun command.
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>> ##################################################################################
>>>>>>>> > >
>>>>>>>> > > Given that, I tried to provide a rankfile to each command line:
>>>>>>>> > >
>>>>>>>> > > cat rankfile.0
>>>>>>>> > > rank 0=r011n002 slot=0
>>>>>>>> > >
>>>>>>>> > > cat rankfile.1
>>>>>>>> > > rank 0=r011n003 slot=1
>>>>>>>> > >
>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf
>>>>>>>> > rankfile.1
>>>>>>>> > > -n 1 hostname ### CRASHED
>>>>>>>> > > *[r011n002:28778] *** Process received signal ***
>>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11)
>>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1)
>>>>>>>> > > [r011n002:28778] Failing at address: 0x34
>>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600]
>>>>>>>> > > [r011n002:28778] [ 1]
>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>>>>>> > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>>>>>>>> > > [0x5557decd]
>>>>>>>> > > [r011n002:28778] [ 2]
>>>>>>>> > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>>>>>>>> > 0(orte_plm_base_launch_apps+0x117)
>>>>>>>> > > [0x555842a7]
>>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>>>>>>>> > mca_plm_rsh.so
>>>>>>>> > > [0x556098c0]
>>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>>> > [0x804aa27]
>>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>>> > [0x804a022]
>>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>>>>>>>> > [0x9f1dec]
>>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>>>>>>>> > [0x8049f71]
>>>>>>>> > > [r011n002:28778] *** End of error message ***
>>>>>>>> > > Segmentation fault (core dumped)*
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > I hope that I've found a bug, because it would be very important
>>>>>>>> > > for me to have this kind of capability: launch a multi-executable
>>>>>>>> > > mpirun command line and be able to bind my executables and
>>>>>>>> > > sockets together.
>>>>>>>> > >
>>>>>>>> > > Thanks in advance for your help
>>>>>>>> > >
>>>>>>>> > > Geoffroy
>>>>>>>> > _______________________________________________
>>>>>>>> > users mailing list
>>>>>>>> > users_at_[hidden]
>>>>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>