Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Geoffroy Pignot (geopignot_at_[hidden])
Date: 2009-07-15 08:25:11


Hi Lenny and Ralph,

I saw nothing about rankfile in the 1.3.3 press release. Does it mean that
the bug fixes are not included there?
Thanks

Geoffroy

2009/7/15 <users-request_at_[hidden]>

> Today's Topics:
>
> 1. Re: 1.3.1 -rf rankfile behaviour ?? (Lenny Verkhovsky)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 15 Jul 2009 15:08:39 +0300
> From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <users_at_[hidden]>
> Message-ID:
> <453d39990907150508j33ffa3f0qefc0801ea40f0d34_at_[hidden]>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Same result.
> I still suspect that the rankfile mapper looks for the node in the small
> hostlist provided by the line in the appfile, and not in the hostlist
> provided to mpirun on the HNP node.
> If my suspicion is right, your proposal should not work (and it does not),
> since in the appfile line I provide np=1 and 1 host, while the rankfile
> tries to allocate all ranks (np=2).
>
> $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338
>
> if (ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list,
>                               &num_slots, app, map->policy))) {
>
> node_list will be partial, built from the app context, and not the full
> list provided on the mpirun cmd line. If I don't provide a hostlist in the
> appfile line, mpirun uses the local host and not the hosts from the hostfile.
>
>
> Tell me if I am wrong to expect the following behavior:
>
> I provide to mpirun NP, full_hostlist, full_rankfile, appfile.
> I provide in the appfile only a partial NP and a partial hostlist,
> and it works.
>
> Currently, in order to get it working, I need to provide the full hostlist
> in the appfile, which is quite problematic.
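The suspicion above can be sketched in a few lines of Python. This is only an illustration, not the ORTE code: it assumes a relative rankfile entry such as `+n1` is treated as a 0-based index into whatever host list the mapper is handed, and `resolve_relative_host` is a hypothetical helper name.

```python
def resolve_relative_host(entry, hosts):
    """Resolve a rankfile host entry against a host list.

    Hypothetical helper: assumes '+nK' means "the K-th host (0-based)
    of the list the mapper sees"; anything else is an absolute name.
    """
    if not entry.startswith("+n"):
        return entry  # absolute host name, used as-is
    index = int(entry[2:])
    if index >= len(hosts):
        raise ValueError(
            f"Rankfile claimed host {entry} by index that is bigger than "
            f"number of allocated hosts ({len(hosts)})"
        )
    return hosts[index]


full_hostlist = ["witch1", "witch2"]  # hosts given to mpirun with -H
per_app_hostlist = ["witch1"]         # hosts given on one appfile line

# Against the full list, '+n1' resolves the way the rankfile intends.
print(resolve_relative_host("+n1", full_hostlist))

# Against the partial per-app list, the same entry is out of range.
try:
    resolve_relative_host("+n1", per_app_hostlist)
except ValueError as err:
    print(err)
```

If the mapper only ever sees the per-app list, every `+nK` entry with K >= 1 fails whenever the appfile line names a single host, which matches the error reported here.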
>
>
> $mpirun -np 2 -rf rankfile -app appfile
> --------------------------------------------------------------------------
> Rankfile claimed host +n1 by index that is bigger than number of allocated
> hosts.
> --------------------------------------------------------------------------
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>
>
> Thanks
> Lenny.
>
>
> On Wed, Jul 15, 2009 at 2:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > Try your "not working" example without the -H on the mpirun cmd line -
> > i.e., just use "mpirun -np 2 -rf rankfile -app appfile". Does that work?
> > Sorry to have to keep asking you to try things - I don't have a setup here
> > where I can test this as everything is RM managed.
> >
> >
> > On Jul 15, 2009, at 12:09 AM, Lenny Verkhovsky wrote:
> >
> >
> > Thanks Ralph, after playing with prefixes it worked,
> >
> > I still have a problem running an appfile with a rankfile when I provide
> > the full hostlist in the mpirun command and not in the appfile.
> > Is this the planned behaviour, or can it be fixed?
> >
> > See Working example:
> >
> > $cat rankfile
> > rank 0=+n1 slot=0
> > rank 1=+n0 slot=0
> > $cat appfile
> > -np 1 -H witch1,witch2 ./hello_world
> > -np 1 -H witch1,witch2 ./hello_world
> >
> > $mpirun -rf rankfile -app appfile
> > Hello world! I'm 1 of 2 on witch1
> > Hello world! I'm 0 of 2 on witch2
> >
> > See NOT working example:
> >
> > $cat appfile
> > -np 1 -H witch1 ./hello_world
> > -np 1 -H witch2 ./hello_world
> > $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
> >
> > --------------------------------------------------------------------------
> > Rankfile claimed host +n1 by index that is bigger than number of allocated
> > hosts.
> > --------------------------------------------------------------------------
> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> > [dellix7:16405] [[24080,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
> >
> >
> >
> > On Wed, Jul 15, 2009 at 6:58 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >
> >> Took a deeper look into this, and I think that your first guess was
> >> correct.
> >> When we changed hostfile and -host to be per-app-context options, it
> >> became necessary for you to put that info in the appfile itself. So try
> >> adding it there. What you would need in your appfile is the following:
> >>
> >> -np 1 -H witch1 hostname
> >> -np 1 -H witch2 hostname
> >>
> >> That should get you what you want.
> >> Ralph
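A rough Python sketch of what "per-app-context" means for an appfile: each line carries its own process count and host list, and nothing is inherited from the mpirun command line. `parse_appfile_line` is a hypothetical helper that only understands the `-np`/`-n` and `-H`/`-host`/`--host` options used in this thread; it is not how mpirun itself parses an appfile.

```python
import shlex


def parse_appfile_line(line):
    """Split one appfile line into process count, hosts, and command.

    Hypothetical helper: handles only -np/-n and -H/-host/--host and
    treats the first unrecognised token as the start of the command.
    """
    tokens = shlex.split(line)
    nprocs, hosts = None, []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-np", "-n") and i + 1 < len(tokens):
            nprocs = int(tokens[i + 1])
            i += 2
        elif tok in ("-H", "-host", "--host") and i + 1 < len(tokens):
            hosts = tokens[i + 1].split(",")
            i += 2
        else:
            break
    return {"np": nprocs, "hosts": hosts, "command": tokens[i:]}


for line in ["-np 1 -H witch1 hostname", "-np 1 -H witch2 hostname"]:
    print(parse_appfile_line(line))
```

A line without `-H` yields an empty host list in this sketch, which corresponds to the case in this thread where the processes ended up on the local host.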
> >>
> >> On Jul 14, 2009, at 10:29 AM, Lenny Verkhovsky wrote:
> >>
> >> No, it's not working as I expect, unless I expect something wrong.
> >> (sorry for the long PATH, I needed to provide it)
> >>
> >>
> >> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> >> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> >> -np 2 -H witch1,witch2 hostname
> >> witch1
> >> witch2
> >>
> >> $LD_LIBRARY_PATH=/hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/lib/
> >> /hpc/home/USERS/lennyb/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun
> >> -np 2 -H witch1,witch2 -app appfile
> >> dellix7
> >> dellix7
> >> $cat appfile
> >> -np 1 hostname
> >> -np 1 hostname
> >>
> >>
> >> On Tue, Jul 14, 2009 at 7:08 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>
> >>> Run it without the appfile, just putting the apps on the cmd line -
> >>> does it work right then?
> >>>
> >>> On Jul 14, 2009, at 10:04 AM, Lenny Verkhovsky wrote:
> >>>
> >>> additional info
> >>> I am running mpirun on hostA, and providing hostlist with hostB and
> >>> hostC.
> >>> I expect that each application would run on hostB and hostC, but I get
> >>> all of them running on hostA.
> >>> dellix7$cat appfile
> >>> -np 1 hostname
> >>> -np 1 hostname
> >>> dellix7$mpirun -np 2 -H witch1,witch2 -app appfile
> >>> dellix7
> >>> dellix7
> >>> Thanks
> >>> Lenny.
> >>>
> >>> On Tue, Jul 14, 2009 at 4:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>>
> >>>> Strange - let me have a look at it later today. Probably something
> >>>> simple that another pair of eyes might spot.
> >>>> On Jul 14, 2009, at 7:43 AM, Lenny Verkhovsky wrote:
> >>>>
> >>>> Seems like a related problem:
> >>>> I can't use a rankfile with an appfile, even after all those fixes
> >>>> (working with trunk 1.4a1r21657).
> >>>> This is my case :
> >>>>
> >>>> $cat rankfile
> >>>> rank 0=+n1 slot=0
> >>>> rank 1=+n0 slot=0
> >>>> $cat appfile
> >>>> -np 1 hostname
> >>>> -np 1 hostname
> >>>> $mpirun -np 2 -H witch1,witch2 -rf rankfile -app appfile
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> Rankfile claimed host +n1 by index that is bigger than number of
> >>>> allocated hosts.
> >>>> --------------------------------------------------------------------------
> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
> >>>>
> >>>>
> >>>> The problem is that the rankfile mapper tries to find an appropriate
> >>>> host in the partial (and not the full) hostlist.
> >>>>
> >>>> Any suggestions on how to fix it?
> >>>>
> >>>> Thanks
> >>>> Lenny.
> >>>>
> >>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> >>>>
> >>>>> Okay, I fixed this today too....r21219
> >>>>>
> >>>>>
> >>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
> >>>>>
> >>>>> Now there is another problem :)
> >>>>>>
> >>>>>> You can try to oversubscribe a node, at least by 1 task.
> >>>>>> If your hostfile and rankfile limit you to N procs, you can ask mpirun
> >>>>>> for N+1 and it will not be rejected,
> >>>>>> although in reality there will be N tasks.
> >>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np
> >>>>>> 5" both work, but in both cases there are only 4 tasks. It isn't
> >>>>>> crucial, because there is no real oversubscription, but there is still
> >>>>>> some bug which can affect something in the future.
> >>>>>>
> >>>>>> --
> >>>>>> Anton Starikov.
> >>>>>>
> >>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
> >>>>>>
> >>>>>> This is fixed as of r21208.
> >>>>>>>
> >>>>>>> Thanks for reporting it!
> >>>>>>> Ralph
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
> >>>>>>>
> >>>>>>> Although removing this check solves the problem of having more slots
> >>>>>>>> in the rankfile than necessary, there is another problem.
> >>>>>>>>
> >>>>>>>> If I set rmaps_base_no_oversubscribe=1, then with, for example:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> hostfile:
> >>>>>>>>
> >>>>>>>> node01
> >>>>>>>> node01
> >>>>>>>> node02
> >>>>>>>> node02
> >>>>>>>>
> >>>>>>>> rankfile:
> >>>>>>>>
> >>>>>>>> rank 0=node01 slot=1
> >>>>>>>> rank 1=node01 slot=0
> >>>>>>>> rank 2=node02 slot=1
> >>>>>>>> rank 3=node02 slot=0
> >>>>>>>>
> >>>>>>>> mpirun -np 4 ./something
> >>>>>>>>
> >>>>>>>> complains with:
> >>>>>>>>
> >>>>>>>> "There are not enough slots available in the system to satisfy the 4
> >>>>>>>> slots that were requested by the application"
> >>>>>>>>
> >>>>>>>> but "mpirun -np 3 ./something" works. It works when you ask for
> >>>>>>>> 1 CPU less, and the behavior is the same in every case (shared nodes,
> >>>>>>>> non-shared nodes, multi-node).
> >>>>>>>>
> >>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and all
> >>>>>>>> affinities are set as requested in the rankfile; there is no
> >>>>>>>> oversubscription.
> >>>>>>>>
> >>>>>>>> Anton.
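To make the expectation in this report concrete, here is a small Python sketch that counts the slots the hostfile above appears to offer. It rests on two assumptions that are not confirmed in this thread: a host listed once without `slots=` contributes one slot, and repeated lines for the same host add up. `count_slots` is a hypothetical helper, not part of Open MPI.

```python
from collections import Counter


def count_slots(hostfile_text):
    """Count slots per host from hostfile text.

    Assumptions (not taken from the Open MPI sources): a bare host line
    is worth one slot, 'slots=N' is worth N, and repeated lines for the
    same host accumulate.
    """
    slots = Counter()
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        host = fields[0]
        explicit = [f for f in fields[1:] if f.startswith("slots=")]
        slots[host] += int(explicit[0].split("=", 1)[1]) if explicit else 1
    return dict(slots)


hostfile = "node01\nnode01\nnode02\nnode02\n"
available = count_slots(hostfile)
requested = 4  # mpirun -np 4

print(available, "total:", sum(available.values()))
print("fits without oversubscription:", requested <= sum(available.values()))
```

Under these assumptions the hostfile offers exactly 4 slots, so a request for 4 processes should fit without oversubscription, which is why the "not enough slots" message looks like an off-by-one in the check rather than a real shortage.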
> >>>>>>>>
> >>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
> >>>>>>>>
> >>>>>>>> Ah - thx for catching that, I'll remove that check. It no longer is
> >>>>>>>>> required.
> >>>>>>>>>
> >>>>>>>>> Thx!
> >>>>>>>>>
> >>>>>>>>> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <
> >>>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
> >>>>>>>>> According to the code, it does care.
> >>>>>>>>>
> >>>>>>>>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
> >>>>>>>>>
> >>>>>>>>> ival = orte_rmaps_rank_file_value.ival;
> >>>>>>>>> if ( ival > (np-1) ) {
> >>>>>>>>>     orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true,
> >>>>>>>>>                    ival, rankfile);
> >>>>>>>>>     rc = ORTE_ERR_BAD_PARAM;
> >>>>>>>>>     goto unlock;
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> If I remember correctly, I used an array to map ranks, and since
> >>>>>>>>> the length of the array is NP, the maximum index must be less than NP.
> >>>>>>>>> So if a rank number is NP or larger, there is no place to put it
> >>>>>>>>> inside the array.
> >>>>>>>>>
> >>>>>>>>> "Likewise, if you have more procs than the rankfile specifies, we
> >>>>>>>>> map the additional procs either byslot (default) or bynode (if you
> >>>>>>>>> specify that option). So the rankfile doesn't need to contain an
> >>>>>>>>> entry for every proc." - Correct point.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Lenny.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 5/5/09, Ralph Castain <rhc_at_[hidden]> wrote:
> >>>>>>>>> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care
> >>>>>>>>> if the rankfile contains additional info - it only maps up to the
> >>>>>>>>> number of processes, and ignores anything beyond that number. So there
> >>>>>>>>> is no need to remove the additional info.
> >>>>>>>>>
> >>>>>>>>> Likewise, if you have more procs than the rankfile specifies, we
> >>>>>>>>> map the additional procs either byslot (default) or bynode (if you
> >>>>>>>>> specify that option). So the rankfile doesn't need to contain an
> >>>>>>>>> entry for every proc.
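The intended semantics described here can be sketched in Python as follows. This is a simplified model, not the mapper's implementation: `map_job` is a hypothetical function, only the byslot fallback is modelled, and it assumes that slots already taken by rankfile entries count against a host's total.

```python
def map_job(np, rankfile, slots_per_host):
    """Place np ranks: rankfile entries first, the rest byslot.

    Hypothetical model. rankfile maps rank -> host; slots_per_host maps
    host -> slot count, in hostfile order.
    """
    placement = {}
    # Rankfile entries for ranks beyond np are simply ignored.
    for rank, host in rankfile.items():
        if rank < np:
            placement[rank] = host

    # Assumption: rankfile placements consume slots on their hosts.
    free = dict(slots_per_host)
    for host in placement.values():
        if host in free:
            free[host] -= 1

    # byslot: fill the remaining slots of one host before the next.
    leftover = [r for r in range(np) if r not in placement]
    for host, count in free.items():
        while count > 0 and leftover:
            placement[leftover.pop(0)] = host
            count -= 1
    if leftover:
        raise ValueError("not enough slots for the remaining ranks")
    return placement


slots = {"node1": 2, "node2": 2}

# More rankfile entries than processes: the entry for rank 1 is ignored.
print(map_job(1, {0: "node1", 1: "node2"}, slots))

# Fewer rankfile entries than processes: ranks 1..3 are filled in byslot.
print(map_job(4, {0: "node2"}, slots))
```

A bynode variant would hand out the leftover ranks round-robin across hosts instead of filling one host at a time.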
> >>>>>>>>>
> >>>>>>>>> Just don't want to confuse folks.
> >>>>>>>>> Ralph
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
> >>>>>>>>> lenny.verkhovsky_at_[hidden]> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>> the maximum rank number must be less than np.
> >>>>>>>>> if np=1 then there is only rank 0 in the system, so rank 1 is
> >>>>>>>>> invalid.
> >>>>>>>>> please remove "rank 1=node2 slot=*" from the rankfile
> >>>>>>>>> Best regards,
> >>>>>>>>> Lenny.
> >>>>>>>>>
> >>>>>>>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot <
> >>>>>>>>> geopignot_at_[hidden]> wrote:
> >>>>>>>>> Hi ,
> >>>>>>>>>
> >>>>>>>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
> >>>>>>>>> command doesn't work
> >>>>>>>>>
> >>>>>>>>> cat rankf:
> >>>>>>>>> rank 0=node1 slot=*
> >>>>>>>>> rank 1=node2 slot=*
> >>>>>>>>>
> >>>>>>>>> cat hostf:
> >>>>>>>>> node1 slots=2
> >>>>>>>>> node2 slots=2
> >>>>>>>>>
> >>>>>>>>> mpirun --rankfile rankf --hostfile hostf --host node1 -n 1
> >>>>>>>>> hostname : --host node2 -n 1 hostname
> >>>>>>>>>
> >>>>>>>>> Error, invalid rank (1) in the rankfile (rankf)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --------------------------------------------------------------------------
> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
> >>>>>>>>> file rmaps_rank_file.c at line 403
> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
> >>>>>>>>> file base/rmaps_base_map_job.c at line 86
> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
> >>>>>>>>> file base/plm_base_launch_support.c at line 86
> >>>>>>>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in
> >>>>>>>>> file plm_rsh_module.c at line 1016
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Ralph, could you tell me if my command syntax is correct or not?
> >>>>>>>>> If not, could you give me the expected one?
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>>
> >>>>>>>>> Geoffroy
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2009/4/30 Geoffroy Pignot <geopignot_at_[hidden]>
> >>>>>>>>>
> >>>>>>>>> Immediately Sir !!! :)
> >>>>>>>>>
> >>>>>>>>> Thanks again Ralph
> >>>>>>>>>
> >>>>>>>>> Geoffroy
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ------------------------------
> >>>>>>>>>
> >>>>>>>>> Message: 2
> >>>>>>>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
> >>>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
> >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>>>>>>>> To: Open MPI Users <users_at_[hidden]>
> >>>>>>>>> Message-ID:
> >>>>>>>>> <71d2d8cc0904300545v61a42fe1k50086d2704d0f7e6_at_[hidden]>
> >>>>>>>>> Content-Type: text/plain; charset="iso-8859-1"
> >>>>>>>>>
> >>>>>>>>> I believe this is fixed now in our development trunk - you can
> >>>>>>>>> download any
> >>>>>>>>> tarball starting from last night and give it a try, if you like.
> >>>>>>>>> Any
> >>>>>>>>> feedback would be appreciated.
> >>>>>>>>>
> >>>>>>>>> Ralph
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
> >>>>>>>>>
> >>>>>>>>> Ah now, I didn't say it -worked-, did I? :-)
> >>>>>>>>>
> >>>>>>>>> Clearly a bug exists in the program. I'll try to take a look at it
> >>>>>>>>> (if Lenny doesn't get to it first), but it won't be until later in
> >>>>>>>>> the week.
> >>>>>>>>>
> >>>>>>>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
> >>>>>>>>>
> >>>>>>>>> I agree with you Ralph, and that's what I expect from openmpi, but
> >>>>>>>>> my second example shows that it's not working
> >>>>>>>>>
> >>>>>>>>> cat hostfile.0
> >>>>>>>>> r011n002 slots=4
> >>>>>>>>> r011n003 slots=4
> >>>>>>>>>
> >>>>>>>>> cat rankfile.0
> >>>>>>>>> rank 0=r011n002 slot=0
> >>>>>>>>> rank 1=r011n003 slot=1
> >>>>>>>>>
> >>>>>>>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> >>>>>>>>> hostname
> >>>>>>>>> ### CRASHED
> >>>>>>>>>
> >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > rmaps_rank_file.c at line 404
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > base/rmaps_base_map_job.c at line 87
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > base/plm_base_launch_support.c at line 77
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > plm_rsh_module.c at line 985
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> >>>>>>>>> > > launch so we are aborting.
> >>>>>>>>> > >
> >>>>>>>>> > > There may be more information reported by the environment (see above).
> >>>>>>>>> > >
> >>>>>>>>> > > This may be because the daemon was unable to find all the needed shared
> >>>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>>>>>>>> > > location of the shared libraries on the remote nodes and this will
> >>>>>>>>> > > automatically be forwarded to the remote nodes.
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the process
> >>>>>>>>> > > that caused that situation.
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > orterun: clean termination accomplished
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Message: 4
> >>>>>>>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
> >>>>>>>>> From: Ralph Castain <rhc_at_[hidden]>
> >>>>>>>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>>>>>>>> To: Open MPI Users <users_at_[hidden]>
> >>>>>>>>> Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
> >>>>>>>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> >>>>>>>>> DelSp="yes"
> >>>>>>>>>
> >>>>>>>>> The rankfile cuts across the entire job - it isn't applied on an
> >>>>>>>>> app_context basis. So the ranks in your rankfile must correspond to
> >>>>>>>>> the eventual rank of each process in the cmd line.
> >>>>>>>>>
> >>>>>>>>> Unfortunately, that means you have to count ranks. In your case, you
> >>>>>>>>> only have four, so that makes life easier. Your rankfile would look
> >>>>>>>>> something like this:
> >>>>>>>>>
> >>>>>>>>> rank 0=r001n001 slot=0
> >>>>>>>>> rank 1=r001n002 slot=1
> >>>>>>>>> rank 2=r001n001 slot=1
> >>>>>>>>> rank 3=r001n002 slot=2
> >>>>>>>>>
> >>>>>>>>> HTH
> >>>>>>>>> Ralph
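Counting ranks by hand gets tedious for longer command lines, so here is a hedged Python sketch of the bookkeeping. It assumes ranks are numbered in the order the app contexts appear on the command line, as described above; `rankfile_for` is a hypothetical helper, and the slot policy (next unused slot on each host, starting at 0) is an arbitrary choice for illustration that differs from the slot numbers in the example rankfile above.

```python
def rankfile_for(app_contexts):
    """Build rankfile lines for a colon-separated mpirun command line.

    Hypothetical helper. app_contexts is a list of (nprocs, host) pairs
    in command-line order; ranks are counted across the whole job.
    """
    lines = []
    rank = 0
    next_slot = {}  # next unused slot per host (illustrative policy)
    for nprocs, host in app_contexts:
        for _ in range(nprocs):
            slot = next_slot.get(host, 0)
            lines.append(f"rank {rank}={host} slot={slot}")
            next_slot[host] = slot + 1
            rank += 1
    return lines


# master.x on r001n001, master.x on r001n002,
# slave.x on r001n001, slave.x on r001n002
contexts = [(1, "r001n001"), (1, "r001n002"), (1, "r001n001"), (1, "r001n002")]
print("\n".join(rankfile_for(contexts)))
```

The resulting file would then be passed once for the whole job with -rf, not once per app context.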
> >>>>>>>>>
> >>>>>>>>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> >>>>>>>>>
> >>>>>>>>> > Hi,
> >>>>>>>>> >
> >>>>>>>>> > I agree that my examples are not very clear. What I want to do is
> >>>>>>>>> > to launch a multiexes application (masters-slaves) and benefit from
> >>>>>>>>> > the processor affinity.
> >>>>>>>>> > Could you show me how to convert this command, using the -rf option
> >>>>>>>>> > (whatever the affinity is)
> >>>>>>>>> >
> >>>>>>>>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
> >>>>>>>>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
> >>>>>>>>> > -host r001n002 slave.x options4
> >>>>>>>>> >
> >>>>>>>>> > Thanks for your help
> >>>>>>>>> >
> >>>>>>>>> > Geoffroy
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> > Message: 2
> >>>>>>>>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
> >>>>>>>>> > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
> >>>>>>>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> >>>>>>>>> > To: Open MPI Users <users_at_[hidden]>
> >>>>>>>>> > Message-ID:
> >>>>>>>>> > <
> >>>>>>>>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]>
> >>>>>>>>> > Content-Type: text/plain; charset="iso-8859-1"
> >>>>>>>>> >
> >>>>>>>>> > Hi,
> >>>>>>>>> >
> >>>>>>>>> > The first "crash" is OK, since your rankfile has ranks 0 and 1
> >>>>>>>>> > defined, while n=1, which means only rank 0 is present and can be
> >>>>>>>>> > allocated.
> >>>>>>>>> >
> >>>>>>>>> > NP must be > the largest rank in the rankfile (ranks start at 0).
> >>>>>>>>> >
> >>>>>>>>> > What exactly are you trying to do?
> >>>>>>>>> >
> >>>>>>>>> > I tried to recreate your segv but all I got was
> >>>>>>>>> >
> >>>>>>>>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
> >>>>>>>>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> >>>>>>>>> > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> >>>>>>>>> > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> >>>>>>>>> > supported MCA v2.0.0) -- ignored
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > It looks like opal_init failed for some reason; your parallel process is
> >>>>>>>>> > likely to abort. There are many reasons that a parallel process can
> >>>>>>>>> > fail during opal_init; some of which are due to configuration or
> >>>>>>>>> > environment problems. This failure appears to be an internal failure;
> >>>>>>>>> > here's some additional information (which may only be relevant to an
> >>>>>>>>> > Open MPI developer):
> >>>>>>>>> >
> >>>>>>>>> > opal_carto_base_select failed
> >>>>>>>>> > --> Returned value -13 instead of OPAL_SUCCESS
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> >>>>>>>>> > ../../orte/runtime/orte_init.c at line 78
> >>>>>>>>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> >>>>>>>>> > ../../orte/orted/orted_main.c at line 344
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > A daemon (pid 11629) died unexpectedly with status 243 while attempting
> >>>>>>>>> > to launch so we are aborting.
> >>>>>>>>> >
> >>>>>>>>> > There may be more information reported by the environment (see above).
> >>>>>>>>> >
> >>>>>>>>> > This may be because the daemon was unable to find all the needed shared
> >>>>>>>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>>>>>>>> > location of the shared libraries on the remote nodes and this will
> >>>>>>>>> > automatically be forwarded to the remote nodes.
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > mpirun noticed that the job aborted, but has no info as to the process
> >>>>>>>>> > that caused that situation.
> >>>>>>>>> > --------------------------------------------------------------------------
> >>>>>>>>> > mpirun: clean termination accomplished
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> > Lenny.
> >>>>>>>>> >
> >>>>>>>>> >
> >>>>>>>>> > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
> >>>>>>>>> > >
> >>>>>>>>> > > Hi ,
> >>>>>>>>> > >
> >>>>>>>>> > > I am currently testing the process affinity capabilities of openmpi
> >>>>>>>>> > > and I would like to know if the rankfile behaviour I will describe
> >>>>>>>>> > > below is normal or not?
> >>>>>>>>> > >
> >>>>>>>>> > > cat hostfile.0
> >>>>>>>>> > > r011n002 slots=4
> >>>>>>>>> > > r011n003 slots=4
> >>>>>>>>> > >
> >>>>>>>>> > > cat rankfile.0
> >>>>>>>>> > > rank 0=r011n002 slot=0
> >>>>>>>>> > > rank 1=r011n003 slot=1
> >>>>>>>>> > >
> >>>>>>>>> > >
> >>>>>>>>> > > ##################################################################################
> >>>>>>>>> > >
> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
> >>>>>>>>> > > r011n002
> >>>>>>>>> > > r011n003
> >>>>>>>>> > >
> >>>>>>>>> > > ##################################################################################
> >>>>>>>>> > > but
> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> >>>>>>>>> > > ### CRASHED
> >>>>>>>>> > > *
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > rmaps_rank_file.c at line 404
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > base/rmaps_base_map_job.c at line 87
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > base/plm_base_launch_support.c at line 77
> >>>>>>>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>>>>>>> > > plm_rsh_module.c at line 985
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> >>>>>>>>> > > launch so we are aborting.
> >>>>>>>>> > >
> >>>>>>>>> > > There may be more information reported by the environment (see above).
> >>>>>>>>> > >
> >>>>>>>>> > > This may be because the daemon was unable to find all the needed shared
> >>>>>>>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> >>>>>>>>> > > location of the shared libraries on the remote nodes and this will
> >>>>>>>>> > > automatically be forwarded to the remote nodes.
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > orterun noticed that the job aborted, but has no info as to the process
> >>>>>>>>> > > that caused that situation.
> >>>>>>>>> > > --------------------------------------------------------------------------
> >>>>>>>>> > > orterun: clean termination accomplished
> >>>>>>>>> > > *
> >>>>>>>>> > > It seems that the rankfile option is not propagated to the second
> >>>>>>>>> > > command line; there is no global understanding of the ranking inside
> >>>>>>>>> > > an mpirun command.
> >>>>>>>>> > >
> >>>>>>>>> > >
> >>>>>>>>> > > ##################################################################################
> >>>>>>>>> > >
> >>>>>>>>> > > Assuming that, I tried to provide a rankfile to each command line:
> >>>>>>>>> > >
> >>>>>>>>> > > cat rankfile.0
> >>>>>>>>> > > rank 0=r011n002 slot=0
> >>>>>>>>> > >
> >>>>>>>>> > > cat rankfile.1
> >>>>>>>>> > > rank 0=r011n003 slot=1
> >>>>>>>>> > >
> >>>>>>>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
> >>>>>>>>> > > -n 1 hostname ### CRASHED
> >>>>>>>>> > > *[r011n002:28778] *** Process received signal ***
> >>>>>>>>> > > [r011n002:28778] Signal: Segmentation fault (11)
> >>>>>>>>> > > [r011n002:28778] Signal code: Address not mapped (1)
> >>>>>>>>> > > [r011n002:28778] Failing at address: 0x34
> >>>>>>>>> > > [r011n002:28778] [ 0] [0xffffe600]
> >>>>>>>>> > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> >>>>>>>>> > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> >>>>>>>>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> >>>>>>>>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> >>>>>>>>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> >>>>>>>>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> >>>>>>>>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> >>>>>>>>> > > [r011n002:28778] *** End of error message ***
> >>>>>>>>> > > Segmentation fault (core dumped)*
> >>>>>>>>> > >
> >>>>>>>>> > >
> >>>>>>>>> > >
> >>>>>>>>> > > I hope that I've found a bug, because it would be very important for
> >>>>>>>>> > > me to have this kind of capability:
> >>>>>>>>> > > launch a multiexe mpirun command line and be able to bind my exes
> >>>>>>>>> > > and sockets together.
> >>>>>>>>> > >
> >>>>>>>>> > > Thanks in advance for your help
> >>>>>>>>> > >
> >>>>>>>>> > > Geoffroy
> >>>>>>>>> > _______________________________________________
> >>>>>>>>> > users mailing list
> >>>>>>>>> > users_at_[hidden]
> >>>>>>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>> End of users Digest, Vol 1202, Issue 2
> >>>>>>>>> **************************************
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> -------------- next part --------------
> >>>>>>>>> HTML attachment scrubbed and removed
> >>>>>>>>>
> >>>>>>>>> ------------------------------
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>> End of users Digest, Vol 1218, Issue 2
> >>>>>>>>> **************************************
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> users_at_[hidden]
> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> users mailing list
> >>>>>>>> users_at_[hidden]
> >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> users mailing list
> >>>>>>> users_at_[hidden]
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> users_at_[hidden]
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> users_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 1289, Issue 3
> **************************************
>