Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-20 13:30:26


I'm afraid this is a more extensive rewrite than I had hoped, so the
revisions are unlikely to make it into 1.3.2. It looks like it will be
1.3.3 at the earliest.

Ralph

On Mon, Apr 20, 2009 at 7:50 AM, Lenny Verkhovsky <
lenny.verkhovsky_at_[hidden]> wrote:

> Me too, sorry, it definitely seems like a bug - probably an undefined
> variable somewhere in the code.
> I just never tested this code with such a "bizarre" command line :)
>
> Lenny.
>
> On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot <geopignot_at_[hidden]>wrote:
>
>> Thanks,
>>
>> I am not in a hurry but it would be nice if I could benefit from this
>> feature in the next release.
>> Regards
>>
>> Geoffroy
>>
>>
>>
>> 2009/4/20 <users-request_at_[hidden]>
>>
>>>
>>> Today's Topics:
>>>
>>> 1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>>> From: Ralph Castain <rhc_at_[hidden]>
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users <users_at_[hidden]>
>>>
>>> Honestly, I haven't had time to look at it yet - hopefully in the next
>>> couple of days...
>>>
>>> Sorry for the delay.
>>>
>>>
>>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>>
>>> > Do you have any news about this bug?
>>> > Thanks
>>> >
>>> > Geoffroy
>>> >
>>> >
>>> > Message: 1
>>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>>> > From: Ralph Castain <rhc_at_[hidden]>
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > To: Open MPI Users <users_at_[hidden]>
>>> >
>>> > Ah now, I didn't say it -worked-, did I? :-)
>>> >
>>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>>> > Lenny doesn't get to it first), but it won't be until later in the
>>> > week.
>>> >
>>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>> >
>>> > > I agree with you Ralph, and that's what I expect from Open MPI, but
>>> > > my second example shows that it's not working:
>>> > >
>>> > > cat hostfile.0
>>> > > r011n002 slots=4
>>> > > r011n003 slots=4
>>> > >
>>> > > cat rankfile.0
>>> > > rank 0=r011n002 slot=0
>>> > > rank 1=r011n003 slot=1
>>> > >
>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > > ### CRASHED
>>> > >
>>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > rmaps_rank_file.c at line 404
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > base/rmaps_base_map_job.c at line 87
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > base/plm_base_launch_support.c at line 77
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > plm_rsh_module.c at line 985
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>> > > > > attempting to launch so we are aborting.
>>> > > > >
>>> > > > > There may be more information reported by the environment (see above).
>>> > > > >
>>> > > > > This may be because the daemon was unable to find all the needed
>>> > > > > shared libraries on the remote node. You may set your LD_LIBRARY_PATH
>>> > > > > to have the location of the shared libraries on the remote nodes and
>>> > > > > this will automatically be forwarded to the remote nodes.
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun noticed that the job aborted, but has no info as to the
>>> > > > > process that caused that situation.
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun: clean termination accomplished
>>> > >
>>> > >
>>> > >
>>> > > Message: 4
>>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>>> > > From: Ralph Castain <rhc_at_[hidden]>
>>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > > To: Open MPI Users <users_at_[hidden]>
>>> > >
>>> > > The rankfile cuts across the entire job - it isn't applied on an
>>> > > app_context basis. So the ranks in your rankfile must correspond to
>>> > > the eventual rank of each process in the cmd line.
>>> > >
>>> > > Unfortunately, that means you have to count ranks. In your case, you
>>> > > only have four, so that makes life easier. Your rankfile would look
>>> > > something like this:
>>> > >
>>> > > rank 0=r001n001 slot=0
>>> > > rank 1=r001n002 slot=1
>>> > > rank 2=r001n001 slot=1
>>> > > rank 3=r001n002 slot=2
>>> > >
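>>> > > As an illustration (untested sketch; the rankfile name is arbitrary),
>>> > > your four app contexts from below would then go on a single command
>>> > > line against that one rankfile:
>>> > >
>>> > > mpirun -rf myrankfile -n 1 -host r001n001 master.x options1 : \
>>> > >        -n 1 -host r001n002 master.x options2 : \
>>> > >        -n 1 -host r001n001 slave.x options3 : \
>>> > >        -n 1 -host r001n002 slave.x options4
>>> > >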
>>> > > HTH
>>> > > Ralph
>>> > >
>>> > > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > I agree that my examples are not very clear. What I want to do is
>>> > > > to launch a multi-executable application (masters-slaves) and
>>> > > > benefit from processor affinity.
>>> > > > Could you show me how to convert this command using the -rf option
>>> > > > (whatever the affinity is)?
>>> > > >
>>> > > > mpirun -n 1 -host r001n001 master.x options1 : \
>>> > > >        -n 1 -host r001n002 master.x options2 : \
>>> > > >        -n 1 -host r001n001 slave.x options3 : \
>>> > > >        -n 1 -host r001n002 slave.x options4
>>> > > >
>>> > > > Thanks for your help
>>> > > >
>>> > > > Geoffroy
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > Message: 2
>>> > > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>> > > > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>>> > > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > > > To: Open MPI Users <users_at_[hidden]>
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > The first "crash" is expected, since your rankfile has ranks 0 and 1
>>> > > > defined while n=1, which means only rank 0 is present and can be
>>> > > > allocated.
>>> > > >
>>> > > > NP must be greater than the largest rank in the rankfile.
>>> > > >
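>>> > > > For example, with ranks 0 and 1 defined in rankfile.0, at least
>>> > > > two processes must be launched in total (a sketch using your
>>> > > > files):
>>> > > >
>>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname
>>> > > >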
>>> > > > What exactly are you trying to do?
>>> > > >
>>> > > > I tried to recreate your segv, but all I got was:
>>> > > >
>>> > > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun \
>>> > > >     --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : \
>>> > > >     -rf rankfile.1 -n 1 hostname
>>> > > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
>>> > > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> > > > supported MCA v2.0.0) -- ignored
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > > It looks like opal_init failed for some reason; your parallel
>>> > > > process is likely to abort. There are many reasons that a parallel
>>> > > > process can fail during opal_init; some of which are due to
>>> > > > configuration or environment problems. This failure appears to be
>>> > > > an internal failure; here's some additional information (which may
>>> > > > only be relevant to an Open MPI developer):
>>> > > >
>>> > > > opal_carto_base_select failed
>>> > > > --> Returned value -13 instead of OPAL_SUCCESS
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> > > > ../../orte/runtime/orte_init.c at line 78
>>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> > > > ../../orte/orted/orted_main.c at line 344
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > > A daemon (pid 11629) died unexpectedly with status 243 while
>>> > > > attempting to launch so we are aborting.
>>> > > >
>>> > > > There may be more information reported by the environment (see above).
>>> > > >
>>> > > > This may be because the daemon was unable to find all the needed
>>> > > > shared libraries on the remote node. You may set your LD_LIBRARY_PATH
>>> > > > to have the location of the shared libraries on the remote nodes and
>>> > > > this will automatically be forwarded to the remote nodes.
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > > mpirun noticed that the job aborted, but has no info as to the
>>> > > > process that caused that situation.
>>> > > >
>>> > > > --------------------------------------------------------------------------
>>> > > > mpirun: clean termination accomplished
>>> > > >
>>> > > >
>>> > > > Lenny.
>>> > > >
>>> > > >
>>> > > > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>>> > > > >
>>> > > > > Hi,
>>> > > > >
>>> > > > > I am currently testing the process affinity capabilities of
>>> > > > > Open MPI, and I would like to know whether the rankfile behaviour
>>> > > > > I describe below is normal or not.
>>> > > > >
>>> > > > > cat hostfile.0
>>> > > > > r011n002 slots=4
>>> > > > > r011n003 slots=4
>>> > > > >
>>> > > > > cat rankfile.0
>>> > > > > rank 0=r011n002 slot=0
>>> > > > > rank 1=r011n003 slot=1
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > >
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
>>> > > > > r011n002
>>> > > > > r011n003
>>> > > > >
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > > but
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > > > > ### CRASHED
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > rmaps_rank_file.c at line 404
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > base/rmaps_base_map_job.c at line 87
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > base/plm_base_launch_support.c at line 77
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > > > plm_rsh_module.c at line 985
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while
>>> > > > > attempting to launch so we are aborting.
>>> > > > >
>>> > > > > There may be more information reported by the environment (see above).
>>> > > > >
>>> > > > > This may be because the daemon was unable to find all the needed
>>> > > > > shared libraries on the remote node. You may set your LD_LIBRARY_PATH
>>> > > > > to have the location of the shared libraries on the remote nodes and
>>> > > > > this will automatically be forwarded to the remote nodes.
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun noticed that the job aborted, but has no info as to the
>>> > > > > process that caused that situation.
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun: clean termination accomplished
>>> > > > >
>>> > > > > It seems that the rankfile option is not propagated to the second
>>> > > > > command line; there is no global understanding of the ranking
>>> > > > > inside an mpirun command.
>>> > > > >
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > >
>>> > > > > Assuming that, I tried to provide a rankfile to each command line:
>>> > > > >
>>> > > > > cat rankfile.0
>>> > > > > rank 0=r011n002 slot=0
>>> > > > >
>>> > > > > cat rankfile.1
>>> > > > > rank 0=r011n003 slot=1
>>> > > > >
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : \
>>> > > > >        -rf rankfile.1 -n 1 hostname   ### CRASHED
>>> > > > > [r011n002:28778] *** Process received signal ***
>>> > > > > [r011n002:28778] Signal: Segmentation fault (11)
>>> > > > > [r011n002:28778] Signal code: Address not mapped (1)
>>> > > > > [r011n002:28778] Failing at address: 0x34
>>> > > > > [r011n002:28778] [ 0] [0xffffe600]
>>> > > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>>> > > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
>>> > > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
>>> > > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>>> > > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>>> > > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>>> > > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>>> > > > > [r011n002:28778] *** End of error message ***
>>> > > > > Segmentation fault (core dumped)
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > > I hope that I've found a bug, because it would be very important
>>> > > > > for me to have this kind of capability: launch a multi-executable
>>> > > > > mpirun command line and be able to bind my executables to sockets.
>>> > > > >
>>> > > > > Thanks in advance for your help
>>> > > > >
>>> > > > > Geoffroy
>>> >
>>> > ------------------------------
>>> >
>>> > Message: 2
>>> > Date: Tue, 14 Apr 2009 10:30:58 -0400
>>> > From: Prentice Bisbal <prentice_at_[hidden]>
>>> > Subject: Re: [OMPI users] PGI Fortran pthread support
>>> > To: Open MPI Users <users_at_[hidden]>
>>> >
>>> > Orion,
>>> >
>>> > I have no trouble getting thread support during configure with PGI 8.0-3.
>>> >
>>> > Are there any other compilers in your path before the PGI compilers?
>>> > Even if the PGI compilers come first, try specifying the PGI compilers
>>> > explicitly with these environment variables (bash syntax shown):
>>> >
>>> > export CC=pgcc
>>> > export CXX=pgCC
>>> > export F77=pgf77
>>> > export FC=pgf90
>>> >
>>> > Also check the values of CPPFLAGS and LDFLAGS, and make sure they are
>>> > correct for your PGI compilers.
>>> >
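>>> > For instance, a full configure sequence might look like this
>>> > (illustrative only - the installation prefix is just an example):
>>> >
>>> > export CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
>>> > ./configure --prefix=/opt/openmpi-1.3.1
>>> > make all install
>>> >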
>>> > --
>>> > Prentice
>>> >
>>> > Orion Poplawski wrote:
>>> > > Seeing the following when building Open MPI 1.3.1 on CentOS 5.3 with
>>> > > the PGI pgf90 8.0-5 Fortran compiler:
>>> > >
>>> > > checking if C compiler and POSIX threads work with -Kthread... no
>>> > > checking if C compiler and POSIX threads work with -kthread... no
>>> > > checking if C compiler and POSIX threads work with -pthread... yes
>>> > > checking if C++ compiler and POSIX threads work with -Kthread... no
>>> > > checking if C++ compiler and POSIX threads work with -kthread... no
>>> > > checking if C++ compiler and POSIX threads work with -pthread... yes
>>> > > checking if F77 compiler and POSIX threads work with -Kthread... no
>>> > > checking if F77 compiler and POSIX threads work with -kthread... no
>>> > > checking if F77 compiler and POSIX threads work with -pthread... no
>>> > > checking if F77 compiler and POSIX threads work with -pthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -mt... no
>>> > > checking if F77 compiler and POSIX threads work with -mthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -lpthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -llthread... no
>>> > > checking if F77 compiler and POSIX threads work with -lpthread... no
>>> > > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
>>> > > checking for PTHREAD_MUTEX_ERRORCHECK... yes
>>> > > checking for working POSIX threads package... no
>>> > > checking if C compiler and Solaris threads work... no
>>> > > checking if C++ compiler and Solaris threads work... no
>>> > > checking if F77 compiler and Solaris threads work... no
>>> > > checking for working Solaris threads package... no
>>> > > checking for type of thread support... none found
>>> > >