Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-04-20 09:50:29


Me too, sorry, it definitely seems like a bug; probably an undefined
variable somewhere in the code.
I just never tested this code with such a "bizarre" command line :)
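
For reference, the single-app-context form from Geoffroy's first test is the
one that works today; only the ':'-separated multi-app-context case hits this
bug. A minimal sketch, reusing his hostfile and rankfile:

    cat hostfile.0
    r011n002 slots=4
    r011n003 slots=4

    cat rankfile.0
    rank 0=r011n002 slot=0
    rank 1=r011n003 slot=1

    # one app context covering both ranks -- works
    mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname

    # two app contexts sharing the same rankfile -- currently crashes
    mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname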

Lenny.

On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot <geopignot_at_[hidden]>wrote:

> Thanks,
>
> I am not in a hurry but it would be nice if I could benefit from this
> feature in the next release.
> Regards
>
> Geoffroy
>
>
>
> 2009/4/20 <users-request_at_[hidden]>
>
>> Send users mailing list submissions to
>> users_at_[hidden]
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>> users-request_at_[hidden]
>>
>> You can reach the person managing the list at
>> users-owner_at_[hidden]
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>>
>>
>> Today's Topics:
>>
>> 1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>> From: Ralph Castain <rhc_at_[hidden]>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <6378A8C1-1763-4A1C-ABCA-C6FCC36054E6_at_[hidden]>
>>
>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> DelSp="yes"
>>
>> Honestly haven't had time to look at it yet - hopefully in the next
>> couple of days...
>>
>> Sorry for delay
>>
>>
>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>
>> > Do you have any news about this bug.
>> > Thanks
>> >
>> > Geoffroy
>> >
>> >
>> > Message: 1
>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>> > From: Ralph Castain <rhc_at_[hidden]>
>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > To: Open MPI Users <users_at_[hidden]>
>> > Message-ID: <BEB90473-0747-43BF-A1E9-6FA4E77778D7_at_[hidden]>
>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> > DelSp="yes"
>> >
>> > Ah now, I didn't say it -worked-, did I? :-)
>> >
>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>> > Lenny doesn't get to it first), but it won't be until later in the
>> > week.
>> >
>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>> >
>> > > I agree with you Ralph, and that's what I expect from openmpi but
>> > > my second example shows that it's not working
>> > >
>> > > cat hostfile.0
>> > > r011n002 slots=4
>> > > r011n003 slots=4
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > > rank 1=r011n003 slot=1
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>> > > hostname
>> > > ### CRASHED
>> > >
>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > rmaps_rank_file.c at line 404
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/rmaps_base_map_job.c at line 87
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/plm_base_launch_support.c at line 77
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > plm_rsh_module.c at line 985
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while
>> > > > attempting to
>> > > > > launch so we are aborting.
>> > > > >
>> > > > > There may be more information reported by the environment (see
>> > > > above).
>> > > > >
>> > > > > This may be because the daemon was unable to find all the needed
>> > > > shared
>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> > to
>> > > > have the
>> > > > > location of the shared libraries on the remote nodes and this
>> > will
>> > > > > automatically be forwarded to the remote nodes.
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > orterun noticed that the job aborted, but has no info as to the
>> > > > process
>> > > > > that caused that situation.
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > orterun: clean termination accomplished
>> > >
>> > >
>> > >
>> > > Message: 4
>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>> > > From: Ralph Castain <rhc_at_[hidden]>
>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > To: Open MPI Users <users_at_[hidden]>
>> > > Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
>> > > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> > > DelSp="yes"
>> > >
>> > > The rankfile cuts across the entire job - it isn't applied on an
>> > > app_context basis. So the ranks in your rankfile must correspond to
>> > > the eventual rank of each process in the cmd line.
>> > >
>> > > Unfortunately, that means you have to count ranks. In your case, you
>> > > only have four, so that makes life easier. Your rankfile would look
>> > > something like this:
>> > >
>> > > rank 0=r001n001 slot=0
>> > > rank 1=r001n002 slot=1
>> > > rank 2=r001n001 slot=1
>> > > rank 3=r001n002 slot=2
>> > >
>> > > HTH
>> > > Ralph
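>> > >
>> > > (For illustration, a sketch of the matching command line, assuming the
>> > > rankfile above is saved as rankfile.all and reusing the executables and
>> > > options from Geoffroy's original command; ranks are counted left to
>> > > right across the app contexts:)
>> > >
>> > > mpirun -rf rankfile.all \
>> > >     -n 1 -host r001n001 master.x options1 : \
>> > >     -n 1 -host r001n002 master.x options2 : \
>> > >     -n 1 -host r001n001 slave.x options3 : \
>> > >     -n 1 -host r001n002 slave.x options4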
>> > >
>> > > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I agree that my examples are not very clear. What I want to do is to
>> > > > launch a multi-exe application (masters-slaves) and benefit from the
>> > > > processor affinity.
>> > > > Could you show me how to convert this command, using the -rf option
>> > > > (whatever the affinity is)
>> > > >
>> > > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host
>> > r001n002
>> > > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
>> > > > host r001n002 slave.x options4
>> > > >
>> > > > Thanks for your help
>> > > >
>> > > > Geoffroy
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Message: 2
>> > > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> > > > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
>> > > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > > To: Open MPI Users <users_at_[hidden]>
>> > > > Message-ID:
>> > > > <453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]
>> > >
>> > > > Content-Type: text/plain; charset="iso-8859-1"
>> > > >
>> > > > Hi,
>> > > >
>> > > > The first "crash" is OK, since your rankfile has ranks 0 and 1
>> > > > defined,
>> > > > while n=1, which means only rank 0 is present and can be
>> > allocated.
>> > > >
>> > > > NP must be greater than the largest rank in the rankfile (ranks start at 0).
>> > > >
>> > > > What exactly are you trying to do ?
>> > > >
>> > > > I tried to recreate your seqv but all I got was
>> > > >
>> > > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>> > > > hostfile.0
>> > > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > > > [witch19:30798] mca: base: component_find: paffinity
>> > > > "mca_paffinity_linux"
>> > > > uses an MCA interface that is not recognized (component MCA
>> > > v1.0.0 !=
>> > > > supported MCA v2.0.0) -- ignored
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > It looks like opal_init failed for some reason; your parallel
>> > > > process is
>> > > > likely to abort. There are many reasons that a parallel process
>> > can
>> > > > fail during opal_init; some of which are due to configuration or
>> > > > environment problems. This failure appears to be an internal
>> > > failure;
>> > > > here's some additional information (which may only be relevant
>> > to an
>> > > > Open MPI developer):
>> > > >
>> > > > opal_carto_base_select failed
>> > > > --> Returned value -13 instead of OPAL_SUCCESS
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> > > file
>> > > > ../../orte/runtime/orte_init.c at line 78
>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> > > file
>> > > > ../../orte/orted/orted_main.c at line 344
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > A daemon (pid 11629) died unexpectedly with status 243 while
>> > > > attempting
>> > > > to launch so we are aborting.
>> > > >
>> > > > There may be more information reported by the environment (see
>> > > above).
>> > > >
>> > > > This may be because the daemon was unable to find all the needed
>> > > > shared
>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> > > > have the
>> > > > location of the shared libraries on the remote nodes and this will
>> > > > automatically be forwarded to the remote nodes.
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > mpirun noticed that the job aborted, but has no info as to the
>> > > process
>> > > > that caused that situation.
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > mpirun: clean termination accomplished
>> > > >
>> > > >
>> > > > Lenny.
>> > > >
>> > > >
>> > > > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>> > > > >
>> > > > > Hi ,
>> > > > >
>> > > > > I am currently testing the process affinity capabilities of
>> > > > openmpi and I
>> > > > > would like to know if the rankfile behaviour I will describe
>> > below
>> > > > is normal
>> > > > > or not ?
>> > > > >
>> > > > > cat hostfile.0
>> > > > > r011n002 slots=4
>> > > > > r011n003 slots=4
>> > > > >
>> > > > > cat rankfile.0
>> > > > > rank 0=r011n002 slot=0
>> > > > > rank 1=r011n003 slot=1
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> ##################################################################################
>> > > > >
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ###
>> > OK
>> > > > > r011n002
>> > > > > r011n003
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> ##################################################################################
>> > > > > but
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>> > > > hostname
>> > > > > ### CRASHED
>> > > > > *
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > rmaps_rank_file.c at line 404
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/rmaps_base_map_job.c at line 87
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/plm_base_launch_support.c at line 77
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > plm_rsh_module.c at line 985
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while
>> > > > attempting to
>> > > > > launch so we are aborting.
>> > > > >
>> > > > > There may be more information reported by the environment (see
>> > > > above).
>> > > > >
>> > > > > This may be because the daemon was unable to find all the needed
>> > > > shared
>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> > to
>> > > > have the
>> > > > > location of the shared libraries on the remote nodes and this
>> > will
>> > > > > automatically be forwarded to the remote nodes.
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > orterun noticed that the job aborted, but has no info as to the
>> > > > process
>> > > > > that caused that situation.
>> > > > >
>> > > >
>> > >
>> >
>> --------------------------------------------------------------------------
>> > > > > orterun: clean termination accomplished
>> > > > > *
>> > > > > It seems that the rankfile option is not propagated to the second
>> > > > > command line; there is no global understanding of the ranking
>> > > > > inside an mpirun command.
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> ##################################################################################
>> > > > >
>> > > > > Assuming that, I tried to provide a rankfile to each command
>> > > line:
>> > > > >
>> > > > > cat rankfile.0
>> > > > > rank 0=r011n002 slot=0
>> > > > >
>> > > > > cat rankfile.1
>> > > > > rank 0=r011n003 slot=1
>> > > > >
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf
>> > > > rankfile.1
>> > > > > -n 1 hostname ### CRASHED
>> > > > > *[r011n002:28778] *** Process received signal ***
>> > > > > [r011n002:28778] Signal: Segmentation fault (11)
>> > > > > [r011n002:28778] Signal code: Address not mapped (1)
>> > > > > [r011n002:28778] Failing at address: 0x34
>> > > > > [r011n002:28778] [ 0] [0xffffe600]
>> > > > > [r011n002:28778] [ 1]
>> > > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> > > > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>> > > > > [0x5557decd]
>> > > > > [r011n002:28778] [ 2]
>> > > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> > > > 0(orte_plm_base_launch_apps+0x117)
>> > > > > [0x555842a7]
>> > > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>> > > > mca_plm_rsh.so
>> > > > > [0x556098c0]
>> > > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > > > [0x804aa27]
>> > > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > > > [0x804a022]
>> > > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>> > > > [0x9f1dec]
>> > > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> > > > [0x8049f71]
>> > > > > [r011n002:28778] *** End of error message ***
>> > > > > Segmentation fault (core dumped)*
>> > > > >
>> > > > >
>> > > > >
>> > > > > I hope that I've found a bug because it would be very important
>> > > > for me to
>> > > > > have this kind of capability.
>> > > > > Launch a multiexe mpirun command line and be able to bind my
>> > exes
>> > > > and
>> > > > > sockets together.
>> > > > >
>> > > > > Thanks in advance for your help
>> > > > >
>> > > > > Geoffroy
>> > > > _______________________________________________
>> > > > users mailing list
>> > > > users_at_[hidden]
>> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > >
>> > > -------------- next part --------------
>> > > HTML attachment scrubbed and removed
>> > >
>> > > ------------------------------
>> > >
>> > > _______________________________________________
>> > > users mailing list
>> > > users_at_[hidden]
>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > >
>> > > End of users Digest, Vol 1202, Issue 2
>> > > **************************************
>> > >
>> > > _______________________________________________
>> > > users mailing list
>> > > users_at_[hidden]
>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > -------------- next part --------------
>> > HTML attachment scrubbed and removed
>> >
>> > ------------------------------
>> >
>> > Message: 2
>> > Date: Tue, 14 Apr 2009 10:30:58 -0400
>> > From: Prentice Bisbal <prentice_at_[hidden]>
>> > Subject: Re: [OMPI users] PGI Fortran pthread support
>> > To: Open MPI Users <users_at_[hidden]>
>> > Message-ID: <49E49E22.9040502_at_[hidden]>
>> > Content-Type: text/plain; charset=ISO-8859-1
>> >
>> > Orion,
>> >
>> > I have no trouble getting thread support during configure with PGI
>> > 8.0-3
>> >
>> > Are there any other compilers in your path before the PGI compilers?
>> > Even if the PGI compilers come first, try specifying the PGI compilers
>> > explicitly with these environment variables (bash syntax shown):
>> >
>> > export CC=pgcc
>> > export CXX=pgCC
>> > export F77=pgf77
>> > export FC=pgf90
>> >
>> > Also check the values of CPPFLAGS and LDFLAGS, and make sure they are
>> > correct for your PGI compilers.
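>> >
>> > (Equivalently, the compilers can be passed on the configure command line
>> > itself; a sketch, with an illustrative install prefix:)
>> >
>> > ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
>> >     --prefix=/opt/openmpi-1.3.1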
>> >
>> > --
>> > Prentice
>> >
>> > Orion Poplawski wrote:
>> > > Seeing the following building openmpi 1.3.1 on CentOS 5.3 with PGI
>> > pgf90
>> > > 8.0-5 fortran compiler:
>> > >
>> > > checking if C compiler and POSIX threads work with -Kthread... no
>> > > checking if C compiler and POSIX threads work with -kthread... no
>> > > checking if C compiler and POSIX threads work with -pthread... yes
>> > > checking if C++ compiler and POSIX threads work with -Kthread... no
>> > > checking if C++ compiler and POSIX threads work with -kthread... no
>> > > checking if C++ compiler and POSIX threads work with -pthread... yes
>> > > checking if F77 compiler and POSIX threads work with -Kthread... no
>> > > checking if F77 compiler and POSIX threads work with -kthread... no
>> > > checking if F77 compiler and POSIX threads work with -pthread... no
>> > > checking if F77 compiler and POSIX threads work with -pthreads... no
>> > > checking if F77 compiler and POSIX threads work with -mt... no
>> > > checking if F77 compiler and POSIX threads work with -mthreads... no
>> > > checking if F77 compiler and POSIX threads work with -lpthreads...
>> > no
>> > > checking if F77 compiler and POSIX threads work with -llthread... no
>> > > checking if F77 compiler and POSIX threads work with -lpthread... no
>> > > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
>> > > checking for PTHREAD_MUTEX_ERRORCHECK... yes
>> > > checking for working POSIX threads package... no
>> > > checking if C compiler and Solaris threads work... no
>> > > checking if C++ compiler and Solaris threads work... no
>> > > checking if F77 compiler and Solaris threads work... no
>> > > checking for working Solaris threads package... no
>> > > checking for type of thread support... none found
>> > >
>> >
>> >
>> >
>> > ------------------------------
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > End of users Digest, Vol 1202, Issue 4
>> > **************************************
>> >
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> -------------- next part --------------
>> HTML attachment scrubbed and removed
>>
>> ------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> End of users Digest, Vol 1208, Issue 2
>> **************************************
>>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>