Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Geoffroy Pignot (geopignot_at_[hidden])
Date: 2009-04-20 04:58:46


Do you have any news about this bug?
Thanks

Geoffroy

>
> Message: 1
> Date: Tue, 14 Apr 2009 07:57:44 -0600
> From: Ralph Castain <rhc_at_[hidden]>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <BEB90473-0747-43BF-A1E9-6FA4E77778D7_at_[hidden]>
> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> DelSp="yes"
>
> Ah now, I didn't say it -worked-, did I? :-)
>
> Clearly a bug exists in the program. I'll try to take a look at it (if
> Lenny doesn't get to it first), but it won't be until later in the week.
>
> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>
> > I agree with you, Ralph, and that's what I expect from Open MPI, but
> > my second example shows that it's not working:
> >
> > cat hostfile.0
> > r011n002 slots=4
> > r011n003 slots=4
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> > ### CRASHED
> >
> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > > --------------------------------------------------------------------------
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> > > > --------------------------------------------------------------------------
> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> > > > launch so we are aborting.
> > > >
> > > > There may be more information reported by the environment (see above).
> > > >
> > > > This may be because the daemon was unable to find all the needed shared
> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > > location of the shared libraries on the remote nodes and this will
> > > > automatically be forwarded to the remote nodes.
> > > > --------------------------------------------------------------------------
> > > > --------------------------------------------------------------------------
> > > > orterun noticed that the job aborted, but has no info as to the process
> > > > that caused that situation.
> > > > --------------------------------------------------------------------------
> > > > orterun: clean termination accomplished
> >
> >
> >
> > Message: 4
> > Date: Tue, 14 Apr 2009 06:55:58 -0600
> > From: Ralph Castain <rhc_at_[hidden]>
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users <users_at_[hidden]>
> > Message-ID: <F6290ADA-A196-43F0-A853-CBCB802D8D9C_at_[hidden]>
> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> > DelSp="yes"
> >
> > The rankfile cuts across the entire job - it isn't applied on an
> > app_context basis. So the ranks in your rankfile must correspond to
> > the eventual rank of each process in the cmd line.
> >
> > Unfortunately, that means you have to count ranks. In your case, you
> > only have four, so that makes life easier. Your rankfile would look
> > something like this:
> >
> > rank 0=r001n001 slot=0
> > rank 1=r001n002 slot=1
> > rank 2=r001n001 slot=1
> > rank 3=r001n002 slot=2
> >
> > HTH
> > Ralph
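
Ralph's point, that rankfile ranks are counted across the whole job in command-line order, can be sketched as a small shell helper. This is only an illustration: the name `gen_rankfile` is made up here, and the policy of giving each process the next unused slot on its host is an assumption (Ralph's example picks different slots), so adjust the slots to the binding you actually want.

```shell
# Hypothetical helper: print one rankfile line per process.
# Arguments are host names in command-line order, one per process
# (an app context with -n 2 on one host would list that host twice).
# Assumption: each process gets the next unused slot on its host.
gen_rankfile() {
    printf '%s\n' "$@" |
        awk '{ printf "rank %d=%s slot=%d\n", NR - 1, $1, used[$1]++ }'
}

# The four app contexts from the mpirun command quoted in this thread:
# master.x on r001n001, master.x on r001n002, then slave.x on each again.
gen_rankfile r001n001 r001n002 r001n001 r001n002
```

The output can be redirected to a file and passed to mpirun with -rf, with the -n values on the command line adding up to the number of lines in the file. Whether 1.3.1 accepts such a rankfile with several app contexts is the open question in this thread.
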
> >
> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> >
> > > Hi,
> > >
> > > I agree that my examples are not very clear. What I want to do is
> > > launch a multi-executable application (masters and slaves) and benefit
> > > from processor affinity.
> > > Could you show me how to convert this command to use the -rf option
> > > (whatever the affinity is)?
> > >
> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
> > > -host r001n002 slave.x options4
> > >
> > > Thanks for your help
> > >
> > > Geoffroy
> > >
> > >
> > >
> > >
> > >
> > > Message: 2
> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
> > > From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > > To: Open MPI Users <users_at_[hidden]>
> > > Message-ID:
> > > <453d39990904120826t2e1d1d33l7bb1fe3de65b5361_at_[hidden]>
> > > Content-Type: text/plain; charset="iso-8859-1"
> > >
> > > Hi,
> > >
> > > The first "crash" is OK, since your rankfile has ranks 0 and 1
> > > defined, while n=1, which means only rank 0 is present and can be
> > > allocated.
> > >
> > > NP must be greater than the largest rank in the rankfile (ranks start at 0).
> > >
> > > What exactly are you trying to do?
> > >
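
The relationship between NP and the ranks listed in a rankfile can be checked before launching. A minimal sketch, assuming a rankfile in the `rank N=host slot=S` form used in this thread; `check_rankfile` is a made-up name, and the total process count (the sum of all -n values on the command line) is passed in by hand.

```shell
# Hypothetical pre-launch check: every rank in the rankfile must be
# smaller than the total number of processes, since ranks start at 0.
# $1 = rankfile path, $2 = total process count (sum of all -n values).
check_rankfile() {
    awk -v np="$2" '
        $1 == "rank" {
            split($2, kv, "=")          # "1=r011n003" -> kv[1] = "1"
            if (kv[1] + 0 >= np + 0) {
                printf "invalid rank (%d) for %d process(es)\n", kv[1], np
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example with the contents of rankfile.0 from the original report,
# written to a temporary file so nothing in the current directory changes.
rf=$(mktemp)
printf 'rank 0=r011n002 slot=0\nrank 1=r011n003 slot=1\n' > "$rf"
check_rankfile "$rf" 2 && echo "rankfile fits 2 processes"
check_rankfile "$rf" 1 || echo "rankfile does not fit 1 process"
rm -f "$rf"
```

With a total of 2 the check passes, which is why the crash of `-n 1 hostname : -n 1 hostname` reported in this thread points at the mapper rather than at the rankfile itself.
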
> > > I tried to recreate your segfault, but all I got was:
> > >
> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> > > supported MCA v2.0.0) -- ignored
> > > --------------------------------------------------------------------------
> > > It looks like opal_init failed for some reason; your parallel process is
> > > likely to abort. There are many reasons that a parallel process can
> > > fail during opal_init; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > opal_carto_base_select failed
> > > --> Returned value -13 instead of OPAL_SUCCESS
> > > --------------------------------------------------------------------------
> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
> > > --------------------------------------------------------------------------
> > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
> > > to launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see above).
> > >
> > > This may be because the daemon was unable to find all the needed shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > location of the shared libraries on the remote nodes and this will
> > > automatically be forwarded to the remote nodes.
> > > --------------------------------------------------------------------------
> > > --------------------------------------------------------------------------
> > > mpirun noticed that the job aborted, but has no info as to the process
> > > that caused that situation.
> > > --------------------------------------------------------------------------
> > > mpirun: clean termination accomplished
> > >
> > >
> > > Lenny.
> > >
> > >
> > > On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I am currently testing the process affinity capabilities of Open MPI
> > > > and I would like to know whether the rankfile behaviour described
> > > > below is normal.
> > > >
> > > > cat hostfile.0
> > > > r011n002 slots=4
> > > > r011n003 slots=4
> > > >
> > > > cat rankfile.0
> > > > rank 0=r011n002 slot=0
> > > > rank 1=r011n003 slot=1
> > > >
> > > > ##################################################################################
> > > >
> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
> > > > r011n002
> > > > r011n003
> > > >
> > > > ##################################################################################
> > > > but
> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> > > > ### CRASHED
> > > > --------------------------------------------------------------------------
> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > > --------------------------------------------------------------------------
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> > > > --------------------------------------------------------------------------
> > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> > > > launch so we are aborting.
> > > >
> > > > There may be more information reported by the environment (see above).
> > > >
> > > > This may be because the daemon was unable to find all the needed shared
> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > > location of the shared libraries on the remote nodes and this will
> > > > automatically be forwarded to the remote nodes.
> > > > --------------------------------------------------------------------------
> > > > --------------------------------------------------------------------------
> > > > orterun noticed that the job aborted, but has no info as to the process
> > > > that caused that situation.
> > > > --------------------------------------------------------------------------
> > > > orterun: clean termination accomplished
> > > >
> > > > It seems that the rankfile option is not propagated to the second
> > > > command line; there is no global understanding of the ranking inside an
> > > > mpirun command.
> > > >
> > > >
> > > > ##################################################################################
> > > >
> > > > Assuming that, I tried to provide a rankfile to each command line:
> > > >
> > > > cat rankfile.0
> > > > rank 0=r011n002 slot=0
> > > >
> > > > cat rankfile.1
> > > > rank 0=r011n003 slot=1
> > > >
> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> > > > ### CRASHED
> > > > [r011n002:28778] *** Process received signal ***
> > > > [r011n002:28778] Signal: Segmentation fault (11)
> > > > [r011n002:28778] Signal code: Address not mapped (1)
> > > > [r011n002:28778] Failing at address: 0x34
> > > > [r011n002:28778] [ 0] [0xffffe600]
> > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> > > > [r011n002:28778] *** End of error message ***
> > > > Segmentation fault (core dumped)
> > > >
> > > >
> > > >
> > > > I hope that I've found a bug, because it would be very important for
> > > > me to have this kind of capability: launching a multi-executable
> > > > mpirun command line and being able to bind my executables and
> > > > sockets together.
> > > >
> > > > Thanks in advance for your help
> > > >
> > > > Geoffroy
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> ------------------------------
>
> Message: 2
> Date: Tue, 14 Apr 2009 10:30:58 -0400
> From: Prentice Bisbal <prentice_at_[hidden]>
> Subject: Re: [OMPI users] PGI Fortran pthread support
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <49E49E22.9040502_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Orion,
>
> I have no trouble getting thread support during configure with PGI 8.0-3.
>
> Are there any other compilers in your path before the PGI compilers?
> Even if the PGI compilers come first, try specifying the PGI compilers
> explicitly with these environment variables (bash syntax shown):
>
> export CC=pgcc
> export CXX=pgCC
> export F77=pgf77
> export FC=pgf90
>
> Also check the values of CPPFLAGS and LDFLAGS, and make sure they are
> correct for your PGI compilers.
>
> --
> Prentice
>
> Orion Poplawski wrote:
> > Seeing the following building openmpi 1.3.1 on CentOS 5.3 with PGI pgf90
> > 8.0-5 fortran compiler:
> >
> > checking if C compiler and POSIX threads work with -Kthread... no
> > checking if C compiler and POSIX threads work with -kthread... no
> > checking if C compiler and POSIX threads work with -pthread... yes
> > checking if C++ compiler and POSIX threads work with -Kthread... no
> > checking if C++ compiler and POSIX threads work with -kthread... no
> > checking if C++ compiler and POSIX threads work with -pthread... yes
> > checking if F77 compiler and POSIX threads work with -Kthread... no
> > checking if F77 compiler and POSIX threads work with -kthread... no
> > checking if F77 compiler and POSIX threads work with -pthread... no
> > checking if F77 compiler and POSIX threads work with -pthreads... no
> > checking if F77 compiler and POSIX threads work with -mt... no
> > checking if F77 compiler and POSIX threads work with -mthreads... no
> > checking if F77 compiler and POSIX threads work with -lpthreads... no
> > checking if F77 compiler and POSIX threads work with -llthread... no
> > checking if F77 compiler and POSIX threads work with -lpthread... no
> > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
> > checking for PTHREAD_MUTEX_ERRORCHECK... yes
> > checking for working POSIX threads package... no
> > checking if C compiler and Solaris threads work... no
> > checking if C++ compiler and Solaris threads work... no
> > checking if F77 compiler and Solaris threads work... no
> > checking for working Solaris threads package... no
> > checking for type of thread support... none found
> >
>
>
>