Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-04-12 11:26:35


Hi,

The first "crash" is expected: your rankfile defines ranks 0 and 1, but
with -n 1 only rank 0 exists and can be allocated.

Since ranks are numbered from 0, NP must be greater than the largest rank
in the rankfile.
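
To make that rule concrete, here is a small sketch of the check. It is not
Open MPI's actual rankfile parser; the function names and the simplified
"rank N=host slot=S" pattern are assumptions made for illustration only.

```python
import re

# Simplified pattern for lines such as "rank 1=r011n003 slot=1".
# Assumption: one rank per line, with no other rankfile syntax handled.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)\s*$")


def parse_rankfile(text):
    """Return {rank: (host, slot)} for every 'rank N=host slot=S' line."""
    ranks = {}
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            ranks[int(match.group(1))] = (match.group(2), match.group(3))
    return ranks


def min_np(ranks):
    """Smallest -n value that covers every rank (ranks start at 0)."""
    return max(ranks) + 1 if ranks else 0


def invalid_ranks(ranks, np):
    """Ranks in the rankfile that do not exist when only np processes run."""
    return [rank for rank in sorted(ranks) if rank >= np]


if __name__ == "__main__":
    rankfile = "rank 0=r011n002 slot=0\nrank 1=r011n003 slot=1\n"
    ranks = parse_rankfile(rankfile)
    print("minimum -n:", min_np(ranks))
    print("invalid ranks with -n 1:", invalid_ranks(ranks, 1))
```

For the two-line rankfile.0 quoted below, the minimum is 2, so with -n 1
rank 1 has no process to be mapped to.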

What exactly are you trying to do?

I tried to reproduce your segv, but all I got was:

~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
-rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
[witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
uses an MCA interface that is not recognized (component MCA v1.0.0 !=
supported MCA v2.0.0) -- ignored
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_carto_base_select failed
  --> Returned value -13 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/runtime/orte_init.c at line 78
[witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
../../orte/orted/orted_main.c at line 344
--------------------------------------------------------------------------
A daemon (pid 11629) died unexpectedly with status 243 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

Lenny.

On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
>
> Hi,
>
> I am currently testing the process affinity capabilities of Open MPI and
> would like to know whether the rankfile behaviour described below is
> normal.
>
> cat hostfile.0
> r011n002 slots=4
> r011n003 slots=4
>
> cat rankfile.0
> rank 0=r011n002 slot=0
> rank 1=r011n003 slot=1
>
>
> ##################################################################################
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
> r011n002
> r011n003
>
>
> ##################################################################################
> but
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> ### CRASHED
> --------------------------------------------------------------------------
> Error, invalid rank (1) in the rankfile (rankfile.0)
> --------------------------------------------------------------------------
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> rmaps_rank_file.c at line 404
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/rmaps_base_map_job.c at line 87
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/plm_base_launch_support.c at line 77
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> plm_rsh_module.c at line 985
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> orterun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> orterun: clean termination accomplished
> It seems that the rankfile option is not propagated to the second command
> line; there is no global understanding of the ranking inside an mpirun
> command.
>
>
> ##################################################################################
>
> Assuming that, I tried to provide a rankfile for each command line:
>
> cat rankfile.0
> rank 0=r011n002 slot=0
>
> cat rankfile.1
> rank 0=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
> -n 1 hostname ### CRASHED
> [r011n002:28778] *** Process received signal ***
> [r011n002:28778] Signal: Segmentation fault (11)
> [r011n002:28778] Signal code: Address not mapped (1)
> [r011n002:28778] Failing at address: 0x34
> [r011n002:28778] [ 0] [0xffffe600]
> [r011n002:28778] [ 1]
> /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d)
> [0x5557decd]
> [r011n002:28778] [ 2]
> /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117)
> [0x555842a7]
> [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so
> [0x556098c0]
> [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> [r011n002:28778] *** End of error message ***
> Segmentation fault (core dumped)
>
>
>
> I hope that I've found a bug, because it is very important for me to
> have this kind of capability:
> launching a multi-executable mpirun command line and being able to bind my
> executables and sockets together.
>
> Thanks in advance for your help
>
> Geoffroy
>
> 2009/4/9 <users-request_at_[hidden]>
>
>> Today's Topics:
>>
>> 1. mpirun self,sm (Robert Kubrick)
>> 2. Re: mpirun self,sm (Ralph Castain)
>> 3. shared libraries issue compiling 1.3.1/intel 10.1.022
>> (Francesco Pietra)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 9 Apr 2009 00:15:03 -0400
>> From: Robert Kubrick <robertkubrick_at_[hidden]>
>> Subject: [OMPI users] mpirun self,sm
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <99AB3654-DD6A-4E96-94AC-B741073821ED_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>>
>> How is this possible?
>>
>> dx:~> mpirun -v -np 2 --mca btl self,sm --host dx,sx hostname
>> dx
>> sx
>>
>> dx:~> netstat -i
>> Kernel Interface table
>> Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  998755      0      0      0 1070323      0      0      0 BMRU
>> eth1   1500   0   85352      0      0      0  379993      0      0      0 BMRU
>> ib0   65520   0     204      0      0      0     155      0      5      0 BMRU
>> lo    16436   0 1648874      0      0      0 1648874      0      0      0 LRU
>>
>> I want to force an error with the first command above to make sure
>> that my btl parameters are used correctly, but it looks like ompi
>> runs hostname on the remote machine regardless.
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Thu, 9 Apr 2009 02:16:08 -0600
>> From: Ralph Castain <rhc_at_[hidden]>
>> Subject: Re: [OMPI users] mpirun self,sm
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <FF3FCDB6-3E23-41F6-88BC-7D4F327E40DC_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> hostname never calls MPI_Init, and therefore never initializes the BTL
>> subsystem. Therefore, hostname will always run correctly.
>>
>> mpirun is not an MPI process, nor are the daemons used by OMPI to
>> launch the MPI job. Thus, they also do not call MPI_Init, and
>> therefore do not see the BTL subsystem.
>>
>> So this example will run just fine. You need to run an MPI application
>> to cause it to fail.
>>
>> Ralph
>>
>>
>> On Apr 8, 2009, at 10:15 PM, Robert Kubrick wrote:
>>
>> > How is this possible?
>> >
>> > dx:~> mpirun -v -np 2 --mca btl self,sm --host dx,sx hostname
>> > dx
>> > sx
>> >
>> > dx:~> netstat -i
>> > Kernel Interface table
>> > Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
>> > eth0   1500   0  998755      0      0      0 1070323      0      0      0 BMRU
>> > eth1   1500   0   85352      0      0      0  379993      0      0      0 BMRU
>> > ib0   65520   0     204      0      0      0     155      0      5      0 BMRU
>> > lo    16436   0 1648874      0      0      0 1648874      0      0      0 LRU
>> >
>> > I want to force an error with the first command above to make sure
>> > that my btl parameters are used correctly, but it looks like ompi
>> > runs hostname on the remote machine regardless.
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Thu, 9 Apr 2009 17:31:16 +0200
>> From: Francesco Pietra <chiendarret_at_[hidden]>
>> Subject: [OMPI users] shared libraries issue compiling 1.3.1/intel
>> 10.1.022
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID:
>> <b87c422a0904090831q56a98e67w5000c90a94bf8a37_at_[hidden]>
>> Content-Type: text/plain; charset=UTF-8
>>
>> Hi:
>> Since the failure to find "limits.h" in my attempted compilations of Amber
>> over the past few days (amd64 lenny, openmpi 1.3.1, intel compilers
>> 10.1.015) is probably (or so I hope) a bug in that version of the
>> intel compilers (on debian I made the same observations as reported
>> for gentoo,
>> http://software.intel.com/en-us/forums/intel-c-compiler/topic/59886/),
>> I made a deb package of 10.1.022, icc and ifort.
>>
>> ./configure CC icc, CXX icp, F77 and FC ifort --with-libnuma=/usr (not
>> /usr/lib so that the numa.h issue is not raised), "make clean", and
>> "make install" gave no errors. However, the compilation tests in
>> the examples did not pass and I really don't understand why.
>>
>> The error, with both connectivity_c and hello_c (I was operating on
>> the parallel computer deb64 directly and have access to everything
>> there) was failure to find a shared library, libimf.so
>>
>> # dpkg --search libimf.so
>> /opt/intel/fce/10.1.022/lib/libimf.so (the same for cce)
>>
>> This path seems to be correctly set by sourcing iccvars.sh and
>> ifortvars.sh (incidentally, both files are -rw-r--r-- root root;
>> is it correct that they are not executable?)
>>
>> echo $LD_LIBRARY_PATH
>> returned, inter alia,
>> /opt/intel/mkl/
>> 10.1.2.024/lib/em64t:/opt/intel/mkl/10.1.2.024/lib/em64t:/opt/intel/cce/10.1.022/lib:/opt/intel/fce/10.1.022/lib
>> (why twice the mkl?)
>>
>> I am surely failing to understand something fundamental. I hope other
>> eyes see better.
>>
>> A kind person elsewhere suggested to me in passing: "The use of -rpath
>> during linking is highly recommended as opposed to setting
>> LD_LIBRARY_PATH at run time, not the least because it hardcodes the
>> paths to the "right" library files in the executables themselves".
>> Should this be relevant to the present issue, where can I learn about
>> -rpath linking?
>>
>> thanks
>> francesco pietra
>>
>>
>
>