Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
From: Geoffroy Pignot (geopignot_at_[hidden])
Date: 2009-04-14 02:19:02


Hi,

I agree that my examples are not very clear. What I want to do is launch a
multi-executable (master/slave) application and benefit from processor
affinity.
Could you show me how to convert the following command to use the -rf option
(whatever the affinity is)?

mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x
options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002
slave.x options4

Thanks for your help

Geoffroy
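
For reference, the placement requested above can be written as a single
rankfile with one line per process, in the same "rank N=host slot=S" form used
in the quoted messages below. This is only a sketch under stated assumptions:
build_rankfile is a hypothetical helper, the slot numbers are placeholders, and
ranks are assumed to be numbered in the order the application contexts appear
on the command line. Whether mpirun 1.3.1 honours one rankfile across
colon-separated application contexts is the open question of this thread.

```python
# Hypothetical helper (not part of Open MPI): build rankfile text in the
# "rank N=host slot=S" form that appears in this thread.
def build_rankfile(placements):
    """placements: list of (host, slot) tuples; list index = MPI rank."""
    lines = [
        f"rank {rank}={host} slot={slot}"
        for rank, (host, slot) in enumerate(placements)
    ]
    return "\n".join(lines) + "\n"


# Assumed ordering: one entry per application context, left to right.
# Slot numbers are illustrative placeholders, not a recommendation.
placements = [
    ("r001n001", 0),  # master.x options1
    ("r001n002", 0),  # master.x options2
    ("r001n001", 1),  # slave.x options3
    ("r001n002", 1),  # slave.x options4
]

rankfile_text = build_rankfile(placements)
print(rankfile_text, end="")
```

The printed text could be saved to a file and passed with -rf. The total
process count across all application contexts (4 here) then has to cover
every rank the file names.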

>
> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky <lenny.verkhovsky_at_[hidden]>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <users_at_[hidden]>
>
> Hi,
>
> The first "crash" is expected, since your rankfile defines ranks 0 and 1
> while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be greater than the largest rank in the rankfile (ranks start at 0).
>
> What exactly are you trying to do?
>
> I tried to reproduce your segfault, but all I got was:
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
> -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> supported MCA v2.0.0) -- ignored
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> opal_carto_base_select failed
> --> Returned value -13 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/runtime/orte_init.c at line 78
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/orted/orted_main.c at line 344
> --------------------------------------------------------------------------
> A daemon (pid 11629) died unexpectedly with status 243 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> Lenny.
>
>
> On 4/10/09, Geoffroy Pignot <geopignot_at_[hidden]> wrote:
> >
> > Hi ,
> >
> > I am currently testing the process affinity capabilities of Open MPI and I
> > would like to know whether the rankfile behaviour I describe below is
> > normal or not.
> >
> > cat hostfile.0
> > r011n002 slots=4
> > r011n003 slots=4
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> >
> >
> > ##################################################################################
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### OK
> > r011n002
> > r011n003
> >
> >
> >
> > ##################################################################################
> > but
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname ### CRASHED
> > --------------------------------------------------------------------------
> > Error, invalid rank (1) in the rankfile (rankfile.0)
> > --------------------------------------------------------------------------
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > rmaps_rank_file.c at line 404
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/rmaps_base_map_job.c at line 87
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/plm_base_launch_support.c at line 77
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > plm_rsh_module.c at line 985
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> > launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > orterun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > orterun: clean termination accomplished
> >
> > It seems that the rankfile option is not propagated to the second command
> > line; there is no global understanding of the ranking inside an mpirun
> > command.
> >
> >
> >
> > ##################################################################################
> >
> > On that assumption, I tried to provide a rankfile for each command line:
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> >
> > cat rankfile.1
> > rank 0=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname ### CRASHED
> > [r011n002:28778] *** Process received signal ***
> > [r011n002:28778] Signal: Segmentation fault (11)
> > [r011n002:28778] Signal code: Address not mapped (1)
> > [r011n002:28778] Failing at address: 0x34
> > [r011n002:28778] [ 0] [0xffffe600]
> > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> > [r011n002:28778] *** End of error message ***
> > Segmentation fault (core dumped)
> >
> >
> >
> > I hope that I've found a bug, because it is very important for me to have
> > this kind of capability: launching a multi-executable mpirun command line
> > and being able to bind my executables and sockets together.
> >
> > Thanks in advance for your help
> >
> > Geoffroy
>
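
The rule from Lenny's reply quoted above (a rankfile may only name ranks that
exist for the given process count) can be checked before launching. The sketch
below is a hypothetical pre-flight check, not an Open MPI tool: the function
names are made up for illustration, and the parser only understands the simple
"rank N=host slot=S" lines shown in this thread.

```python
import re

# Matches the leading "rank N=" part of a rankfile line as seen in this thread.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=")


def ranks_in_rankfile(text):
    """Return the rank numbers named in the rankfile text, in file order."""
    ranks = []
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            ranks.append(int(match.group(1)))
    return ranks


def np_covers_rankfile(text, np):
    """True if every rank in the rankfile is below np (ranks start at 0)."""
    return all(rank < np for rank in ranks_in_rankfile(text))


# rankfile.0 from the original report: ranks 0 and 1.
rankfile_0 = "rank 0=r011n002 slot=0\nrank 1=r011n003 slot=1\n"

print(np_covers_rankfile(rankfile_0, 2))  # -n 2: both ranks exist
print(np_covers_rankfile(rankfile_0, 1))  # -n 1: rank 1 does not exist
```

With -n 2 the check passes, and with -n 1 it fails, which matches the
"invalid rank (1)" error reported for the first crash.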