Open MPI User's Mailing List Archives

From: SLIM H.A. (h.a.slim_at_[hidden])
Date: 2007-07-09 11:10:12

Dear Tim and Scott

I followed the suggestions made:

>
> So you should either pass '-mca btl mx,sm,self', or just pass
> nothing at all.
> Open MPI is fairly smart at figuring out what components to
> use, so you really should not need to specify anything.
>

Using

node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile
./cpi

connects to some of the mx ports, not all 4, but the program runs:

[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with
status=20

It spawned 4 processes on node001. Passing nothing at all gave the same
problem.
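
For what it's worth, status=20 is presumably the "endpoint busy" error from
the subject line. Assuming the MX tools are in the PATH, a quick way to see
how many endpoints the node actually has free before launching is:

node001>mx_endpoint_info

If the card exposes fewer free endpoints than the number of processes
requested, that would explain why only some of the 4 could open one.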

> Also, could you try creating a host file named "hosts" with
> the names of your machines and then try:
>
> $ mpirun -np 2 --hostfile hosts ./cpi
>
> and then
>
> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi

node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx

works, but increasing to 4 cores again uses fewer than 4 ports.
Finally,

node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

is successful even for -np 4.
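
If I understand correctly, "--mca pml cm" sends MPI traffic through the MX
MTL instead of the mx BTL, which would explain why the endpoint problem
goes away. A quick check that the mx MTL component was built, along the
lines of the ompi_info check quoted further down:

node001>ompi_info | grep mtl
                 MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)

From here I tried 2 nodes: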

node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

This gave:

orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as
expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
Apparently orted is not started up properly on the remote node. Is
something missing in the installation?
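
My first guess, assuming passwordless ssh between the nodes is already set
up, is that orted is simply not in the PATH of a non-interactive shell on
node002. Two things I will try, reusing the installation prefix shown in
the ompi_info output below:

node001>ssh node002 which orted
node001>mpirun --prefix /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3 \
        -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx

If the --prefix run works, adding the corresponding bin and lib directories
to PATH and LD_LIBRARY_PATH in the shell startup files should fix it
permanently.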
Thanks
Henk
> -----Original Message-----
> From: users-bounces_at_[hidden] 
> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> Sent: 06 July 2007 15:59
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> 
> Henk,
> 
> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> > Dear Tim
> >
> > I followed the use of "--mca btl mx,self" as suggested in the FAQ
> >
> > http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> Yeah, that FAQ is wrong. I am working right now to fix it up. 
> It should be updated this afternoon.
> 
> >
> > When I use your extra mca value I get:
> > >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >
> > 
> > --------------------------------------------------------------------------
> >
> > > WARNING: A user-supplied value attempted to override the read-only
> > > MCA parameter named "btl_mx_shared_mem".
> > >
> > > The user-supplied value was ignored.
> Oops, on the 1.2 branch this is a read-only parameter. On the
> current trunk the user can change it. Sorry for the
> confusion. Oh well, you should probably use Open MPI's shared
> memory support instead anyway.
> 
> So you should either pass '-mca btl mx,sm,self', or just pass 
> nothing at all. 
> Open MPI is fairly smart at figuring out what components to 
> use, so you really should not need to specify anything.
> 
> > followed by the same error messages as before.
> >
> > Note that although I add "self", the error messages complain about it
> > missing:
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have
> > >
> > > forgotten a
> > >
> > > > component (such as "self") in the list of usable components.
> >
> > I checked the output from mx_info for both the current node and
> > another; there seems not to be a problem.
> > I attach the output from ompi_info --all. Also:
> >
> > >ompi_info | grep mx
> >
> >                   Prefix:
> > /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> >                  MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> >                  MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >
> > As a further check, I rebuilt the exe with mpich and that works fine
> > on the same node over myrinet. I wonder whether mx is properly
> > included in my openmpi build.
> > Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> > which is not the case for the ompi-built exe, suggesting something is
> > missing?
> No, this is expected behavior. The Open MPI executables are
> not linked to libmyriexpress.so, only the mx components. So
> if you do an ldd on
> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
> this should show the Myrinet library.
> 
> > I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> > and a listing of that directory is
> >
> > >ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> >
> > bin  etc  include  lib  lib32  lib64  sbin
> >
> > This should be sufficient; I don't need --with-mx-libdir?
> Correct.
> 
> 
> Hope this helps,
> 
> Tim
> 
> >
> > Thanks
> >
> > Henk
> >
> > > -----Original Message-----
> > > From: users-bounces_at_[hidden]
> > > [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> > > Sent: 05 July 2007 18:16
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> > >
> > > Hi Henk,
> > >
> > > By specifying '--mca btl mx,self' you are telling Open MPI not to
> > > use its shared memory support. If you want to use Open MPI's shared
> > > memory support, you must add 'sm' to the list,
> > > i.e. '--mca btl mx,sm,self'. If you would rather use MX's shared
> > > memory support, instead use '--mca btl mx,self --mca
> > > btl_mx_shared_mem 1'. However, in most cases I believe Open MPI's
> > > shared memory support is a bit better.
> > >
> > > Alternatively, if you don't specify any btls, Open MPI should
> > > figure out what to use automatically.
> > >
> > > Hope this helps,
> > >
> > > Tim
> > >
> > > SLIM H.A. wrote:
> > > > Hello
> > > >
> > > > I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> > > > configuration and gcc compiler. On testing with 4-8 slots I get an
> > > > error message that the mx ports are busy:
> > > >> mpirun --mca btl mx,self -np 4 ./cpi
> > > >
> > > > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > >
> > > > --------------------------------------------------------------------------
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have forgotten
> > > > a component (such as "self") in the list of usable components.
> > > > ... snipped
> > > > It looks like MPI_INIT failed for some reason; your parallel
> > > > process is likely to abort.  There are many reasons that a
> > > > parallel process can fail during MPI_INIT; some of which are due
> > > > to configuration or environment problems.  This failure appears
> > > > to be an internal failure; here's some additional information
> > > > (which may only be relevant to an Open MPI developer):
> > > >
> > > >   PML add procs failed
> > > >   --> Returned "Unreachable" (-12) instead of "Success" (0)
> > >
> > > > --------------------------------------------------------------------------
> > > > *** An error occurred in MPI_Init
> > > > *** before MPI was initialized
> > > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > > > mpirun noticed that job rank 0 with PID 10071 on node node001
> > > > exited on signal 1 (Hangup).
> > > >
> > > >
> > > > I would not expect mx messages, as communication should not go
> > > > through the mx card? (This is a twin dual-core shared memory
> > > > node.) The same happens when testing on 2 nodes, using a hostfile.
> > > > I checked the state of the mx card with mx_endpoint_info and
> > > > mx_info; they are healthy and free.
> > > > What is missing here?
> > > >
> > > > Thanks
> > > >
> > > > Henk
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>