Dear Tim and Scott
I followed the suggestions made:
>
> So you should either pass '-mca btl mx,sm,self', or just pass
> nothing at all.
> Open MPI is fairly smart at figuring out what components to
> use, so you really should not need to specify anything.
>
Using
node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi
connects to some of the mx ports, but not all 4; the program still runs:
[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with status=20
It spawned 4 processes on node001. Passing nothing at all gave the same
problem.
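To see at a glance how many endpoints failed to open, the messages can be
tallied from a saved copy of the output. This is only a rough sketch: it
assumes the mpirun output was redirected to a file (run.log is a made-up
name) in which each message sits on a single line, and count_mx_failures
is a helper name invented here for illustration.

```shell
# Tally mx_open_endpoint() failures by status code in a saved mpirun log.
# Assumption: every message is on one line in the file (not wrapped by a
# mail client).
# Example capture, using the command from this mail:
#   mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi > run.log 2>&1
#   count_mx_failures run.log
count_mx_failures() {
    # Keep only the status number from each failure line, then count duplicates.
    sed -n 's/.*mx_open_endpoint() failed with status=\([0-9][0-9]*\).*/\1/p' "$1" \
        | sort \
        | uniq -c \
        | awk '{print "status=" $2 ": " $1}'
}
```

For the run above this would report two failures with status 20, i.e. two
of the four processes did not get an endpoint.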
> Also, could you try creating a host file named "hosts" with
> the names of your machines and then try:
>
> $ mpirun -np 2 --hostfile hosts ./cpi
>
> and then
>
> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi
node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx
works, but increasing to 4 cores again uses fewer than 4 ports.
Finally,
node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
is successful even for -np 4. From here I tried 2 nodes:
node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
This gave:
orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
Apparently orted cannot be found on node002, so the daemon is not started
there. Is something missing in the installation?
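One thing worth ruling out, although it is not confirmed as the cause here:
"orted: Command not found." usually means that the non-interactive shell
started on the remote node does not have the Open MPI bin directory in its
PATH. The sketch below checks this. It assumes passwordless ssh to the node
works; check_orted and the REMOTE_SH variable are names made up for this
example.

```shell
# Check whether orted resolves in a NON-interactive shell on a node, which is
# the kind of shell mpirun's rsh/ssh launcher gets.
# Assumptions: passwordless ssh to the node; "which" exists on the remote side.
# REMOTE_SH is a hypothetical override so rsh (or a stub) can be used instead.
REMOTE_SH=${REMOTE_SH:-ssh}

check_orted() {
    node=$1
    path=$($REMOTE_SH "$node" 'which orted' 2>/dev/null)
    # Some "which" variants print an error text and still exit 0, so only an
    # absolute path counts as success.
    case "$path" in
        /*) echo "$node: orted found at $path" ;;
        *)  echo "$node: orted not found in non-interactive PATH"
            return 1 ;;
    esac
}

# Usage: check_orted node002
# If it is not found, the Open MPI bin directory (under the install prefix)
# probably needs adding to PATH in the remote shell's startup file. mpirun
# also has a --prefix option for this; whether it suits this setup is untested.
```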
Thanks
Henk
> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> Sent: 06 July 2007 15:59
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
>
> Henk,
>
> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> > Dear Tim
> >
> > I followed the use of "--mca btl mx,self" as suggested in the FAQ
> >
> > http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> Yeah, that FAQ is wrong. I am working right now to fix it up.
> It should be updated this afternoon.
>
> >
> > When I use your extra mca value I get:
> > >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >
> >
> > --------------------------------------------------------------------------
> >
> > > WARNING: A user-supplied value attempted to override the read-only
> > > MCA parameter named "btl_mx_shared_mem".
> > >
> > > The user-supplied value was ignored.
> Oops, on the 1.2 branch this is a read-only parameter. On the
> current trunk the user can change it. Sorry for the
> confusion. Oh well, you should probably use Open MPI's shared
> memory support instead anyway.
>
> So you should either pass '-mca btl mx,sm,self', or just pass
> nothing at all.
> Open MPI is fairly smart at figuring out what components to
> use, so you really should not need to specify anything.
>
> > followed by the same error messages as before.
> >
> > Note that although I add "self" the error messages complain about it
> > missing:
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have forgotten a
> > > > component (such as "self") in the list of usable components.
> >
> > I checked the output from mx_info for both the current node and
> > another; there does not seem to be a problem.
> > I attach the output from ompi_info --all. Also:
> >
> > >ompi_info | grep mx
> >
> > Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> > MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> > MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >
> > As a further check, I rebuilt the exe with mpich and that works fine
> > on the same node over myrinet. I wonder whether mx is properly included
> > in my openmpi build.
> > Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> > which is not the case for the ompi built exe, suggesting something is
> > missing?
> No, this is expected behavior. The Open MPI executables are
> not linked to libmyriexpress.so, only the mx components. So
> if you do an ldd on
> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
> this should show the Myrinet library.
>
> > I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> > and a listing of that directory is
> >
> > >ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> >
> > bin etc include lib lib32 lib64 sbin
> >
> > This should be sufficient, I don't need --with-mx-libdir?
> Correct.
>
>
> Hope this helps,
>
> Tim
>
> >
> > Thanks
> >
> > Henk
> >
> > > -----Original Message-----
> > > From: users-bounces_at_[hidden]
> > > [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> > > Sent: 05 July 2007 18:16
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> > >
> > > Hi Henk,
> > >
> > > By specifying '--mca btl mx,self' you are telling Open MPI not to
> > > use its shared memory support. If you want to use Open MPI's shared
> > > memory support, you must add 'sm' to the list,
> > > i.e. '--mca btl mx,sm,self'. If you would rather use MX's shared memory
> > > support, instead use '--mca btl mx,self --mca btl_mx_shared_mem 1'.
> > > However, in most cases I believe Open MPI's shared memory support is
> > > a bit better.
> > >
> > > Alternatively, if you don't specify any btls, Open MPI should figure
> > > out what to use automatically.
> > >
> > > Hope this helps,
> > >
> > > Tim
> > >
> > > SLIM H.A. wrote:
> > > > Hello
> > > >
> > > > I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> > > > configuration and gcc compiler. On testing with 4-8 slots I get an
> > > > error message, the mx ports are busy:
> > > >> mpirun --mca btl mx,self -np 4 ./cpi
> > > >
> > > > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > > > --------------------------------------------------------------------------
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have forgotten a
> > > > component (such as "self") in the list of usable components.
> > > > ... snipped
> > > > It looks like MPI_INIT failed for some reason; your parallel process
> > > > is likely to abort. There are many reasons that a parallel
> > > > process can fail during MPI_INIT; some of which are due to
> > > > configuration or environment problems. This failure appears to be
> > > > an internal failure; here's some additional information (which may
> > > > only be relevant to an Open MPI developer):
> > > >
> > > > PML add procs failed
> > > > --> Returned "Unreachable" (-12) instead of "Success" (0)
> > > > --------------------------------------------------------------------------
> > > > *** An error occurred in MPI_Init
> > > > *** before MPI was initialized
> > > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > > > mpirun noticed that job rank 0 with PID 10071 on node node001 exited
> > > > on signal 1 (Hangup).
> > > >
> > > >
> > > > I would not expect mx messages as communication should not go through
> > > > the mx card? (This is a twin dual core shared memory node.) The same
> > > > happens when testing on 2 nodes, using a hostfile.
> > > > I checked the state of the mx card with mx_endpoint_info and mx_info,
> > > > they are healthy and free.
> > > > What is missing here?
> > > >
> > > > Thanks
> > > >
> > > > Henk
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users_at_[hidden]
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users