Open MPI User's Mailing List Archives

From: SLIM H.A. (h.a.slim_at_[hidden])
Date: 2007-07-06 05:34:35


Dear Tim

I followed the use of "--mca btl mx,self" as suggested in the FAQ

http://www.open-mpi.org/faq/?category=myrinet#myri-btl

 
When I use your extra mca value I get:

>mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> --------------------------------------------------------------------------
> WARNING: A user-supplied value attempted to override the read-only MCA
> parameter named "btl_mx_shared_mem".
> The user-supplied value was ignored.
followed by the same error messages as before.
Note that although I added "self", the error messages complain that it is
missing:
> > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > If you specified the use of a BTL component, you may have forgotten a
> > component (such as "self") in the list of usable components.
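As a side question: is there a way to confirm how this build registers
btl_mx_shared_mem? Assuming ompi_info can list a component's parameters
via its --param option, something like the following should at least show
whether the parameter is registered at all (the grep pattern is just my
guess):

>ompi_info --param btl mx | grep shared_mem   # parameter name taken from the warning above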
I checked the output from mx_info on both the current node and another;
there does not seem to be a problem.
I attach the output from ompi_info --all.
Also:
>ompi_info | grep mx
                  Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
                 MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
As a further check, I rebuilt the executable with MPICH, and that works
fine on the same node over Myrinet. I wonder whether MX is properly
included in my Open MPI build.
Running ldd -v on the MPICH executable shows references to
libmyriexpress.so, which is not the case for the Open MPI-built
executable; does that suggest something is missing?
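If it is relevant: since Open MPI normally loads MX support from a plugin
rather than linking libmyriexpress into the application itself, perhaps
the more telling check is on the component. Assuming the MX BTL was built
as a dynamic shared object under the installation prefix (the
mca_btl_mx.so path below is my assumption), that would be something like:

>ldd /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so | grep myriexpress   # check the plugin, not the executable

If libmyriexpress.so shows up there, its absence from ldd on the
executable itself would not necessarily mean anything is missing.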
I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1, and a listing of
that directory is:
>ls /usr/local/Cluster-Apps/mx/mx-1.1.1
bin  etc  include  lib  lib32  lib64  sbin
This should be sufficient, so I don't need --with-mx-libdir?
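If the 64-bit MX libraries live in lib64 rather than lib, I wonder whether
configure should be pointed at them explicitly. A sketch of what I have in
mind (the --with-mx-libdir value is only my guess, not something I have
tried):

>./configure --prefix=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3 \
     --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1 \
     --with-mx-libdir=/usr/local/Cluster-Apps/mx/mx-1.1.1/lib64   # assumed path to the 64-bit libs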
Thanks
Henk
> -----Original Message-----
> From: users-bounces_at_[hidden] 
> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> Sent: 05 July 2007 18:16
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> 
> Hi Henk,
> 
> By specifying '--mca btl mx,self' you are telling Open MPI 
> not to use its shared memory support. If you want to use Open 
> MPI's shared memory support, you must add 'sm' to the list. 
> I.e. '--mca btl mx,sm,self'. If you would rather use MX's shared 
> memory support, instead use '--mca btl mx,self --mca 
> btl_mx_shared_mem 1'. However, in most cases I believe Open 
> MPI's shared memory support is a bit better.
> 
> Alternatively, if you don't specify any btls, Open MPI should 
> figure out what to use automatically.
> 
> Hope this helps,
> 
> Tim
> 
> SLIM H.A. wrote:
> > Hello
> > 
> > I have compiled openmpi-1.2.3 with the --with-mx=<directory> 
> > configuration and gcc compiler. On testing with 4-8 slots I get an 
> > error message, the mx ports are busy:
> > 
> >> mpirun --mca btl mx,self -np 4 ./cpi
> > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > 
> > --------------------------------------------------------------------------
> > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > If you specified the use of a BTL component, you may have forgotten a
> > component (such as "self") in the list of usable components.
> > ... snipped
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> > 
> >   PML add procs failed
> >   --> Returned "Unreachable" (-12) instead of "Success" (0)
> > 
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** before MPI was initialized
> > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > mpirun noticed that job rank 0 with PID 10071 on node node001 exited
> > on signal 1 (Hangup).
> > 
> > 
> > I would not expect mx messages as communication should not go through
> > the mx card? (This is a twin dual core shared memory node.) The same
> > happens when testing on 2 nodes, using a hostfile.
> > I checked the state of the mx card with mx_endpoint_info and mx_info,
> > they are healthy and free.
> > What is missing here?
> > 
> > Thanks
> > 
> > Henk
> > 
> 
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>