
Open MPI User's Mailing List Archives


From: SLIM H.A. (h.a.slim_at_[hidden])
Date: 2007-07-10 07:20:27


Dear Tim

> So, you should just be able to run:
> mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi

I tried

node001>mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi

I put in a sleep call to keep it running for some time and to monitor
the endpoints. None of the 4 were open; it must have used tcp.
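Roughly, the check looked like this (a sketch only: mx_endpoint_info is the MX tool mentioned later in this thread, and the sleep is the one I added to cpi; the guard is just so the sketch is harmless on a machine without MX):

```shell
# Start the job in the background, then inspect the MX endpoint table
# while the ranks are still alive. Degrades to a message without MX.
if command -v mx_endpoint_info >/dev/null 2>&1; then
  mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 \
         -hostfile ompi_machinefile ./cpi &
  sleep 5
  mx_endpoint_info        # should list the endpoints the job opened
  wait || true
  CHECKED=mx
else
  CHECKED=skipped
fi
echo "endpoint check: $CHECKED"
```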
Also when I look at the process table for node001 I find

orted --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
node001 --universe dcl0has_at_node001:default-universe-17750 --nsreplica
"0.0.0;tcp://10.141.0.1:43640" --gprreplica
"0.0.0;tcp://10.141.0.1:43640" --set-sid

The argument "--num_procs 2" seems odd; I would expect 4?

Henk
 
> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> Sent: 09 July 2007 16:34
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
>
> SLIM H.A. wrote:
> >
> > Dear Tim and Scott
> >
> > I followed the suggestions made:
> >
> >> So you should either pass '-mca btl mx,sm,self', or just pass
> >> nothing at all.
> >> Open MPI is fairly smart at figuring out what components to use,
> >> so you really should not need to specify anything.
> >>
> >
> > Using
> >
> > node001>mpirun --mca btl mx,sm,self -np 4 -hostfile ompi_machinefile ./cpi
> >
> > connects to some of the mx ports, not all 4, but the program runs:
> >
> > [node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> > [node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with status=20
>
> I finally figured out the problem here. What is happening is that
> Open MPI now has 2 different network stacks, only one of which can be
> used at a time: the mtl and the btl. Both the mx btl and the mx mtl
> are being opened, each of which opens an endpoint. The mtl is then
> closed because it will not be used, which releases its endpoint. But
> in the meantime the endpoints are exhausted while others are still
> trying to open them.
>
> There are two solutions:
> 1. Increase the number of available endpoints. According to
> the Myrinet documentation, upping the limit to 16 or so
> should have no performance impact.
>
> 2. Alternatively, you can tell the mx mtl not to run with -mca mtl ^mx
>
> So, you should just be able to run:
> mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi
>
> And it should work.
>
> >
> > It spawned 4 processes on node001. Passing nothing at all gave the
> > same problem.
> >
> >> Also, could you try creating a host file named "hosts" with the
> >> names of your machines and then try:
> >>
> >> $ mpirun -np 2 --hostfile hosts ./cpi
> >>
> >> and then
> >>
> >> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi
> >
> > node001>mpirun -np 2 -hostfile ompi_machinefile ./cpi_gcc_ompi_mx
> >
> > works, but increasing to 4 cores again uses less than 4 ports.
> > Finally,
> >
> > node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
> >
> > is successful even for -np 4. From here I tried 2 nodes:
> >
> > node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm ./cpi_gcc_ompi_mx
> >
> > This gave:
> >
> > orted: Command not found.
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> > [node001:04585] ERROR: A daemon on node node002 failed to start as expected.
> > [node001:04585] ERROR: There may be more information available from
> > [node001:04585] ERROR: the remote shell (see above).
> > [node001:04585] ERROR: The daemon exited unexpectedly with status 1.
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> > [node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
> >
> > --------------------------------------------------------------------------
> > mpirun was unable to cleanly terminate the daemons for this job.
> > Returned value Timeout instead of ORTE_SUCCESS.
> > --------------------------------------------------------------------------
>
> The problem is that on the remote node Open MPI cannot find the
> 'orted' executable. Is the Open MPI install available on the remote
> node?
>
> Try:
> ssh remote_node which orted
>
> This should locate the 'orted' program. If it does not, you
> may need to modify your PATH, as described here:
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
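A sketch of that PATH fix, assuming the install prefix from the ompi_info output quoted later in this thread (adjust to your own prefix). The key point is that the non-interactive shells ssh spawns to launch orted must see these settings on every node:

```shell
# Prepend the Open MPI install to PATH/LD_LIBRARY_PATH so that the
# non-interactive ssh shells mpirun uses can find orted.
# Prefix taken from the ompi_info output in this thread.
OMPI_PREFIX=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# On a correctly set-up node this resolves to $OMPI_PREFIX/bin/orted:
command -v orted || echo "orted still not found - check the install on this node"
```

To take effect for ssh launches, these exports belong in a startup file that non-interactive shells read (e.g. ~/.bashrc, depending on your shell).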
>
> Hope this helps,
>
> Tim
>
> >
> > Apparently orted is not started up properly. Is something missing
> > in the installation?
> >
> > Thanks
> >
> > Henk
> >
> >
> >> -----Original Message-----
> >> From: users-bounces_at_[hidden]
> >> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> >> Sent: 06 July 2007 15:59
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >>
> >> Henk,
> >>
> >> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> >>> Dear Tim
> >>>
> >>> I followed the use of "--mca btl mx,self" as suggested in the FAQ
> >>>
> >>> http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> >> Yeah, that FAQ is wrong. I am working right now to fix it up.
> >> It should be updated this afternoon.
> >>
> >>> When I use your extra mca value I get:
> >>>> mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >>>
> >>
> >>>> --------------------------------------------------------------------------
> >>>
> >>>> WARNING: A user-supplied value attempted to override the
> >>>> read-only MCA parameter named "btl_mx_shared_mem".
> >>>>
> >>>> The user-supplied value was ignored.
> >> Oops, on the 1.2 branch this is a read-only parameter. On the
> >> current trunk the user can change it. Sorry for the confusion. Oh
> >> well, you should probably use Open MPI's shared memory support
> >> instead anyway.
> >>
> >> So you should either pass '-mca btl mx,sm,self', or just pass
> >> nothing at all.
> >> Open MPI is fairly smart at figuring out what components to use,
> >> so you really should not need to specify anything.
> >>
> >>> followed by the same error messages as before.
> >>>
> >>> Note that although I add "self", the error messages complain
> >>> about it missing:
> >>>
> >>>>> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> >>>>> If you specified the use of a BTL component, you may have
> >>>>> forgotten a component (such as "self") in the list of usable
> >>>>> components.
> >>> I checked the output from mx_info for both the current node and
> >>> another; there seems not to be a problem.
> >>> I attach the output from ompi_info --all. Also
> >>>
> >>>> ompi_info | grep mx
> >>> Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> >>> MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> >>> MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >>>
> >>> As a further check, I rebuilt the exe with mpich and that works
> >>> fine on the same node over myrinet. I wonder whether mx is
> >>> properly included in my openmpi build.
> >>> Use of ldd -v on the mpich exe gives references to
> >>> libmyriexpress.so, which is not the case for the ompi-built exe,
> >>> suggesting something is missing?
> >> No, this is expected behavior. The Open MPI executables are not
> >> linked to libmyriexpress.so, only the mx components. So if you do
> >> an ldd on
> >> /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
> >> this should show the Myrinet library.
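That check can be sketched as follows (the component path is the one from this thread's install; on any other system, adjust the prefix):

```shell
# The app binary is not linked to libmyriexpress; the MX component is.
# So check the component's dependencies, not the executable's.
MCA_BTL_MX=/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so
if [ -f "$MCA_BTL_MX" ]; then
  ldd "$MCA_BTL_MX" | grep -i myriexpress || echo "no MX linkage - bad build"
else
  echo "component not found at $MCA_BTL_MX - adjust the prefix"
fi
```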
> >>
> >>> I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> >>> and a listing of that directory is
> >>>
> >>>> ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> >>> bin etc include lib lib32 lib64 sbin
> >>>
> >>> This should be sufficient; I don't need --with-mx-libdir?
> >> Correct.
> >>
> >>
> >> Hope this helps,
> >>
> >> Tim
> >>
> >>> Thanks
> >>>
> >>> Henk
> >>>
> >>>> -----Original Message-----
> >>>> From: users-bounces_at_[hidden]
> >>>> [mailto:users-bounces_at_[hidden]] On Behalf Of Tim Prins
> >>>> Sent: 05 July 2007 18:16
> >>>> To: Open MPI Users
> >>>> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >>>>
> >>>> Hi Henk,
> >>>>
> >>>> By specifying '--mca btl mx,self' you are telling Open MPI not
> >>>> to use its shared memory support. If you want to use Open MPI's
> >>>> shared memory support, you must add 'sm' to the list,
> >>>> i.e. '--mca btl mx,sm,self'. If you would rather use MX's shared
> >>>> memory support, instead use '--mca btl mx,self --mca
> >>>> btl_mx_shared_mem 1'.
> >>>> However, in most cases I believe Open MPI's shared memory
> >>>> support is a bit better.
> >>>>
> >>>> Alternatively, if you don't specify any btls, Open MPI should
> >>>> figure out what to use automatically.
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Tim
> >>>>
> >>>> SLIM H.A. wrote:
> >>>>> Hello
> >>>>>
> >>>>> I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> >>>>> configuration and gcc compiler. On testing with 4-8 slots I
> >>>>> get an error message; the mx ports are busy:
> >>>>>> mpirun --mca btl mx,self -np 4 ./cpi
> >>>>> [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>> [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>> [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with status=20
> >>>>
> >>>>> --------------------------------------------------------------------------
> >>>>> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> >>>>> If you specified the use of a BTL component, you may have
> >>>>> forgotten a component (such as "self") in the list of usable
> >>>>> components.
> >>>>> ... snipped
> >>>>> It looks like MPI_INIT failed for some reason; your parallel
> >>>>> process is likely to abort. There are many reasons that a
> >>>>> parallel process can fail during MPI_INIT; some of which are
> >>>>> due to configuration or environment problems. This failure
> >>>>> appears to be an internal failure; here's some additional
> >>>>> information (which may only be relevant to an Open MPI
> >>>>> developer):
> >>>>>
> >>>>> PML add procs failed
> >>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
> >>>>> --------------------------------------------------------------------------
> >>>>> *** An error occurred in MPI_Init
> >>>>> *** before MPI was initialized
> >>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
> >>>>> mpirun noticed that job rank 0 with PID 10071 on node node001
> >>>>> exited on signal 1 (Hangup).
> >>>>>
> >>>>>
> >>>>> I would not expect mx messages, as communication should not
> >>>>> go through the mx card? (This is a twin dual-core shared
> >>>>> memory node.) The same happens when testing on 2 nodes, using
> >>>>> a hostfile.
> >>>>> I checked the state of the mx card with mx_endpoint_info and
> >>>>> mx_info; they are healthy and free.
> >>>>> What is missing here?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> Henk
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> users_at_[hidden]
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>