Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem for multiple clusters using mpirun
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-21 09:56:26


On Mar 21, 2014, at 8:52 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> Looks like you don't have an IB connection between "master" and "node001"

+1

Presumably you have InfiniBand (or RoCE, or iWARP) installed on your cluster, right? Otherwise, the openib BTL won't be useful to you.

Note that most of the time, Open MPI will auto-pick the right BTLs for you; there's usually no need to specify "--mca btl ...". You can usually just run:

    mpirun -n 2 --host master,node001 your_mpi_program

and Open MPI will do the Right Thing.

To be clear: you usually only need to specify the BTL clause in odd circumstances.
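
For those odd circumstances, here is a minimal sketch of the forms a BTL selection can take. It is a dry run: it only prints the command lines and launches nothing. The host names and program name come from this thread; the helper `mpirun_cmd` is hypothetical, and the `^openib` exclusion form is my addition (Open MPI accepts either an include list or a `^`-prefixed exclude list, but not both in one value).

```shell
#!/bin/sh
# Dry run: print mpirun command lines instead of launching anything.
HOSTS="master,node001"     # host names taken from this thread
PROG="./helloworldmpi"     # program name taken from this thread

# Hypothetical helper, for illustration only.
mpirun_cmd() {
    # $1 = value for "--mca btl"; empty means let Open MPI auto-select.
    if [ -n "$1" ]; then
        echo "mpirun -n 2 --host $HOSTS --mca btl $1 $PROG"
    else
        echo "mpirun -n 2 --host $HOSTS $PROG"
    fi
}

mpirun_cmd ""               # default: Open MPI picks the BTLs itself
mpirun_cmd "tcp,sm,self"    # include list: use only these BTLs
mpirun_cmd "^openib"        # exclude list: anything except openib
```

The default form is the one to try first; the other two are only worth reaching for when auto-selection picks something you don't want.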

In this case, you're explicitly requesting the openib BTL, which means Open MPI will try to use InfiniBand, RoCE, or iWARP networking between the master and node001 servers. If you don't have that kind of connectivity between those servers (or if you didn't build Open MPI with verbs/OpenFabrics support), you get exactly this error: Open MPI is basically saying "you don't seem to have InfiniBand / RoCE / iWARP connectivity between the master server and the node001 server".
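
A rough way to check both conditions on a node is sketched below. It assumes `ompi_info` (shipped with Open MPI) and `ibv_devinfo` (from the libibverbs utilities) are the relevant tools. Either may be missing on your system, so the script checks for each before using it. The function names `have` and `check_openib` are hypothetical helpers, not part of Open MPI.

```shell
#!/bin/sh
# Sketch: does this node look able to use the openib BTL?
# Assumption: ompi_info and ibv_devinfo are the tools to consult.

have() {
    # Succeeds if the named command is on PATH.
    command -v "$1" >/dev/null 2>&1
}

check_openib() {
    if ! have ompi_info; then
        echo "ompi_info not found: is Open MPI on PATH?"
        return 1
    fi
    # ompi_info prints one "MCA btl: <name> ..." line per built component.
    if ompi_info | grep -q "MCA btl: openib"; then
        echo "openib BTL is built into this Open MPI"
    else
        echo "openib BTL is NOT built into this Open MPI"
        return 1
    fi
    if have ibv_devinfo; then
        # Lists local verbs devices and their port states.
        ibv_devinfo
    else
        echo "ibv_devinfo not found: cannot list verbs devices"
    fi
}

check_openib || echo "openib does not look usable on this node"
```

Run it on both master and node001. Even if both nodes pass, this only shows that each has the software and a local device; it does not prove the two are actually connected over the same fabric.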

> On Mar 21, 2014, at 12:43 AM, Hamid Saeed <e.hamidsaeed_at_[hidden]> wrote:
>
>> Hello All:
>>
>> I know there will be someone who can help me solve this problem.
>>
>> • I can compile my helloworld.c program using mpicc, and I have confirmed that the program runs correctly on another working cluster, so I think the local paths are set up correctly and the program itself works.
>>
>> • If I execute mpirun from my master node, and using only the master node, helloworld executes correctly:
>>
>> mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
>> hello world from process 0 of 1
>>
>> • If I execute mpirun from my master node, using only the worker node, helloworld executes correctly:
>>
>> mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
>> hello world from process 0 of 1
>>
>> Now, my problem is that if I try to run helloworld on both nodes, I get an error:
>>
>> mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[5228,1],0]) is on host: hsaeed
>> Process 2 ([[5228,1],1]) is on host: node001
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 7037 on
>> node xxxx exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> *** The MPI_Init() function was called before MPI_INIT was invoked.
>> *** This is disallowed by the MPI standard.
>> *** Your MPI job will now abort.
>> Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
>> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> 1 more process has sent help message help-mpi-runtime
>>
>>
>> I tried using
>> mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
>> mpirun -n 2 -host master,node001 --mca btl openib,tcp,self ./helloworldmpi
>> etc.
>>
>> But no flag works.
>>
>>
>> Can someone reply with an idea?
>>
>> Thanks in Advance.
>>
>> Regards--
>> --
>> _______________________________________________
>> Hamid Saeed
>> _______________________________________________
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/