
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Prioritization of --mca btl openib,tcp,self
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-11-23 07:12:04


On 11/22/2010 08:18 PM, Paul Monday (Parallel Scientific) wrote:
> This is a follow-up to an earlier question: I'm trying to understand how --mca btl prioritizes its choice for connectivity. Going back to my original network, there are actually two networks running around. The first is a point-to-point InfiniBand network that looks like this (with two fabrics):
>
> A(port 1)(opensm)------>B
> A(port 2)(opensm)------>C
>
> The original question queried whether there was a way to avoid the problem of B and C not being able to talk to each other if I were to run
>
> mpirun -host A,B,C --mca btl openib,self -d /mnt/shared/apps/myapp
>
> "At least one pair of MPI processes are unable to reach each other for
> MPI communications." ...
>
> There is an additional network, though: an Ethernet management network that connects to all nodes. If our program retrieves the ranks from the nodes using TCP and then can shift to openib, that would be interesting and, in fact, if I run
>
> mpirun -host A,B,C --mca btl openib,tcp,self -d /mnt/shared/apps/myapp
>
> The program does, in fact, run cleanly.
>
> But, the question I have now is does MPI "choose" to use tcp when it can find all nodes and then always use tcp, or will it fall back to openib if it can?
For MPI communications (as opposed to the ORTE communications), the
library will try to pick the most performant protocol available for
communications between each pair of nodes. So in your case, A-B and A-C
should use the openib btl and B-C should use the tcp btl.
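
To make the per-pair choice concrete, here is a minimal Python sketch. It is
only a model of the idea, not Open MPI's actual selection code: the
reachability map, the preference order, and the function name `pick_btl` are
all assumptions made up for this illustration, using the A/B/C topology from
the question.

```python
# Illustrative model only; this is NOT how Open MPI is implemented internally.
# Hypothetical reachability map: which node pairs each BTL can connect.
REACHABLE = {
    # Two point-to-point InfiniBand fabrics: A-B and A-C, but no B-C link.
    "openib": {frozenset("AB"), frozenset("AC")},
    # The Ethernet management network connects every pair.
    "tcp": {frozenset("AB"), frozenset("AC"), frozenset("BC")},
}

# Assumed preference order for this sketch, fastest first.
PREFERENCE = ["openib", "tcp"]


def pick_btl(src, dst, allowed):
    """Return the most preferred allowed BTL that connects src and dst, or None."""
    if src == dst:
        # A process talking to itself uses the loopback "self" BTL.
        return "self" if "self" in allowed else None
    pair = frozenset((src, dst))
    for btl in PREFERENCE:
        if btl in allowed and pair in REACHABLE[btl]:
            return btl
    return None  # no usable BTL: the "unable to reach each other" case


if __name__ == "__main__":
    for allowed in ({"openib", "self"}, {"openib", "tcp", "self"}):
        print("allowed:", sorted(allowed))
        for src, dst in (("A", "B"), ("A", "C"), ("B", "C")):
            print(f"  {src}-{dst}: {pick_btl(src, dst, allowed)}")
```

With only openib and self allowed, the B-C pair has no usable transport in
this model, which corresponds to the error in the first mpirun command. Adding
tcp gives B-C a path while A-B and A-C stay on openib.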
> So ... more succinctly:
> Given a list of btls, such as openib,tcp,self, and a program can only broadcast on tcp but individual operations can occur over openib between nodes, will mpirun use the first interconnect that works for each operation or once it finds one that the broadcast phase works on will it use that one permanently?
If by broadcast you mean MPI_Bcast, this is actually done using point to
point algorithms so the communications will happen over a mixture of IB
and TCP.
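
As a rough illustration of how a broadcast built from point-to-point messages
ends up on a mixture of transports, the Python sketch below lists the
sender/receiver pairs of a binomial-tree broadcast. The binomial tree is an
assumption: it is one common way to implement MPI_Bcast, and the algorithm
Open MPI actually selects may differ. The host list, the link table, and both
function names are hypothetical.

```python
# Illustrative model only; the algorithm Open MPI picks for MPI_Bcast may differ.
HOSTS = ["A", "B", "C"]                          # rank 0, 1, 2 (assumed mapping)
IB_LINKS = {frozenset("AB"), frozenset("AC")}    # pairs with an InfiniBand link


def binomial_bcast_edges(root, size):
    """Return (sender, receiver) rank pairs of a binomial-tree broadcast."""
    edges = []
    for rel in range(1, size):
        # In a binomial tree, the parent of relative rank `rel` is `rel`
        # with its lowest set bit cleared.
        parent_rel = rel & (rel - 1)
        sender = (parent_rel + root) % size
        receiver = (rel + root) % size
        edges.append((sender, receiver))
    return edges


def transport(sender, receiver):
    """Model the per-pair choice: openib where a link exists, otherwise tcp."""
    pair = frozenset((HOSTS[sender], HOSTS[receiver]))
    return "openib" if pair in IB_LINKS else "tcp"


if __name__ == "__main__":
    for root in range(len(HOSTS)):
        for s, r in binomial_bcast_edges(root, len(HOSTS)):
            print(f"root={HOSTS[root]}: {HOSTS[s]} -> {HOSTS[r]} via {transport(s, r)}")
```

In this model, a broadcast rooted at A uses only openib, while one rooted at B
or C includes a B-C hop that has to go over tcp.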

If you mean something else by broadcast you'll need to clarify what you
mean because there really isn't a direct use of protocol broadcasts in
MPI or even ORTE to my knowledge.
> And, as a follow-up, can I turn off the attempt to broadcast to touch all nodes?
See above.
> Paul Monday
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>


