Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Multiple Subnet MPI Fail
From: Paul Monday (Parallel Scientific) (paul.monday_at_[hidden])
Date: 2010-11-22 18:19:18

Thanks for the quick response ... I've been thinking about this today and tried a few things on my CentOS mini connected cluster ...

To use the tcp BTL, I will have to set up a bridge on A that includes both ib0 and ib1; the tcp BTL could then be used as you suggest. Unfortunately, the obvious solution, bridge-utils on CentOS, does not support InfiniBand adapters.
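
For reference, here is a rough sketch of how the tcp BTL would be selected on the mpirun command line once B and C can reach each other over IP. It only assembles and prints the command; it does not solve the bridging problem. The `--mca btl tcp,self,sm` selection is standard Open MPI usage, but the host name pg-A (guessed by analogy with pg-B and pg-C), the program path ./my_mpi_program, the process count, and the helper name build_mpirun_command are all assumptions made for illustration.

```python
import shlex


def build_mpirun_command(hosts, program, num_procs):
    """Return an mpirun argv that uses the tcp BTL instead of openib.

    All arguments are illustrative placeholders, not values from the
    actual cluster.
    """
    return [
        "mpirun",
        "-np", str(num_procs),
        "--host", ",".join(hosts),
        # Limit Open MPI to the TCP, loopback ("self"), and
        # shared-memory ("sm") BTLs so that openib is not attempted.
        "--mca", "btl", "tcp,self,sm",
        program,
    ]


if __name__ == "__main__":
    # Hypothetical host list and program name.
    cmd = build_mpirun_command(["pg-A", "pg-B", "pg-C"], "./my_mpi_program", 6)
    print(shlex.join(cmd))
```

This still fails with the same "unable to reach each other" error unless IP traffic between B and C can actually pass through A.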

This is now straying out of MPI range into a networking issue ... any ideas on bridging at the IP-over-IB tier in a cluster would be greatly appreciated. This must be a solved problem, but I'm not having much luck with Google and the archives.

Paul Monday

On Nov 22, 2010, at 7:46 AM, Terry Dontje wrote:

> You're going to have to use a protocol that can route through a machine; OFED User Verbs (i.e., openib) does not do this. The only way I know of to do this via OMPI is with the tcp BTL.
> --td
> On 11/22/2010 09:28 AM, Paul Monday (Parallel Scientific) wrote:
>> We've been using OpenMPI in a switched environment with success, but we've moved to a point-to-point environment to do some work. Some of the nodes cannot talk directly to one another, sort of like this with computers A, B, and C, with A having two ports:
>> A(1)(opensm)------>B
>> A(2)(opensm)------>C
>> B is not connected to C in any way.
>> When we try to run our OpenMPI program we are receiving:
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> Process 1 ([[1581,1],5]) is on host: pg-B
>> Process 2 ([[1581,1],0]) is on host: pg-C
>> BTLs attempted: openib self sm
>> Your MPI job is now going to abort; sorry.
>> I hope I'm not being overly naive, but is there a way to join the subnets at the MPI layer? It seems like IP over IB would be too high up the stack.
>> Paul Monday
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]