Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Steve Kargl (sgk_at_[hidden])
Date: 2011-07-08 19:34:54


On Fri, Jul 08, 2011 at 07:03:13PM -0400, Jeff Squyres wrote:
> Sorry -- I got distracted all afternoon...

No problem. We all have obligations that we prioritize.

> In addition to what Ralph said (i.e., I'm not sure if the
> CIDR notation stuff made it over to the v1.5 branch or not,
> but it is available from the nightly SVN trunk tarballs:
> http://www.open-mpi.org/nightly/trunk/), here's a few points
> from other mails in this thread...

I'll try this out sometime next week.

> 1. Gus is correct that OMPI is complaining that bge1 doesn't
> exist on all nodes. The MCA parameters that you pass on the
> command line get shipped to *all* MPI processes, and therefore
> generally need to work on all of them. If you have per-host
> MCA parameter values, you can set them a few different ways:
>
> - have a per-host MCA param file, usually in
> $prefix/etc/openmpi-mca-params.conf
> - have your shell startup files intelligently determine which
> host you're on and set the corresponding MCA environment variable
> as appropriate (e.g., on the head node, set the env variable
> OMPI_MCA_btl_tcp_if_include to bge1, and set it to bge0 on the others)
>
> Those are a little klunky, but having a heterogeneous setup like this
> is not common, so we haven't really optimized the ability to set
> different MCA params on different servers.
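
As an illustration of the second option, here is a minimal sketch of the
per-host selection logic in Python. The environment variable name and the
bge0/bge1 split come from the text above; treating node10 as the one host
that needs bge1, and the helper names tcp_if_for_host and mca_env, are
assumptions made for the example.

```python
import os
import socket

# Assumption: node10 is the only host whose cluster-facing NIC is bge1;
# every other host uses bge0 (interface names taken from this thread).
PER_HOST_IF = {"node10": "bge1"}
DEFAULT_IF = "bge0"


def tcp_if_for_host(hostname):
    """Return the interface name to use on the given host."""
    short = hostname.split(".", 1)[0]  # drop any domain suffix
    return PER_HOST_IF.get(short, DEFAULT_IF)


def mca_env(hostname=None):
    """Build the environment entry that selects the TCP BTL interface."""
    if hostname is None:
        hostname = socket.gethostname()
    return {"OMPI_MCA_btl_tcp_if_include": tcp_if_for_host(hostname)}


if __name__ == "__main__":
    # Only affects this process and its children.
    os.environ.update(mca_env())
    print(os.environ["OMPI_MCA_btl_tcp_if_include"])
```

The script only computes the value. It would still have to be exported
from each host's shell startup file (or otherwise reach the environment
of the MPI processes) to have any effect.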

There is no compelling reason for me to keep bge0 on the 10.208.
subnet and bge1 on the 192.168 subnet on node10. If I switch
the two so that all bge0 NICs are on 192.168., then I suppose
that --mca btl_tcp_if_include bge0 should work. I'll try
this next week, if I can kick everyone off the cluster for
a few minutes.
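
A quick way to sanity-check the planned layout before rerunning would be
to confirm that every bge0 address falls inside one subnet. The sketch
below uses Python's standard ipaddress module. The host addresses and
the /16 prefix length are made up for illustration, since only the
10.208. and 192.168. prefixes are given here, and same_subnet is a
hypothetical helper, not anything from Open MPI.

```python
import ipaddress


def same_subnet(addresses, network):
    """Return True if every address lies inside the given CIDR network."""
    net = ipaddress.ip_network(network)
    return all(ipaddress.ip_address(a) in net for a in addresses)


# Hypothetical bge0 addresses after the swap (not taken from the cluster).
bge0_after_swap = {
    "node10": "192.168.0.10",
    "node11": "192.168.0.11",
}

if __name__ == "__main__":
    # Assumed /16 prefix; adjust to the real netmask.
    print(same_subnet(bge0_after_swap.values(), "192.168.0.0/16"))
```

If any bge0 address were still on the 10.208. side, the check would
return False, which is the situation the swap is meant to remove.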

> 2. I am curious to figure out why the automatic reachability
> computation isn't working for you. Unfortunately, the code
> to compute the reachability is pretty gnarly. :-\ The code
> to find the IP interfaces on your machines is in opal/util/if.c.
> That *should* be working -- there's *BSD-specific code in there
> that has been verified by others in the past... but who knows?
> Perhaps it has bit-rotted...?

I'm running a Feb 2011 version of bleeding-edge FreeBSD,
which will become FreeBSD 9.0 in a few months. Perhaps
something has changed in FreeBSD's networking code. I'll
see if I can understand opal/util/if.c sufficiently to see
what's happening.

> The code to take these IP interfaces
> and figure out if a given peer is reachable is in
> ompi/mca/btl/tcp/btl_tcp_proc.c:mca_btl_tcp_proc_insert().
> This requires a little explanation...

(snip to keep this short)

> This was a long explanation -- I hope it helps...
> Is there any chance you could dig into this to see what's going on?

Thanks, I'll see what I can ferret out of the system.

> We unfortunately don't have access to any BSD machines to test this
> on, ourselves. It works on other OS's, so I'm curious as to why it
> doesn't seem to work for you. :-(

I can arrange access on the cluster in question. ;-)

-- 
Steve