On Fri, Jul 08, 2011 at 07:03:13PM -0400, Jeff Squyres wrote:
> Sorry -- I got distracted all afternoon...
No problem. We all have obligations that we prioritize.
> In addition to what Ralph said (i.e., I'm not sure if the
> CIDR notation stuff made it over to the v1.5 branch or not,
> but it is available from the nightly SVN trunk tarballs:
> http://www.open-mpi.org/nightly/trunk/), here's a few points
> from other mails in this thread...
I'll try this out sometime next week.
> 1. Gus is correct that OMPI is complaining that bge1 doesn't
> exist on all nodes. The MCA parameters that you pass on the
> command line get shipped to *all* MPI processes, and therefore
> generally need to work on all of them. If you have per-host
> MCA parameter values, you can set them a few different ways:
> - have a per-host MCA param file, usually in
> - have your shell startup files intelligently determine which
> host you're on and set the corresponding MCA environment variable
> as appropriate (e.g., on the head node, set the env variable
> OMPI_MCA_btl_tcp_if_include to bge1, and set it to bge0 on the others)
> Those are a little klunky, but having a heterogeneous setup like this
> is not common, so we haven't really optimized the ability to set
> different MCA params on different servers.
There is no compelling reason for me to keep bge0 on the 10.208.
subnet and bge1 on the 192.168 subnet on node10. If I switch
the two, so all bge0 NICs are on 192.168., then I suppose
that --mca btl_tcp_if_include bge0 should work. I'll try
this next week, if I can kick everyone off the cluster for
a few minutes.
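
For reference, the shell-startup approach quoted above could be sketched
roughly as follows. This is only an illustration, not something from the
Open MPI docs: it assumes node10 is the head node, and the helper names
(tcp_if_for_host, mca_env) are made up for the example. The only real
names are the OMPI_MCA_btl_tcp_if_include variable and the bge0/bge1
interfaces mentioned in this thread.

```python
import socket

# Assumption for illustration: node10 is the head node and should use bge1;
# every other node uses bge0.
HEAD_NODE = "node10"


def tcp_if_for_host(hostname, head_node=HEAD_NODE):
    """Return the interface name to use on the given host (hypothetical helper)."""
    # Compare on the short host name so "node10.example.org" matches "node10".
    short = hostname.split(".", 1)[0]
    return "bge1" if short == head_node else "bge0"


def mca_env(hostname=None):
    """Build the MCA environment variable setting for this host (hypothetical helper)."""
    hostname = hostname or socket.gethostname()
    return {"OMPI_MCA_btl_tcp_if_include": tcp_if_for_host(hostname)}


if __name__ == "__main__":
    # Print an sh-style export line that a shell startup file could eval.
    for name, value in mca_env().items():
        print(f"export {name}={value}")
```

A shell startup file could then eval the printed line on each node. Plain
shell with a case statement on the host name would do the same job; the
point is only that each host sets its own value before mpirun starts.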
> 2. I am curious to figure out why the automatic reachability
> computation isn't working for you. Unfortunately, the code
> to compute the reachability is pretty gnarly. :-\ The code
> to find the IP interfaces on your machines is in opal/util/if.c.
> That *should* be working -- there's *BSD-specific code in there
> that has been verified by others in the past... but who knows?
> Perhaps it has bit-rotted...?
I'm running a Feb 2011 version of the bleeding edge FreeBSD,
which will become FreeBSD 9.0 in a few months. Perhaps
something has changed in FreeBSD's networking code. I'll
see if I can understand opal/util/if.c sufficiently to see
> The code to take these IP interfaces
> and figure out if a given peer is reachable is in
> This requires a little explanation...
(snip to keep this short)
> This was a long explanation -- I hope it helps...
> Is there any chance you could dig into this to see what's going on?
Thanks, I'll see what I can ferret out of the system.
> We unfortunately don't have access to any BSD machines to test this
> on, ourselves. It works on other OS's, so I'm curious as to why it
> doesn't seem to work for you. :-(
I can arrange access on the cluster in question. ;-)