Open MPI User's Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-03-14 08:57:55


On Mar 14, 2006, at 4:42 AM, Pierre Valiron wrote:

> I am now attempting to tune openmpi-1.1a1r9260 on Solaris Opteron.

I guess I should have pointed this out more clearly earlier. Open
MPI 1.1a1 is a nightly alpha build from our development trunk. It
isn't guaranteed to be stable; about the only guarantee made is that
it passed "make distcheck" on the Linux box we use to make tarballs.

The Solaris patches have been moved over to the v1.0 release branch,
so if stability is a concern, you might want to switch back to a
nightly tarball from the v1.0 branch. We should also have another
beta of the 1.0.2 release in the near future.

> Each quad-processor node possesses two Ethernet interfaces, bge0
> and bge1. The bge0 interfaces are dedicated to parallel jobs,
> correspond to the node names pxx, and use a dedicated gigabit
> switch. The bge1 interfaces provide NFS sharing etc., correspond to
> the node names nxx, and use another gigabit switch.
>
> 1) I allocated 4 quad-processor nodes. As documented in the FAQ,
>    mpirun -np 4 -hostfile $OAR_FILE_NODES
> runs 4 tasks on the first SMP node, and
>    mpirun -np 4 -hostfile $OAR_FILE_NODES --bynode
> distributes one task to each node.
>
> 2) According to the users list, mpirun --mca pml teg should revert
> to the 2nd-generation TCP transport instead of the default ob1 (3rd
> generation). Unfortunately I get the message
>    No available pml components were found!
> Have you removed the 2nd-generation TCP transport? Do you consider
> the new ob1 competitive now?

On the development trunk, we have removed the TEG PML and all of the
PTLs. The OB1 PML provides competitive (and, most of the time,
better) performance than the TEG PML for most transports. The major
issue is that when we added one-sided communication, we used the BTL
transports directly. The BTL and PTL frameworks were not designed to
live together, which caused problems for the TEG PML.
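
If you want to be explicit about which PML gets used, you can also
ask for ob1 directly, e.g. (./your_app below is just a placeholder
for your own executable):

   # explicitly select the ob1 PML; ./your_app is a placeholder
   mpirun -np 4 -hostfile $OAR_FILE_NODES --mca pml ob1 ./your_app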

> 3) According to the users list, tuned collective primitives are
> available. Apparently they are now compiled by default, but they
> don't seem functional at all:
>
> mpirun --mca coll tuned
> Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
> Failing at addr:0
> *** End of error message ***

Tuned collectives are available, but not as heavily tested as the
basic collectives. Do you have a test case in particular that causes
problems?
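
It would also be interesting to know whether the crash goes away
when the coll framework is not restricted to the tuned component
alone; the component list below is a guess on my part, so please
check ompi_info for what is actually built on your install:

   # component list is an assumption -- verify with ompi_info
   mpirun -np 4 -hostfile $OAR_FILE_NODES \
       --mca coll tuned,basic,self ./your_app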

> 4) According to the FAQ and to the users list, Open MPI attempts to
> discover and use all interfaces. I attempted to force the use of
> bge0 only, with no success.
>
> mpirun --mca btl_tcp_if_exclude bge1
> [n33:04784] *** An error occurred in MPI_Barrier
> [n33:04784] *** on communicator MPI_COMM_WORLD
> [n33:04784] *** MPI_ERR_INTERN: internal error
> [n33:04784] *** MPI_ERRORS_ARE_FATAL (goodbye)
> 1 process killed (possibly by Open MPI)

That definitely shouldn't happen. Can you reconfigure and recompile
with the option --enable-debug, then run with the added option --mca
btl_base_debug 2 and send us the output you see? That might help in
diagnosing the problem.
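
For example (keep whatever configure options you used originally and
add --enable-debug; ./your_app is a placeholder for your program):

   ./configure --enable-debug [...your original configure options...]
   make all install
   # btl_base_debug 2 turns on extra BTL debugging output
   mpirun -np 4 -hostfile $OAR_FILE_NODES \
       --mca btl_base_debug 2 ./your_app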

> In the FAQ it is stated that a new syntax should be available soon.
> I tried to see whether it is already implemented in
> openmpi-1.1a1r9260:
>    mpirun --mca btl_tcp_if ^bge0,bge1
>    mpirun --mca btl_tcp_if ^bge1
> Both run, with identical performance.

The ^ syntax only works for specifying component names, not
interface names, so you would still need to use the
btl_tcp_if_include and btl_tcp_if_exclude options.
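
In other words, component selection and interface selection are two
separate knobs. To force everything over the TCP BTL on bge0 only,
something along these lines should do it (./your_app is a
placeholder; you may also want the sm BTL for on-node traffic):

   # restrict the BTLs to tcp (plus self, for send-to-self)
   # and pin the TCP BTL to the bge0 interface
   mpirun -np 4 -hostfile $OAR_FILE_NODES \
       --mca btl tcp,self --mca btl_tcp_if_include bge0 ./your_app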

Brian

-- 
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/