Open MPI User's Mailing List Archives

From: Troy Telford (ttelford_at_[hidden])
Date: 2005-10-27 12:30:53

I've been running a number of benchmarks & tests with OpenMPI 1.0rc4.
I've run into a few issues that I believe are related to OpenMPI; if they
aren't, I'd appreciate the education. :)

The attached tarball does not have the MPICH variant results (the tarball
is 87 kb as it is)

I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no
problems. The benchmarks were built from source RPMs (which I maintain),
so I can say the build procedure for the benchmarks is essentially
identical from one MPI to another.

A short summary:
* Identical hardware, except for the interconnect.
* Linux, SLES 9 SP2, kernel 2.6.5-7.201-smp (SLES binary)
* Opteron 248s, two CPUs per node, 4 GB per node.
* Four nodes in every test run.

I used the following interconnects/drivers:
* Myrinet (GM 2.0.22 and MX 1.0.3)
* Infiniband (Mellanox "IB Gold" 1.8)

And the following benchmarks/tests:
* HPC Challenge (v1.0)
* HPL (v1.0)
* Intel MPI Benchmark (IMB, formerly PALLAS) v2.3
* Presta MPI Benchmarks

Quick summary of results:

HPC Challenge:
* Never completed an entire run on any interconnect
        - MVAPI came close; it crashed after the HPL section.
                - Error messages:
                [n60:21912] *** An error occurred in MPI_Reduce
                [n60:21912] *** on communicator MPI_COMM_WORLD
                [n60:21912] *** MPI_ERR_OP: invalid reduce operation
        - GM wedges itself in the HPL section
        - MX crashes during the PTRANS test (the first test performed)
(See the earlier thread on this list about OpenMPI wedging itself; I did
apply that workaround.)

HPL:
* Only completes with one interconnect:
        - MVAPI mca btl works fine.
        - GM wedges itself, similar to HPCC
        - MX gives an error: "MX: assertion: <<not yet implemented>> failed at
line 281, file ../mx__shmem.c"
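For reference, each run was pinned to a single interconnect by selecting the corresponding btl component. A sketch of the sort of launch line involved (the binary name and process count here are illustrative placeholders, not copied from my actual runs):

```shell
# Illustrative only: restrict Open MPI (1.0-era mca syntax) to the
# Mellanox VAPI BTL, plus 'self' for process loopback.
# Swap 'mvapi' for 'gm' or 'mx' to test the Myrinet drivers instead.
mpirun --mca btl mvapi,self -np 8 ./xhpl
```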

IMB:
* Only completes with one interconnect:
        - MVAPI mca btl works fine.
        - GM fails, but differs in which portion of the benchmark it gets stuck on
        - MX fails, offering both the error listed in the HPL section and:
                "mx_connect fail for 0th remote address key deadbeef (error Operation
timed-out)"

Presta:
* Completes with varying degrees of success
        - MVAPI: Completes successfully
                - But the 'all reduction' test is 173 times slower than the same
test on GM, and 360 times slower than with MX.
        - GM: Does not complete the 'com' test; it simply stops at the same point
every time (I have it included in my logs)
        - MX: Completes successfully, but I do receive the "mx_connect fail for
0th remote address key deadbeef (error Operation timed-out)" message.

I hope I've provided enough information to be useful; if not, just ask and
I'll help out as much as I can.