I've been running a number of benchmarks & tests with OpenMPI 1.0rc4.
I've run into a few issues that I believe are related to OpenMPI; if they
aren't, I'd appreciate the education. :)
The attached tarball does not include the MPICH variant results (the
tarball is 87 KB as it is).
I can run the same tests with MVAPICH, MPICH-GM, and MPICH-MX with no
problems. The benchmarks were built from source RPMs (that I maintain),
so I can say the build procedure for the benchmarks is essentially
identical from one MPI to another.
A short summary:
* Identical hardware, except for the interconnect.
* Linux, SLES 9 SP2, kernel 2.6.5-7.201-smp (SLES binary)
* Opteron 248s, two CPUs per node, 4 GB per node.
* Four nodes in every test run.
I used the following interconnects/drivers:
* Myrinet (GM 2.0.22 and MX 1.0.3)
* InfiniBand (Mellanox "IB Gold" 1.8)
And the following benchmarks/tests:
* HPC Challenge (v1.0)
* HPL (v1.0)
* Intel MPI Benchmark (IMB, formerly PALLAS) v2.3
* Presta MPI Benchmarks
Quick summary of results:
HPC Challenge:
* Never completed an entire run on any interconnect
- MVAPI came close; it crashed after the HPL section.
- Error messages:
[n60:21912] *** An error occurred in MPI_Reduce
[n60:21912] *** on communicator MPI_COMM_WORLD
[n60:21912] *** MPI_ERR_OP: invalid reduce operation
- GM wedges itself in the HPL section
- MX crashes during the PTRANS test (the first test performed)
(See earlier thread on this list about OpenMPI wedging itself; I did apply
that workaround).
HPL:
* Only completes with one interconnect:
- MVAPI mca btl works fine.
- GM wedges itself, similar to HPCC
- MX gives an error: MX: assertion: <<not yet implemented>> failed at
line 281, file ../mx__shmem.c
IMB:
* Only completes with one interconnect:
- MVAPI mca btl works fine.
- GM fails, though the portion of the benchmark where it gets stuck
varies.
- MX fails with the error listed in the HPL section, as well as:
"mx_connect fail for 0th remote address key deadbeef (error Operation
timed-out)"
Presta:
* Completes with varying degrees of success
- MVAPI: Completes successfully, but the 'all reduction' test is 173 times
slower than the same test on GM, and 360 times slower than on MX.
- GM: Does not complete the 'com' test; it stops at the same point every
time (this is included in my logs).
- MX: Completes successfully, but I do receive the "mx_connect fail for
0th remote address key deadbeef (error Operation timed-out)" message.
I hope I've provided enough information to be useful; if not, just ask and
I'll help out as much as I can.