
Open MPI User's Mailing List Archives


From: Joachim Worringen (joachim_at_[hidden])
Date: 2005-08-30 05:01:43


Dear *,

I'm currently testing OpenMPI 1.0a1r7026 on a Linux 2.6.6 32-node Dual-Athlon
cluster with Myrinet (GM 2.1.1 on M3M-PCI64C boards). gcc is 3.3.3. 4GB RAM per
node.

Compilation from the snapshot and startup went fine, congratulations. Surely not
trivial.

Point-to-point tests (mpptest) pass. However, running a rather simple benchmark
to test the performance of collective operations (not PMB, but a custom one)
appears to deadlock. So far, I have determined:
- using btl 'gm' (default)
   o 16 processes on 8 nodes: "deadlock" in Allreduce
   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
- explicitly using btl 'tcp'
   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
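For reference, the core of the hanging pattern is roughly the following. This is
a minimal hypothetical reproducer, not the actual benchmark; the buffer size and
the per-rank counts are made-up values:

```c
/* Minimal sketch of the collective pattern that appears to hang.
 * Hypothetical reproducer; count and rcounts are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 4096;              /* assumed message size */
    double *sbuf, *rbuf;
    int *rcounts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(count * sizeof(double));
    rbuf = malloc(count * sizeof(double));
    rcounts = malloc(size * sizeof(int));
    for (i = 0; i < count; i++)
        sbuf[i] = (double)rank;
    for (i = 0; i < size; i++)
        rcounts[i] = count / size;       /* assumes size divides count */

    /* hangs with 16 processes on 8 nodes over gm in my runs */
    MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* hangs with 2 processes on 2 nodes over both gm and tcp */
    MPI_Reduce_scatter(sbuf, rbuf, rcounts, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD);

    if (rank == 0)
        printf("collectives completed\n");

    free(sbuf);
    free(rbuf);
    free(rcounts);
    MPI_Finalize();
    return 0;
}
```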

Additionally, I sporadically receive SEGVs using gm:
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x4006d04c in mca_mpool_base_registration_destructor () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#2 0x40179a0c in mca_mpool_gm_free () from
/home/joachim/local/open-mpi//lib/openmpi/mca_mpool_gm.so
#3 0x4006cf9c in mca_mpool_base_free () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#4 0x4004efbc in PMPI_Free_mem () from /home/joachim/local/open-mpi/lib/libmpi.so.0
#5 0x0804b1c9 in main ()
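As the backtrace shows, the crash happens when the benchmark releases a buffer
via the MPI-2 memory allocation interface. The pattern in question is simply the
following (a sketch; the buffer size here is an assumption, not the benchmark's
actual value):

```c
/* Sketch of the allocation pattern behind the backtrace above;
 * the size is an assumed value. */
#include <mpi.h>

void alloc_free_cycle(void)
{
    void *buf;
    MPI_Aint len = 1 << 20;   /* assumed 1 MB buffer */

    MPI_Alloc_mem(len, MPI_INFO_NULL, &buf);
    /* ... use buf as a collective buffer ... */
    MPI_Free_mem(buf);        /* SEGV here, inside mca_mpool_gm_free */
}
```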

Sometimes, this seems to happen when aborting an application (via CTRL-C to mpirun):
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x401d0633 in mca_btl_tcp_proc_remove () from
/home/joachim/local/open-mpi//lib/openmpi/mca_btl_tcp.so
Cannot access memory at address 0xbfffe2bc

Of course, I'm not sure whether the deadlock really is a deadlock, but the
respective test takes far too much time. Needless to say, other MPI
implementations (MPICH-GM, our own MPI) run this benchmark, which we have been
using for some time on a variety of platforms, reliably on the same machine.

Any ideas or comments? I will try to run PMB.

  Joachim

-- 
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de