Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-08-30 11:24:01


Greetings!

We actually had some problems in some of our collectives with some
optimizations that were added in the last month or so, and we just
noticed/corrected them yesterday. It looks like your tarball is about
a week old -- you might want to update to a newer one. Last night's
tarball should include all the fixes that we made yesterday; I'm
artificially making another one right now that includes some fixes
from this morning.

Thanks for your patience; we're actually getting pretty close to
stable, but aren't quite there yet...

On Aug 30, 2005, at 6:01 AM, Joachim Worringen wrote:

>
> Dear *,
>
> I'm currently testing Open MPI 1.0a1r7026 on a 32-node dual-Athlon
> Linux 2.6.6 cluster with Myrinet (GM 2.1.1 on M3M-PCI64C boards) and
> 4 GB RAM per node; gcc is 3.3.3.
>
> Compilation from the snapshot and startup went fine, congratulations;
> that is surely not trivial.
>
> Point-to-point tests (mpptest) pass. However, running a rather simple
> benchmark to test the performance of collective operations (not PMB,
> but a custom one) seems to deadlock. So far, I could figure out the
> following (a minimal sketch of the call pattern follows below):
> - using btl 'gm' (default):
>   o 16 processes on 8 nodes: "deadlock" in Allreduce
>   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
> - explicitly using btl 'tcp':
>   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
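>
> For reference, a minimal sketch of the call pattern our benchmark
> exercises (simplified; message sizes, the setup, and the timing loop
> are placeholders, not our actual benchmark code):
>
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, size, i, count = 4096;  /* placeholder message size */
>       double *sbuf, *rbuf;
>       int *rcounts;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       sbuf = malloc(count * sizeof(double));
>       rbuf = malloc(count * sizeof(double));
>       rcounts = malloc(size * sizeof(int));
>       for (i = 0; i < count; i++) sbuf[i] = (double)rank;
>       for (i = 0; i < size; i++) rcounts[i] = count / size;
>
>       /* hangs with 16 processes on 8 nodes over gm */
>       MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM,
>                     MPI_COMM_WORLD);
>
>       /* hangs with 2 processes on 2 nodes over gm and tcp */
>       MPI_Reduce_scatter(sbuf, rbuf, rcounts, MPI_DOUBLE, MPI_SUM,
>                          MPI_COMM_WORLD);
>
>       MPI_Finalize();
>       return 0;
>   }
>
> To force tcp, we run something like "mpirun --mca btl tcp,self -np 2
> ./collmeas_open-mpi" (assuming I have the MCA parameter syntax right).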
>
> Additionally, I sporadically receive SEGVs when using gm:
> Core was generated by `collmeas_open-mpi'.
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0 0x00000000 in ?? ()
> #1 0x4006d04c in mca_mpool_base_registration_destructor () from
> /home/joachim/local/open-mpi/lib/libmpi.so.0
> #2 0x40179a0c in mca_mpool_gm_free () from
> /home/joachim/local/open-mpi//lib/openmpi/mca_mpool_gm.so
> #3 0x4006cf9c in mca_mpool_base_free () from
> /home/joachim/local/open-mpi/lib/libmpi.so.0
> #4 0x4004efbc in PMPI_Free_mem () from
> /home/joachim/local/open-mpi/lib/libmpi.so.0
> #5 0x0804b1c9 in main ()
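>
> For context, the benchmark allocates its message buffers with
> MPI_Alloc_mem and releases them with MPI_Free_mem, roughly like this
> (a sketch; the actual sizes differ):
>
>   void *buf;
>   MPI_Alloc_mem(4096 * sizeof(double), MPI_INFO_NULL, &buf);
>   /* ... collectives using buf ... */
>   MPI_Free_mem(buf);   /* frame #4 above: PMPI_Free_mem */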
>
> Sometimes, this seems to happen when aborting an application (via
> CTRL-C to mpirun):
> Core was generated by `collmeas_open-mpi'.
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0 0x401d0633 in mca_btl_tcp_proc_remove () from
> /home/joachim/local/open-mpi//lib/openmpi/mca_btl_tcp.so
> Cannot access memory at address 0xbfffe2bc
>
> Of course, I'm not sure if the deadlock really is a deadlock, but the
> respective tests take way too much time. Needless to say, other MPI
> implementations (MPICH-GM, our own MPI) run this benchmark, which we
> have been using for some time on a variety of platforms, reliably on
> the same machine.
>
> Any ideas or comments? I will try to run PMB.
>
> Joachim
>
> --
> Joachim Worringen - NEC C&C research lab St.Augustin
> fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/