Jeff hinted at the real problem in his email. Even if the program uses the
correct MPI functions, it is not 100% correct. It might pass in some
situations, but it can lead to fake "deadlocks" in others. The problem
comes from the lack of flow control. If the messages are small (which is
the case in the test example), Open MPI will send them eagerly. Without
flow control, these messages will be buffered by the receiver, which will
eventually exhaust the receiver's memory. Once this happens, some of the
messages may get dropped, but the most visible result is that progress
happens very (VERY) slowly.
Adding an MPI_Barrier every 100 iterations will solve the problem.
PS: A very similar problem was discussed on the mailing list a few days
ago. Please read that thread for a more detailed explanation, as well as
another way to solve it.
On Mar 18, 2008, at 7:48 AM, Andreas Schäfer wrote:
> OK, this is strange. I've rerun the test and got it to block,
> too. Repeated tests show the hangs are intermittent: sometimes the
> program runs smoothly without blocking, but in about 30% of the cases
> it hangs just like you said.
> On 08:11 Tue 18 Mar, Giovani Faccin wrote:
>> I'm using openmpi-1.2.5. It was installed using my distro's
>> (Gentoo) default package:
>> sys-cluster/openmpi-1.2.5 USE="fortran ipv6 -debug -heterogeneous -
>> nocxx -pbs -romio -smp -threads"
> Just like me.
> I've attached gdb to all three processes. On rank 0 I get the
> following backtrace:
> (gdb) bt
> #0 0x00002ada849b3f16 in mca_btl_sm_component_progress ()
> from /usr/lib64/openmpi/mca_btl_sm.so
> #1 0x00002ada845a71da in mca_bml_r2_progress () from /usr/lib64/
> #2 0x00002ada7e6fbbea in opal_progress () from /usr/lib64/libopen-
> #3 0x00002ada8439a9a5 in mca_pml_ob1_recv () from /usr/lib64/
> #4 0x00002ada7e2570a8 in PMPI_Recv () from /usr/lib64/libmpi.so.0
> #5 0x000000000040c9ae in MPI::Comm::Recv ()
> #6 0x0000000000409607 in main ()
> On rank 1:
> (gdb) bt
> #0 0x00002baa6869bcc0 in mca_btl_sm_send () from /usr/lib64/openmpi/
> #1 0x00002baa6808a93d in mca_pml_ob1_send_request_start_copy ()
> from /usr/lib64/openmpi/mca_pml_ob1.so
> #2 0x00002baa680855f6 in mca_pml_ob1_send () from /usr/lib64/
> #3 0x00002baa61f43182 in PMPI_Send () from /usr/lib64/libmpi.so.0
> #4 0x000000000040ca04 in MPI::Comm::Send ()
> #5 0x0000000000409700 in main ()
> On rank 2:
> (gdb) bt
> #0 0x00002b933d555ac7 in sched_yield () from /lib/libc.so.6
> #1 0x00002b9341efe775 in mca_pml_ob1_send () from /usr/lib64/
> #2 0x00002b933bdbc182 in PMPI_Send () from /usr/lib64/libmpi.so.0
> #3 0x000000000040ca04 in MPI::Comm::Send ()
> #4 0x0000000000409700 in main ()
> Anyone got a clue?
> Andreas Schäfer
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
> PGP/GPG key via keyserver