Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI hangs on multiple nodes
From: Ole Nielsen (ole.moller.nielsen_at_[hidden])
Date: 2011-09-19 22:23:44


Hi all - and sorry for the multiple postings, but I have more information.

1: After a reboot of two nodes I ran the test again, and the inter-node freeze didn't
happen until the third iteration. I take that to mean that the basic
communication works, but that something is saturating. Is there some notion
of buffer size somewhere in the MPI system that could explain this?
2: The nodes have 4 Ethernet cards each. Could the mapping be a problem?
3: The CPUs are running at 100% for all processes involved in the freeze.
4: The same test program
(http://code.google.com/p/pypar/source/browse/source/mpi_test.c) works fine
when run within one node, so the problem must be with MPI and/or our network.
5: The network and ssh otherwise work fine.

Again many thanks for any hint that can get us going again. The main thing
we need is some diagnostics that may point to what causes this problem for
MPI.
Cheers
Ole Nielsen

------

Here's the output which shows the freeze in the third iteration:

nielso@alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node5,node6 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node6
P2: Waiting to receive from to P1
P2: Sending to to P3
I am process 3 on node node6
P3: Waiting to receive from to P2
I am process 1 on node node5
P1: Waiting to receive from to P0
P1: Sending to to P2
P1: Waiting to receive from to P0
I am process 0 on node node5
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P3
P2: Waiting to receive from to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P1: Sending to to P2
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P3
P1: Waiting to receive from to P0
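
For anyone reading the trace: it is consistent with a ring pass, where P0 sends to P1 and then waits on the last rank, while every other rank first waits on its predecessor and then sends to its successor. Below is a minimal sketch of that neighbour arithmetic in plain C. It makes no MPI calls and is not taken from mpi_test.c; the names ring_next, ring_prev and print_ring_schedule are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: rank that `rank` sends to in a ring of `size` processes. */
int ring_next(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Hypothetical helper: rank that `rank` receives from in the same ring. */
int ring_prev(int rank, int size)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}

/* Print the order of calls each rank makes in one run of the ring,
   as inferred from the trace. Returns the number of lines printed. */
int print_ring_schedule(int size)
{
    int lines = 0;
    for (int rank = 0; rank < size; rank++) {
        if (rank == 0) {
            /* Rank 0 starts the ring: send first, then wait for the last rank. */
            printf("P0: send to P%d, then receive from P%d\n",
                   ring_next(rank, size), ring_prev(rank, size));
        } else {
            /* All other ranks wait for their predecessor before forwarding. */
            printf("P%d: receive from P%d, then send to P%d\n",
                   rank, ring_prev(rank, size), ring_next(rank, size));
        }
        lines++;
    }
    return lines;
}
```

Calling print_ring_schedule(4) lists the same neighbours as the 4-process trace: P0 sends to P1 and receives from P3, and P3 forwards to P0. Because each hop depends on the previous one, a single message that stalls between node5 and node6 would be enough to stop the whole ring, assuming the test uses blocking sends and receives.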