Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI hangs on multiple nodes
From: Ole Nielsen (ole.moller.nielsen_at_[hidden])
Date: 2011-09-19 00:04:38


Hi all

We have been using Open MPI for many years with Ubuntu on our 20-node
cluster. Each node has two quad-core CPUs, so we usually run up to 8
processes per node, for a maximum of 160 processes.

However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3
and have come across strange behavior: MPI programs run perfectly well
when confined to one node but hang during communication across multiple
nodes. We have no idea why and would like some help debugging this. A
small MPI test program is attached, and typical output is shown below.

Hope someone can help us
Cheers and thanks
Ole Nielsen

-------------------- Test output across two nodes (This one hangs) --------------------------
nielso_at_alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 3 on node node18
P3: Waiting to receive from to P2
P1: Sending to to P2

-------------------- Test output within one node (This one is OK) --------------------------
nielso_at_alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host node17 --npernode 4 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node17
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node17
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P2: Waiting to receive from to P1
P3: Sending to to P0
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Waiting to receive from to P0
P3: Sending to to P0
P2: Waiting to receive from to P1
P0: Received from to P3
Run 3 of 3
P0: Sending to P1
P3: Waiting to receive from to P2
P1: Sending to to P2
P2: Sending to to P3
P1: Done
P2: Done
P3: Sending to to P0
P0: Received from to P3
P0: Done
P3: Done
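
-------------------- Reading the two traces (sketch) --------------------------
The attached test program is not reproduced here, but the output suggests a
ring: P0 sends to P1, each process forwards to the next rank, and the last
rank sends back to P0. The small Python sketch below is not the test program
itself; it only models that ring and marks which hops cross a node boundary.
The function names (ring_hops, first_cross_node_hop) are made up for
illustration, and the rank placement is copied from the "I am process N on
node X" lines above.

```python
from typing import Dict, List, Optional, Tuple


def ring_hops(placement: Dict[int, str]) -> List[Tuple[int, int, bool]]:
    """Return (sender, receiver, crosses_nodes) for one pass around the ring.

    Assumes ranks are numbered 0..size-1 and each rank sends to rank+1,
    with the last rank wrapping around to rank 0.
    """
    size = len(placement)
    hops = []
    for sender in range(size):
        receiver = (sender + 1) % size
        crosses = placement[sender] != placement[receiver]
        hops.append((sender, receiver, crosses))
    return hops


def first_cross_node_hop(placement: Dict[int, str]) -> Optional[Tuple[int, int]]:
    """Return the first (sender, receiver) pair on different nodes, if any."""
    for sender, receiver, crosses in ring_hops(placement):
        if crosses:
            return (sender, receiver)
    return None


# Placement taken from the "I am process N on node X" lines in each trace.
two_nodes = {0: "node17", 1: "node17", 2: "node18", 3: "node18"}
one_node = {0: "node17", 1: "node17", 2: "node17", 3: "node17"}

if __name__ == "__main__":
    for sender, receiver, crosses in ring_hops(two_nodes):
        kind = "cross-node" if crosses else "same node"
        print(f"P{sender} -> P{receiver}: {kind}")
```

Under this placement the first cross-node hop is P1 -> P2, which matches the
last line printed in the hanging run ("P1: Sending to to P2"): the same-node
hop P0 -> P1 completed, and nothing after the first node17 -> node18 message
appears in the output.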