
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI hangs on multiple nodes
From: Ole Nielsen (ole.moller.nielsen_at_[hidden])
Date: 2011-09-19 04:39:00


Further to the posting below, I can report that the test program (attached,
this time correctly) is chewing up CPU time on both compute nodes for as
long as I care to let it run.
It would appear that the processes are stuck in MPI_Recv, which is the next
call after the print statements in the test program.

Has anyone else seen this behavior, or can anyone give me a hint on how to
troubleshoot it?
Cheers and thanks
Ole Nielsen

Output:
nielso_at_alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts --host
node17,node18 --npernode 2 a.out
Number of processes = 4
Test repeated 3 times for reliability
I am process 2 on node node18
P2: Waiting to receive from to P1
I am process 0 on node node17
Run 1 of 3
P0: Sending to P1
I am process 1 on node node17
P1: Waiting to receive from to P0
I am process 3 on node node18
P3: Waiting to receive from to P2
P0: Waiting to receive from P3
P1: Sending to to P2
P1: Waiting to receive from to P0
P2: Sending to to P3
P0: Received from to P3
Run 2 of 3
P0: Sending to P1
P3: Sending to to P0
P3: Waiting to receive from to P2
P2: Waiting to receive from to P1
P1: Sending to to P2
P0: Waiting to receive from P3
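
The output above is consistent with a ring-style test: P0 sends to P1, each
rank forwards to the next, and the last rank sends back to P0, repeated three
times. The following is a minimal sketch of such a test using blocking
MPI_Send/MPI_Recv; it is an assumption reconstructed from the output, and the
actual attached program may differ:

```c
/* Sketch of a ring test inferred from the log output above (assumption:
 * the real attachment may differ). Each rank receives from rank-1 and
 * sends to rank+1 (mod size); rank 0 starts the ring. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int run = 1; run <= 3; run++) {
        if (rank == 0) {
            printf("Run %d of 3\n", run);
            printf("P0: Sending to P1\n");
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", size - 1);
            MPI_Recv(&buf, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            /* Blocking receive from the previous rank, then forward. */
            MPI_Recv(&buf, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, (rank + 1) % size, 0,
                     MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Logically this ring cannot deadlock (only one message is in flight at a
time), which is why a hang across nodes points at the transport layer rather
than the communication pattern itself.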

On Mon, Sep 19, 2011 at 11:04 AM, Ole Nielsen
<ole.moller.nielsen_at_[hidden]>wrote:

>
> Hi all
>
> We have been using Open MPI for many years with Ubuntu on our 20-node
> cluster. Each node has two quad-core CPUs, so we usually run up to 8
> processes on each node, for a maximum of 160 processes in total.
>
> However, we just upgraded the cluster to Ubuntu 11.04 with Open MPI 1.4.3
> and have come across a strange behavior: MPI programs run perfectly well
> when confined to one node but hang during communication across multiple
> nodes. We have no idea why and would like some help in debugging this. A
> small MPI test program is attached and typical output is shown below.
>
> Hope someone can help us
> Cheers and thanks
> Ole Nielsen
>
> -------------------- Test output across two nodes (This one hangs)
> --------------------------
> nielso_at_alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts
> --host node17,node18 --npernode 2 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 2 on node node18
> P2: Waiting to receive from to P1
> I am process 3 on node node18
> P3: Waiting to receive from to P2
> P1: Sending to to P2
>
>
> -------------------- Test output within one node (This one is OK)
> --------------------------
> nielso_at_alamba:~/sandpit/pypar/source$ mpirun --hostfile /etc/mpihosts
> --host node17 --npernode 4 a.out
> Number of processes = 4
> Test repeated 3 times for reliability
> I am process 2 on node node17
> P2: Waiting to receive from to P1
> I am process 0 on node node17
> Run 1 of 3
> P0: Sending to P1
> I am process 1 on node node17
> P1: Waiting to receive from to P0
> I am process 3 on node node17
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P2: Waiting to receive from to P1
> P3: Sending to to P0
> P0: Received from to P3
> Run 2 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Waiting to receive from to P0
> P3: Sending to to P0
> P2: Waiting to receive from to P1
> P0: Received from to P3
> Run 3 of 3
> P0: Sending to P1
> P3: Waiting to receive from to P2
> P1: Sending to to P2
> P2: Sending to to P3
> P1: Done
> P2: Done
> P3: Sending to to P0
> P0: Received from to P3
> P0: Done
> P3: Done
>
>
>
>