Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-03-22 07:33:05


Is this a TCP-based cluster?

If so, do you have multiple IP addresses on your frontend machine?
Check out these two FAQ entries to see if they help:

http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
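
For example, the second entry describes restricting the TCP BTL to
specific interfaces. Assuming the cluster-private network on your
frontend is eth0 (that interface name is only an example; substitute
whatever your private network actually uses), that would look
something like:

   mpirun --mca btl_tcp_if_include eth0 -np 3 \
       -H frontend,compute-0-0,compute-0-1 ./test1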

On Mar 21, 2007, at 4:51 PM, tim gunter wrote:

> I am experiencing some issues with Open MPI 1.2 running on a Rocks
> 4.2.1 cluster (the issues also appear to occur with Open MPI 1.1.5
> and 1.1.4).
>
> When I run my program with the frontend in the list of nodes, the
> processes deadlock.
>
> When I run my program without the frontend in the list of nodes,
> they run to completion.
>
> The simplest test program that does this (test1.c) does an
> "MPI_Init", followed by an "MPI_Barrier", and an "MPI_Finalize".
>
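(The attached test1.c is not reproduced in this message; a minimal
sketch along these lines, reconstructed only from the description
above, would produce the output shown below. The hostname printout is
an assumption made to match that output.)

   /* sketch of test1.c: init, barrier, finalize, report the result */
   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       char host[MPI_MAX_PROCESSOR_NAME];
       int len, ret;

       MPI_Init(&argc, &argv);
       MPI_Get_processor_name(host, &len);
       ret = MPI_Barrier(MPI_COMM_WORLD);
       printf("host:%s made it past the barrier, ret:%d\n", host, ret);
       MPI_Finalize();
       return 0;
   }
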
> So the following deadlocks:
>
> /users/gunter $ mpirun -np 3 -H
> frontend,compute-0-0,compute-0-1 ./test1
> host:compute-0-1.local made it past the barrier, ret:0
> mpirun: killing job...
>
> mpirun noticed that job rank 0 with PID 15384 on node frontend
> exited on signal 15 (Terminated).
> 2 additional processes aborted (not shown)
>
> This runs to completion:
>
> /users/gunter $ mpirun -np 3 -H
> compute-0-0,compute-0-1,compute-0-2 ./test1
> host:compute-0-1.local made it past the barrier, ret:0
> host:compute-0-0.local made it past the barrier, ret:0
> host:compute-0-2.local made it past the barrier, ret:0
>
> If I have the compute nodes send a message to the frontend prior to
> the barrier, it runs to completion:
>
> /users/gunter $ mpirun -np 3 -H
> frontend,compute-0-0,compute-0-1 ./test2 0
> host: frontend.domain node: 0 is the master
> host: compute-0-0.local node: 1 sent: 1 to: 0
> host: compute-0-1.local node: 2 sent: 2 to: 0
> host: frontend.domain node: 0 recv: 1 from: 1
> host: frontend.domain node: 0 recv: 2 from: 2
> host: frontend.domain made it past the barrier, ret:0
> host: compute-0-1.local made it past the barrier, ret:0
> host: compute-0-0.local made it past the barrier, ret:0
>
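(Likewise, a hypothetical sketch of test2.c, based only on the
behavior described above: the master rank is taken from the command
line, every other rank sends its rank number to the master before the
barrier, and the master receives one message per sender. This is not
the actual attachment.)

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       char host[MPI_MAX_PROCESSOR_NAME];
       int len, rank, size, master, i, val, ret;
       MPI_Status status;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Get_processor_name(host, &len);
       master = (argc > 1) ? atoi(argv[1]) : 0;

       if (rank == master) {
           printf("host: %s node: %d is the master\n", host, rank);
           /* receive one message from every other rank */
           for (i = 0; i < size - 1; ++i) {
               MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                        MPI_COMM_WORLD, &status);
               printf("host: %s node: %d recv: %d from: %d\n",
                      host, rank, val, status.MPI_SOURCE);
           }
       } else {
           MPI_Send(&rank, 1, MPI_INT, master, 0, MPI_COMM_WORLD);
           printf("host: %s node: %d sent: %d to: %d\n",
                  host, rank, rank, master);
       }

       ret = MPI_Barrier(MPI_COMM_WORLD);
       printf("host: %s made it past the barrier, ret:%d\n", host, ret);
       MPI_Finalize();
       return 0;
   }
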
> If I have a different node function as the master, it deadlocks:
>
> /users/gunter $ mpirun -np 3 -H
> frontend,compute-0-0,compute-0-1 ./test2 1
> host: compute-0-0.local node: 1 is the master
> host: compute-0-1.local node: 2 sent: 2 to: 1
> mpirun: killing job...
>
> mpirun noticed that job rank 0 with PID 15411 on node frontend
> exited on signal 15 (Terminated).
> 2 additional processes aborted (not shown)
>
> How is it that, in the first example, one node makes it past the
> barrier and the rest deadlock?
>
> These programs both run to completion on two other MPI
> implementations.
>
> Is there something misconfigured on my cluster, or is this
> potentially an Open MPI bug?
>
> What is the best way to debug this?
>
> Any help would be appreciated!
>
> --tim
> <test1.c>
> <test2.c>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems