
Open MPI User's Mailing List Archives


From: tim gunter (tgunter_at_[hidden])
Date: 2007-03-21 16:51:09


I am experiencing some issues with Open MPI 1.2 running on a Rocks
4.2.1 cluster (the issues also appear with Open MPI 1.1.5 and 1.1.4).

When I run my program with the frontend in the list of nodes, the processes deadlock.

When I run my program without the frontend in the list of nodes, they run to
completion.

The simplest test program that reproduces this (test1.c) calls "MPI_Init",
followed by "MPI_Barrier" and "MPI_Finalize".

So the following deadlocks:

    /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test1
    host:compute-0-1.local made it past the barrier, ret:0
    mpirun: killing job...

    mpirun noticed that job rank 0 with PID 15384 on node frontend exited on signal 15 (Terminated).
    2 additional processes aborted (not shown)

This runs to completion:

    /users/gunter $ mpirun -np 3 -H compute-0-0,compute-0-1,compute-0-2 ./test1
    host:compute-0-1.local made it past the barrier, ret:0
    host:compute-0-0.local made it past the barrier, ret:0
    host:compute-0-2.local made it past the barrier, ret:0

If I have the compute nodes send a message to the frontend prior to the
barrier, it runs to completion:

    /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 0
    host: frontend.domain node: 0 is the master
    host: compute-0-0.local node: 1 sent: 1 to: 0
    host: compute-0-1.local node: 2 sent: 2 to: 0
    host: frontend.domain node: 0 recv: 1 from: 1
    host: frontend.domain node: 0 recv: 2 from: 2
    host: frontend.domain made it past the barrier, ret:0
    host: compute-0-1.local made it past the barrier, ret:0
    host: compute-0-0.local made it past the barrier, ret:0

If I have a different node function as the master, it deadlocks:

    /users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 1
    host: compute-0-0.local node: 1 is the master
    host: compute-0-1.local node: 2 sent: 2 to: 1
    mpirun: killing job...

    mpirun noticed that job rank 0 with PID 15411 on node frontend exited on signal 15 (Terminated).
    2 additional processes aborted (not shown)

How is it that, in the first example, one node makes it past the barrier while
the rest deadlock?

Both programs run to completion under two other MPI implementations.

Is there something misconfigured on my cluster, or is this potentially an
Open MPI bug?

What is the best way to debug this?

Any help would be appreciated!

--tim