Open MPI Development Mailing List Archives

Subject: [OMPI devel] Leopard problems
From: Greg Watson (g.watson_at_[hidden])
Date: 2008-02-11 15:39:25


Hi,

Since I upgraded to Mac OS X 10.5.1, I've been having problems running
MPI programs (using both 1.2.4 and 1.2.5). The symptoms are
intermittent (i.e., sometimes the application runs fine), and appear as
follows:

1. One or more of the application processes die (I've seen both one and
two processes die).

2. It appears that the orteds associated with these application
processes then spin continually.

Here is what ps shows while "mpirun -np 4 ./mpitest" is running:

12467 ?? Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node0 --universe greg_at_Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12468 ?? Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node1 --universe greg_at_Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12469 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node2 --universe greg_at_Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12470 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node3 --universe greg_at_Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12471 ?? S 0:00.05 ./mpitest
12472 ?? S 0:00.05 ./mpitest

Killing the mpirun results in:

$ mpirun -np 4 ./mpitest
^Cmpirun: killing job...

^C
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed). Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
^Cmpirun: forcibly killing job...
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------

At this point, the two spinning orteds are left running, and the only
way to kill them is with kill -9.
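
For anyone who needs to clean up after a run like this, here is a minimal
sketch of how the leftover daemons could be found and killed. It is not part
of Open MPI. The helper names (parse_cpu_seconds, find_spinning_orteds,
kill_spinning_orteds) and the 10-second CPU threshold are assumptions made for
illustration, and it assumes ps prints columns in the order PID, TTY, STAT,
TIME, COMMAND with TIME in the colon-separated form shown in the listing above.

```python
import os
import signal
import subprocess


def parse_cpu_seconds(field):
    """Convert a ps TIME field such as '1:26.52' to seconds.

    Assumes colon-separated parts only (M:SS.ss or H:MM:SS); a leading
    'days-' prefix, as printed by some ps implementations, is not handled.
    """
    seconds = 0.0
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def find_spinning_orteds(ps_lines, min_cpu_seconds=10.0):
    """Return PIDs of orted processes that are runnable and CPU-heavy."""
    pids = []
    for line in ps_lines:
        fields = line.split()
        # Assumed layout: PID TTY STAT TIME COMMAND [ARGS...]
        # Header lines and short lines are skipped.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, stat, cpu, command = int(fields[0]), fields[2], fields[3], fields[4]
        if os.path.basename(command) != "orted":
            continue
        # 'R' means runnable; a daemon that should be idle but stays in 'R'
        # with a large accumulated CPU time is treated as spinning.
        if stat.startswith("R") and parse_cpu_seconds(cpu) >= min_cpu_seconds:
            pids.append(pid)
    return pids


def kill_spinning_orteds(min_cpu_seconds=10.0):
    """Send SIGKILL (the equivalent of kill -9) to each spinning orted."""
    result = subprocess.run(
        ["ps", "-x", "-o", "pid,tty,stat,time,command"],
        capture_output=True, text=True, check=True,
    )
    pids = find_spinning_orteds(result.stdout.splitlines(), min_cpu_seconds)
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
    return pids
```

Calling find_spinning_orteds() on the ps output first, and only then
kill_spinning_orteds(), gives a chance to check the PID list before anything
is killed. The threshold is a guess; the idle orteds above show 0:00.04 of CPU
time while the spinning ones show well over a minute.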

Is anyone else seeing this problem?

Greg