Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Leopard problems
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-02-11 22:33:46


There is a known problem with all versions of Open MPI on Leopard. We
haven't had time to chase it down yet, and probably won't for a few more
weeks.

Ralph

On 2/11/08 1:39 PM, "Greg Watson" <g.watson_at_[hidden]> wrote:

> Hi,
>
> Since I upgraded to Mac OS X 10.5.1, I've been having problems running
> MPI programs (with both Open MPI 1.2.4 and 1.2.5). The symptoms are
> intermittent (i.e. sometimes the application runs fine), and appear as
> follows:
>
> 1. One or more of the application processes die (I've seen both one and
> two processes die).
>
> 2. It appears that the orteds associated with these application
> processes then spin continually.
>
> Here is what I see when I run "mpirun -np 4 ./mpitest":
>
> 12467 ?? Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 --num_procs 5
>     --vpid_start 0 --nodename node0
>     --universe greg_at_Jarrah.local:default-universe-12462
>     --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --set-sid
> 12468 ?? Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 --num_procs 5
>     --vpid_start 0 --nodename node1
>     --universe greg_at_Jarrah.local:default-universe-12462
>     --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --set-sid
> 12469 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 --num_procs 5
>     --vpid_start 0 --nodename node2
>     --universe greg_at_Jarrah.local:default-universe-12462
>     --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --set-sid
> 12470 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 --num_procs 5
>     --vpid_start 0 --nodename node3
>     --universe greg_at_Jarrah.local:default-universe-12462
>     --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749"
>     --set-sid
> 12471 ?? S 0:00.05 ./mpitest
> 12472 ?? S 0:00.05 ./mpitest
>
> Killing the mpirun results in:
>
> $ mpirun -np 4 ./mpitest
> ^Cmpirun: killing job...
>
> ^C
> --------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
>
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed). Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> --------------------------------------------------------------------------
> ^Cmpirun: forcibly killing job...
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated. You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
>
> At this point, the two spinning orteds are left running, and the only
> way to kill them is with kill -9.
>
> Is anyone else seeing this problem?
>
> Greg
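
For anyone who wants to check for the leftover daemons described above, here is a minimal sketch (not part of Open MPI) that picks out orted processes from ps output in the same column layout as Greg's listing: PID, TTY, STAT, CPU time, command. It treats a process as "spinning" when its state starts with R and its CPU time is above a threshold. The function name, the 10-second threshold, and the assumption that CPU time is printed as M:SS.ss are all illustrative choices, not anything taken from the thread.

```python
import os
import re

# One line of ps output in the layout shown in the report:
# PID, TTY, STAT, CPU time (assumed to be M:SS.ss), then the command line.
PS_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+):(\d+\.\d+)\s+(\S.*)$")


def find_spinning_orteds(ps_text, min_cpu_seconds=10.0):
    """Return PIDs of orted processes that are runnable and CPU-heavy.

    min_cpu_seconds is a hypothetical cutoff chosen for illustration.
    """
    pids = []
    for line in ps_text.splitlines():
        match = PS_LINE.match(line)
        if match is None:
            continue  # skip headers, blank lines, or unexpected layouts
        pid, _tty, stat, minutes, seconds, command = match.groups()
        cpu_seconds = int(minutes) * 60 + float(seconds)
        program = os.path.basename(command.split()[0])
        if program == "orted" and stat.startswith("R") and cpu_seconds >= min_cpu_seconds:
            pids.append(int(pid))
    return pids


# Abbreviated sample based on the listing in the report.
SAMPLE = """\
12467 ?? Rs 1:26.52 orted --bootproxy 1 --name 0.0.1
12468 ?? Rs 1:26.63 orted --bootproxy 1 --name 0.0.2
12469 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.3
12471 ?? S 0:00.05 ./mpitest
"""

spinning = find_spinning_orteds(SAMPLE)
print(spinning)
```

Each returned PID could then be passed to os.kill(pid, signal.SIGKILL), which is the Python equivalent of the kill -9 mentioned in the report. Review the list by hand first, since the CPU-time cutoff is only a heuristic.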