Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] no reaction of remote hosts after ompi reinstall [follow up]
From: jody (jody.xha_at_[hidden])
Date: 2008-06-10 08:58:26


Interestingly, I can start mpirun from any of the remote machines
and run processes on other remote machines and on the local machine.
But from the local machine I cannot start a process on a remote machine;
it just shows the behavior detailed in the previous mail.

remote1 -> remote1 ok
remote1 -> remote2 ok
remote1 -> local ok

remote2 -> remote1 ok
remote2 -> remote2 ok
remote2 -> local ok

local -> local ok
local -> remote1 fails
local -> remote2 fails
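
A table like this can be produced mechanically. The following is a minimal sketch, not the procedure used for the results above. It assumes passwordless ssh to every host name passed in (including the local one), Open MPI installed under /opt/openmpi as in the quoted mail, and the coreutils `timeout` command to turn a hang into a failure. The name `check_pair` and the 30-second limit are illustrative only.

```shell
#!/bin/sh
# Sketch: test every (launcher -> target) combination of the given hosts.
# Assumptions: passwordless ssh to each host, Open MPI under /opt/openmpi,
# and coreutils `timeout` to convert a hang into a failure.

MPIRUN=/opt/openmpi/bin/mpirun   # install prefix taken from the quoted mail
LIMIT=30                         # seconds before a hang counts as a failure

# Log in to $1 and start one process on $2; `hostname` is a trivial payload.
check_pair() {
    if timeout "$LIMIT" ssh -n "$1" "$MPIRUN -np 1 --host $2 hostname" \
            >/dev/null 2>&1; then
        echo "$1 -> $2 ok"
    else
        echo "$1 -> $2 fails"
    fi
}

# Run the full matrix only when host names are given on the command line.
for src in "$@"; do
    for dst in "$@"; do
        check_pair "$src" "$dst"
    done
done
```

Saved under a hypothetical name such as `matrix.sh`, it would be called as `sh matrix.sh local remote1 remote2`. Note that when `timeout` kills the ssh session, processes started on the launching host may be left behind and need to be cleaned up by hand.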

My remote machines are freshly updated Gentoo machines (AMD),
my local machine is a freshly installed Fedora 8 (Intel Quadro).
All use a freshly installed Open MPI 1.2.5.
Before my Fedora machine crashed it ran Fedora 6,
and everything worked great (with 1.2.2 on all machines).

Does anybody have a suggestion where I should look?

Thanks
  Jody

On Tue, Jun 10, 2008 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
> Hi
> After a crash I reinstalled Open MPI 1.2.5 on my machines,
> configured with
> ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
> and set PATH and LD_LIBRARY_PATH in .bashrc:
> PATH=/opt/openmpi/bin:$PATH
> export PATH
> LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH
>
> First problem: the output of
> ssh nano_00 printenv
> does not contain the correct PATH (and no LD_LIBRARY_PATH at all),
> but with a normal ssh login both are set correctly.
>
> When I run a test application on one computer, it works.
>
> As soon as an additional computer is involved, there is no more output,
> and everything just hangs.
>
> Adding the prefix doesn't change anything, even though Open MPI is
> installed in the same directory (/opt/openmpi) on every computer.
>
> The --debug-daemons output doesn't help very much:
>
> $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
> Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch
>
> (and then nothing more happens)
>
> On the remote host, I see the following three processes coming up
> after I start mpirun on the local machine:
> 30603 ? S 0:00 sshd: jody@notty
> 30604 ? Ss 0:00 bash -c PATH=/opt/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
> 30605 ? S 0:00 /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename nano_00 --universe jody_at_[hidden]:default-universe-14934 --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl
>
> So it looks as if the correct paths are set (probably the doing of
> --enable-mpirun-prefix-by-default).
>
> If I interrupt mpirun on the local machine (Ctrl-C):
>
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start as expected.
> [aim-plankton:14982] ERROR: There may be more information available from
> [aim-plankton:14982] ERROR: the remote shell (see above).
> [aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated. You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
> [aim-plankton:14983] OOB: Connection to HNP lost
>
> On the remote machine, the "sshd: jody@notty" process is gone, but the
> other two stay.
> I would be grateful for any suggestions!
>
> Jody
>
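
The first problem in the quoted mail (PATH and LD_LIBRARY_PATH missing from `ssh nano_00 printenv`) can be checked per host with a small helper. This is only a sketch under assumptions: `path_contains` and `check_remote_env` are hypothetical helpers, not part of Open MPI, and the directories are the ones named in the quoted mail.

```shell
#!/bin/sh
# Return 0 if directory $1 is an entry of the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the non-interactive environment on host $1 contains the
# Open MPI directories from the quoted mail (assumed prefix: /opt/openmpi).
check_remote_env() {
    remote_path=$(ssh -n "$1" 'printf %s "$PATH"')
    remote_lib=$(ssh -n "$1" 'printf %s "$LD_LIBRARY_PATH"')
    path_contains /opt/openmpi/bin "$remote_path" \
        || echo "$1: /opt/openmpi/bin missing from PATH"
    path_contains /opt/openmpi/lib "$remote_lib" \
        || echo "$1: /opt/openmpi/lib missing from LD_LIBRARY_PATH"
}

# Check each host named on the command line.
for host in "$@"; do
    check_remote_env "$host"
done
```

If the directories are missing only in the non-interactive case, one place to look is a line near the top of ~/.bashrc on the remote host that returns early when the shell is not interactive, as some distributions' default .bashrc does. Whether that applies here is not confirmed by anything in this thread; moving the PATH and LD_LIBRARY_PATH assignments above such a line would be one way to test it.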