
Subject: Re: [OMPI users] no reaction of remote hosts after ompi reinstall [follow up]
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:29:00


Sorry for the delay in replying.

I'd check two things:

- Disable all firewall support between these two machines. OMPI uses
random TCP ports to communicate between processes; if they're blocked,
Bad Things will happen.

- It is easiest to install OMPI in the same location on all your
machines (e.g., /opt/openmpi). If you do that, you might want to try
configuring OMPI with --enable-mpirun-prefix-by-default. In rsh/ssh
environments, this flag will have mpirun set your PATH and
LD_LIBRARY_PATH properly on remote nodes. (See the sketch below.)
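
A minimal sketch of both steps (the firewall commands are assumptions --
they depend on your distribution -- and netcat flags vary by version):

   # temporarily stop the firewall on both machines while testing
   $ sudo /etc/init.d/iptables stop

   # check that an arbitrary high TCP port gets through:
   $ nc -l -p 5000       # on the remote machine (e.g., nano_00)
   $ nc nano_00 5000     # on the local machine; typed lines should
                         # appear on the remote side

   # rebuild with the same prefix everywhere, plus the prefix flag
   $ ./configure --prefix=/opt/openmpi --enable-mpirun-prefix-by-default
   $ make all install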

Let us know how that works out.

On Jun 10, 2008, at 8:58 AM, jody wrote:

> Interestingly, I can start mpirun from any of the remote machines,
> running processes on other remote machines and on the local machine.
> But from the local machine I cannot start any process on a remote
> machine - it just shows the behavior detailed in the previous mail.
>
> remote1 -> remote1 ok
> remote1 -> remote2 ok
> remote1 -> local ok
>
> remote2 -> remote1 ok
> remote2 -> remote2 ok
> remote2 -> local ok
>
> local -> local ok
> local -> remote1 fails
> local -> remote2 fails
>
> My remote machines are freshly updated Gentoo machines (AMD),
> my local machine is a freshly installed Fedora 8 (Intel Quadro).
> All use a freshly installed Open MPI 1.2.5.
> Before my Fedora machine crashed it had Fedora 6,
> and everything worked great (with 1.2.2 on all machines).
>
> Does anybody have a suggestion where I should look?
>
> Thanks
> Jody
>
>
> On Tue, Jun 10, 2008 at 12:59 PM, jody <jody.xha_at_[hidden]> wrote:
>> Hi
>> after a crash I reinstalled Open MPI 1.2.5 on my machines,
>> using
>> ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
>> and set PATH and LD_LIBRARY_PATH in .bashrc:
>> PATH=/opt/openmpi/bin:$PATH
>> export PATH
>> LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
>> export LD_LIBRARY_PATH
>>
>> First problem:
>> ssh nano_00 printenv
>> does not show the correct PATH (and no LD_LIBRARY_PATH at all),
>> but with a normal ssh login both are set correctly.
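>>
>> (A sketch of one common cause, assuming bash: many distributions'
>> ~/.bashrc returns early for non-interactive shells, so anything set
>> after that check is never seen by "ssh host command". Moving the
>> exports to the very top of ~/.bashrc may help:)
>>
>> # at the top of ~/.bashrc, before any early return for
>> # non-interactive shells, such as: [ -z "$PS1" ] && return
>> PATH=/opt/openmpi/bin:$PATH
>> export PATH
>> LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
>> export LD_LIBRARY_PATH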
>>
>> When I run a test application on one computer, it works.
>>
>> As soon as an additional computer is involved, there is no more
>> output and everything just hangs.
>>
>> Adding the prefix doesn't change anything, even though Open MPI is
>> installed in the same directory (/opt/openmpi) on every computer.
>>
>> The --debug-daemons option doesn't help very much:
>>
>> $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
>> Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch
>>
>> (and nothing happens anymore)
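>>
>> (More detail could presumably be coaxed out with mpirun's
>> developer-debug flag; a sketch, assuming 1.2.x accepts -d:)
>>
>> $ mpirun -d -np 4 --hostfile testhosts --debug-daemons MPITest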
>>
>> On the remote host, I see the following three processes coming up
>> after I do the mpirun on the local machine:
>> 30603 ? S 0:00 sshd: jody_at_notty
>> 30604 ? Ss 0:00 bash -c PATH=/opt/openmpi/bin:$PATH ;
>> export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ;
>> export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons
>> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
>> 30605 ? S 0:00 /opt/openmpi/bin/orted --debug-daemons
>> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
>> nano_00 --universe jody_at_[hidden]:default-universe-14934
>> --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl
>>
>> So it looks as if the correct paths are set (probably thanks to
>> --enable-mpirun-prefix-by-default).
>>
>> If I interrupt on the local machine (Ctrl-C):
>>
>> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from
>> [0,0,0]
>> [aim-plankton:14983] [0,0,1] orted_recv_pls: received
>> kill_local_procs
>> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from
>> [0,0,0]
>> [aim-plankton:14983] [0,0,1] orted_recv_pls: received
>> kill_local_procs
>> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1166
>> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> errmgr_hnp.c at line 90
>> [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start
>> as expected.
>> [aim-plankton:14982] ERROR: There may be more information available
>> from
>> [aim-plankton:14982] ERROR: the remote shell (see above).
>> [aim-plankton:14982] ERROR: The daemon exited unexpectedly with
>> status 255.
>> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1166
>> --------------------------------------------------------------------------
>> WARNING: mpirun has exited before it received notification that all
>> started processes had terminated. You should double check and ensure
>> that there are no runaway processes still executing.
>> --------------------------------------------------------------------------
>> [aim-plankton:14983] OOB: Connection to HNP lost
>>
>> On the remote machine, the "sshd: jody_at_notty" process is gone,
>> but the other two stay.
>> I would be grateful for any suggestions!
>>
>> Jody
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems