Open MPI User's Mailing List Archives


From: jody (jody.xha_at_[hidden])
Date: 2007-08-16 05:34:03


Hi Tim

Just a quick update about my ssh/LD_LIBRARY_PATH problem.

Apparently on my system sshd was configured not to permit
user-defined environment variables (for security reasons?).
To fix that I had to edit the file
  /etc/ssh/sshd_config
changing the entry
  #PermitUserEnvironment no
to
  PermitUserEnvironment yes
and adding these lines to the file ~/.ssh/environment:
  PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
  LD_LIBRARY_PATH=/opt/openmpi/lib
Maybe it is overkill, but at least ssh now makes the two variables available,
and simple openmpi test applications run.
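
For reference, a check along these lines (run from another node) shows whether
sshd now hands the two variables to non-interactive sessions; sshd has to be
restarted first so the sshd_config change takes effect:
  /etc/init.d/sshd restart                   # as root, on the machine whose config changed
  ssh nano_00 printenv PATH LD_LIBRARY_PATH  # both values should be printed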

I have made these fixes on all 7 of my Gentoo machines (nano_00 - nano_06),
and simple openmpi test applications run with any number of processes.

But the Fedora machine (plankton) still has problems in some cases.
In the test application I use, process #0 broadcasts a number to all
other processes.
This works in the following cases (always calling from nano_02):
 mpirun -np 3 --host nano_00 ./MPITest
 mpirun -np 3 --host plankton ./MPITest
 mpirun -np 3 --host plankton,nano_00 ./MPITest
But it doesn't work like this:
 mpirun -np 4 --host nano_00,plankton ./MPITest
as soon as the MPI_Broadcast statement is reached,
I get an error message:
[nano_00][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
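
As far as I can tell, errno 113 is EHOSTUNREACH ("No route to host"), which
usually points at a firewall or routing problem rather than at Open MPI itself.
A rough way to test that on plankton (a sketch; the service name assumes a
stock Fedora install) would be:
  /sbin/iptables -L -n         # list the current filter rules
  /sbin/service iptables stop  # temporarily disable the firewall, for testing only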

Does this still agree with your firewall hypothesis?

Thanks
  Jody

On 8/14/07, Tim Prins <tprins_at_[hidden]> wrote:
> Jody,
>
> jody wrote:
> > Hi Tim
> > thanks for the suggestions.
> >
> > I now set both paths in .zshenv but it seems that LD_LIBRARY_PATH
> > still does not get set.
> > The ldd experiment shows that none of the openmpi libraries are found,
> > and indeed the printenv shows that PATH is there but LD_LIBRARY_PATH is
> > not.
> Are you setting LD_LIBRARY_PATH anywhere else in your scripts? I have,
> on more than one occasion, forgotten that I needed to do:
> export LD_LIBRARY_PATH="/foo:$LD_LIBRARY_PATH"
>
> Instead of just:
> export LD_LIBRARY_PATH="/foo"
>
> >
> > It is rather unclear why this happens...
> >
> > As to the second problem:
> > $ mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed: (103)
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed,
> > connecting over all interfaces failed!
> > [aim-nano_02:05455] OOB: Connection to HNP lost
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> > [aim-plankton.unizh.ch:24222] ERROR: A daemon on node nano_02 failed to start as expected.
> > [aim-plankton.unizh.ch:24222] ERROR: There may be more information available from
> > [aim-plankton.unizh.ch:24222] ERROR: the remote shell (see above).
> > [aim-plankton.unizh.ch:24222] ERROR: The daemon exited unexpectedly with status 1.
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
> >
> > The strange thing is that nano_02's address is 130.60.49.130
> > and plankton's (the caller) is 130.60.49.134.
> > I also made sure that nano_02 can ssh to plankton without a password, but
> > that didn't change the output.
>
> What is happening here is that the daemon launched on nano_02 is trying
> to contact mpirun on plankton, and is failing for some reason.
>
> Do you have any firewalls/port filtering enabled on nano_02? Open MPI
> generally cannot be run when there are any firewalls on the machines
> being used.
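>
> As a quick check (the port number changes on every run, so take it from the
> error output above), you could try, from nano_02:
>   telnet 130.60.49.134 40618
> while mpirun is still sitting there. If that cannot connect, something
> between the two machines is filtering the traffic.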
>
> Hope this helps,
>
> Tim
>
> >
> > Does this message give any hints as to the problem?
> >
> > Jody
> >
> >
> > On 8/14/07, Tim Prins <tprins_at_[hidden]> wrote:
> >
> > Hi Jody,
> >
> > jody wrote:
> > > Hi
> > > I installed openmpi 1.2.2 on a quad-core Intel machine running Fedora 6
> > > (hostname plankton).
> > > I set PATH and LD_LIBRARY_PATH in the .zshrc file:
> > Note that .zshrc is only used for interactive shells. You need to set up
> > your system so that LD_LIBRARY_PATH and PATH are also set for
> > non-interactive logins. See this zsh FAQ entry for which files you need
> > to modify:
> > http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19
> >
> > (BTW: I do not use zsh, but my assumption is that the file you want to
> > set the PATH and LD_LIBRARY_PATH in is .zshenv)
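> >
> > If that is right, a ~/.zshenv along these lines should be enough (a sketch;
> > I am assuming your install prefix is /opt/openmpi):
> >   export PATH=/opt/openmpi/bin:$PATH
> >   export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH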
> > > $ echo $PATH
> > >
> > /opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin
> >
> > >
> > > $ echo $LD_LIBRARY_PATH
> > > /opt/openmpi/lib:
> > >
> > > When I run
> > > $ mpirun -np 2 ./MPITest2
> > > i get the message
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > >
> > > However
> > > $ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
> > > works. Any explanation?
> > Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
> > mpirun -np 2 ldd ./MPITest2
> >
> > This should show what libraries your executable is using. Make sure all
> > of the libraries are resolved.
> >
> > Also, try running:
> > mpirun -np 1 printenv |grep LD_LIBRARY_PATH
> > to see what LD_LIBRARY_PATH is for your executables. Note that you
> > cannot simply run mpirun echo $LD_LIBRARY_PATH, as the variable will be
> > expanded by your local shell before mpirun ever sees it.
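> > If you do want to use echo, something along the lines of
> >   mpirun -np 1 sh -c 'echo $LD_LIBRARY_PATH'
> > (note the single quotes) keeps the variable from being expanded locally.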
> >
> > >
> > > Second problem:
> > > I have also installed openmpi 1.2.2 on an AMD machine running Gentoo
> > > Linux (hostname nano_02).
> > > Here as well PATH and LD_LIBRARY_PATH are set correctly,
> > > and
> > > $ mpirun -np 2 ./MPITest2
> > > works locally on nano_02.
> > >
> > > If, however, from plankton I call
> > > $ mpirun -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > > the call hangs with no output whatsoever.
> > > Any pointers on how to solve this problem?
> > Try running:
> > mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02
> > ./MPI2Test2
> >
> > This should give some more output as to what is happening.
> >
> > Hope this helps,
> >
> > Tim
> >
> > >
> > > Thank You
> > > Jody
> > >
> > >
> > >
> > >
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>