Open MPI User's Mailing List Archives


From: jody (jody.xha_at_[hidden])
Date: 2007-08-16 05:34:03


Hi Tim

Just a quick update about my ssh/LD_LIBRARY_PATH problem.

Apparently on my system the sshd was configured not to permit
user-defined environment variables (security reasons?).
To fix that I had to change the file
  /etc/ssh/sshd_config
By changing the entry
  #PermitUserEnvironment no
to
  PermitUserEnvironment yes
and adding these lines to the file ~/.ssh/environment
  PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
  LD_LIBRARY_PATH=/opt/openmpi/lib
Maybe it is overkill, but at least ssh now makes the two variables available,
and simple openmpi test applications run.
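For anyone hitting the same issue: sshd only accepts plain NAME=value lines in ~/.ssh/environment (no "export", no quoting, no variable expansion). A quick sketch to sanity-check the file format; the paths are the ones from the lines above, and the temp file just keeps the example self-contained (on a real node you would check ~/.ssh/environment itself):

```shell
# sshd only reads plain NAME=value lines from ~/.ssh/environment
# (and only when PermitUserEnvironment is enabled, as above).
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
LD_LIBRARY_PATH=/opt/openmpi/lib
EOF
# Count lines that are NOT of the form NAME=value; 0 means well-formed.
bad=$(grep -cv '^[A-Za-z_][A-Za-z0-9_]*=' "$envfile") || true
echo "malformed lines: $bad"
rm -f "$envfile"
# To confirm the variables actually reach a non-interactive session:
#   ssh nano_00 'echo $LD_LIBRARY_PATH; which mpirun'
```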

I have made these fixes on all 7 of my Gentoo machines (nano_00 - nano_06),
and simple openmpi test applications run with any number of processes.

But the Fedora machine (plankton) still has problems in some cases.
In the test application I use, process #0 broadcasts a number to all
other processes.
This works in the following cases, always calling from nano_02:
 mpirun -np 3 --host nano_00 ./MPITest
 mpirun -np 3 --host plankton ./MPITest
 mpirun -np 3 --host plankton,nano_00 ./MPITest
But it doesn't work like this:
 mpirun -np 4 --host nano_00,plankton ./MPITest
as soon as the MPI broadcast statement is reached,
I get an error message:
[nano_00][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
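(For reference: errno values are platform-specific, but on Linux errno 113 is EHOSTUNREACH, "No route to host". A one-liner to decode it, assuming python3 is available on the node:)

```shell
# Decode an errno number into its symbolic name and message.
# On a Linux box this prints: EHOSTUNREACH - No route to host
python3 -c 'import errno, os; print(errno.errorcode[113], "-", os.strerror(113))'
```

A host-unreachable error at connect() time is consistent with a firewall that rejects packets via ICMP (e.g. an iptables REJECT rule), or with a plain routing problem between the nodes.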

Does this still agree with your firewall hypothesis?

Thanks
  Jody

On 8/14/07, Tim Prins <tprins_at_[hidden]> wrote:
> Jody,
>
> jody wrote:
> > Hi TIm
> > thanks for the suggestions.
> >
> > I now set both paths in .zshenv but it seems that LD_LIBRARY_PATH
> > still does not get set.
> > The ldd experiment shows that none of the openmpi libraries are found,
> > and indeed printenv shows that PATH is there but LD_LIBRARY_PATH is
> > not.
> Are you setting LD_LIBRARY_PATH anywhere else in your scripts? I have,
> on more than one occasion, forgotten that I needed to do:
> export LD_LIBRARY_PATH="/foo:$LD_LIBRARY_PATH"
>
> Instead of just:
> export LD_LIBRARY_PATH="/foo"
>
> >
> > It is rather unclear why this happens...
> >
> > As to the second problem:
> > $ mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02
> > ./MPI2Test2
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed: (103)
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed,
> > connecting over all interfaces failed!
> > [aim-nano_02:05455] OOB: Connection to HNP lost
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 275
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1164
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > errmgr_hnp.c at line 90
> > [aim-plankton.unizh.ch:24222] ERROR: A daemon on node nano_02 failed
> > to start as expected.
> > [aim-plankton.unizh.ch:24222] ERROR: There may be more information
> > available from
> > [aim-plankton.unizh.ch:24222] ERROR: the remote shell (see above).
> > [aim-plankton.unizh.ch:24222] ERROR: The daemon exited unexpectedly
> > with status 1.
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 188
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1196
> >
> > The strange thing is that nano_02's address is 130.60.49.130
> > and plankton's (the caller) is 130.60.49.134.
> > I also made sure that nano_02 can ssh to plankton without a password, but
> > that didn't change the output.
>
> What is happening here is that the daemon launched on nano_02 is trying
> to contact mpirun on plankton, and is failing for some reason.
>
> Do you have any firewalls/port filtering enabled on nano_02? Open MPI
> generally cannot be run when there are any firewalls on the machines
> being used.
>
> Hope this helps,
>
> Tim
>
> >
> > Does this message give any hints as to the problem?
> >
> > Jody
> >
> >
> > On 8/14/07, Tim Prins <tprins_at_[hidden]> wrote:
> >
> > Hi Jody,
> >
> > jody wrote:
> > > Hi
> > > I installed openmpi 1.2.2 on a quad-core Intel machine running
> > > Fedora 6 (hostname plankton)
> > > I set PATH and LD_LIBRARY_PATH in the .zshrc file:
> > Note that .zshrc is only used for interactive logins. You need to setup
> > your system so the LD_LIBRARY_PATH and PATH is also set for
> > non-interactive logins. See this zsh FAQ entry for what files you need
> > to modify:
> > http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19
> >
> > (BTW: I do not use zsh, but my assumption is that the file you want to
> > set the PATH and LD_LIBRARY_PATH in is .zshenv)
> > > $ echo $PATH
> > > /opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin
> > >
> > > $ echo $LD_LIBRARY_PATH
> > > /opt/openmpi/lib:
> > >
> > > When I run
> > > $ mpirun -np 2 ./MPITest2
> > > i get the message
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > >
> > > However
> > > $ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
> > > works. Any explanation?
> > Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
> > mpirun -np 2 ldd ./MPITest2
> >
> > This should show what libraries your executable is using. Make sure all
> > of the libraries are resolved.
> >
> > Also, try running:
> > mpirun -np 1 printenv |grep LD_LIBRARY_PATH
> > to see what the LD_LIBRARY_PATH is for your executables. Note that you
> > can NOT simply run "mpirun echo $LD_LIBRARY_PATH", as the variable
> > will be interpreted in the executing shell.
> >
> > >
> > > Second problem:
> > > I have also installed openmpi 1.2.2 on an AMD machine running gentoo
> > > linux (hostname nano_02).
> > > Here as well PATH and LD_LIBRARY_PATH are set correctly,
> > > and
> > > $ mpirun -np 2 ./MPITest2
> > > works locally on nano_02.
> > >
> > > If, however, from plankton i call
> > > $ mpirun -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > > the call hangs with no output whatsoever.
> > > Any pointers on how to solve this problem?
> > Try running:
> > mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02
> > ./MPI2Test2
> >
> > This should give some more output as to what is happening.
> >
> > Hope this helps,
> >
> > Tim
> >
> > >
> > > Thank You
> > > Jody
> > >
> > >
> > >
> > >
> > ------------------------------------------------------------------------
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >