Open MPI User's Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-10-01 22:41:32


Hi,

On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
> Hi again,
>
> Yes the error output is the same:
> root@sun:~# mpirun --hostfile hostfile main
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1164
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
> [sun:23748] ERROR: There may be more information available from
> [sun:23748] ERROR: the remote shell (see above).
> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
>
> --------------------------------------------------------------------------
Can you try:
mpirun --debug-daemons --hostfile hostfile main

This may give more output about the error. Also, try
mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main

This will print out the exact command that is used to launch the orted.

Also, I would highly recommend not running Open MPI as root. It is just a bad
idea.
>
> I wrote the following to my .ssh/environment (on all machines)
> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>
> PATH=$PATH:/usr/local/lib;
>
> export LD_LIBRARY_PATH;
> export PATH;
>
> and added the statement you told me to the sshd_config (on all machines):
> PermitUserEnvironment yes
>
> And it seems to me that the paths are correct now.
>
> My shell is bash (/bin/bash)
>
> When running locate orted (to find out where exactly my Open MPI
> installation is, since I compiled with the defaults), I saw that on sun
> there was a /usr/bin/orted while there wasn't one on saturn.
> I deleted /usr/bin/orted on sun and tried again with the option --prefix
> /usr/local/ (which seems to be my installation directory), but it
> didn't work (same error).
Is it possible that you are mixing 2 different installations of Open MPI? You
might consider installing Open MPI on an NFS drive to make these things a bit
easier.
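
One way to check for this, as a rough sketch: ask each node which orted a
non-interactive ssh session would pick up. The function name check_orted is
made up for illustration; the host names are the ones from your mail, and
BatchMode and ConnectTimeout are standard OpenSSH options that only keep the
loop from hanging on a password prompt or an unreachable node.

```shell
#!/bin/sh
# Print the orted that each node resolves through its non-interactive PATH.
# check_orted is a hypothetical helper, not part of Open MPI.
check_orted() {
    for host in "$@"; do
        # 'command -v' runs in the remote shell and prints the resolved path;
        # a failed lookup or failed connection leaves orted_path empty.
        orted_path=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" \
            'command -v orted' 2>/dev/null) || orted_path=""
        echo "$host: ${orted_path:-orted not found in PATH}"
    done
}
```

Calling "check_orted sun saturn" should print one line per node. If the paths
differ, or one node reports nothing, the installations are probably not
consistent with each other.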
>
> Is there a script or anything like that with which I can uninstall
> Open MPI? I might try a new compilation to /opt/openmpi, since
> it doesn't look like I will be able to solve the problem otherwise.
If you still have the tree around that you used to 'make' Open MPI, you can
just go into that tree and type 'make uninstall'.
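
Separately, on the ~/.ssh/environment contents quoted above: as far as I know,
sshd reads that file as plain NAME=value lines and does not run it through a
shell, so "export", "$PATH" and trailing semicolons would be taken literally
rather than interpreted. A small sketch to flag such lines (check_ssh_env is a
hypothetical helper name, not something shipped with OpenSSH or Open MPI):

```shell
#!/bin/sh
# Flag lines in an sshd-style environment file that are not plain NAME=value.
# check_ssh_env is a hypothetical helper used only for illustration.
check_ssh_env() {
    file=$1
    status=0
    n=0
    # The '|| [ -n "$line" ]' part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        case $line in
            ''|'#'*)                      # blank lines and comments are fine
                continue ;;
            export\ *|*'$'*|*';')         # shell syntax would not be interpreted
                echo "line $n: shell syntax: $line"
                status=1 ;;
            *=*)                          # plain assignment, nothing to report
                ;;
            *)
                echo "line $n: not NAME=value: $line"
                status=1 ;;
        esac
    done < "$file"
    return $status
}
```

Running "check_ssh_env ~/.ssh/environment" prints any suspicious lines and
returns non-zero if it finds one. This may or may not be related to the daemon
failure, but it is cheap to rule out.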

Hope this helps,

Tim

>
> jody wrote:
> > Now that the PATHs seem to be set correctly for
> > ssh i don't know what the problem could be.
> >
> > Is the error message still the same as in the first mail?
> > Did you do the environment/sshd_config changes on both machines?
> > What shell are you using?
> >
> > One other test you could make is to start your application
> > with the --prefix option:
> >
> > $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
> >
> > (assuming your Open MPI installation lies in /opt/openmpi
> > on both machines)
> >
> >
> > Jody
> >
> > On 10/1/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
> >> Hi Jody,
> >> I did the steps as you said, but it didn't work for me.
> >> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
> >> made the changes to sshd_config.
> >>
> >> But all this didn't solve my problem, although the paths seemed to be
> >> set correctly (judging by what ssh saturn `printenv >> test` says). I also
> >> restarted the ssh server; the error is the same.
> >>
> >> Hope you can help me out here and thanks for your help so far
> >> dino
> >>
> >> jody wrote:
> >>> Dino -
> >>> I had a similar problem.
> >>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
> >>> in the file ~/.ssh/environment on the client and setting
> >>> PermitUserEnvironment yes
> >>> in /etc/ssh/sshd_config on the server (for this you need root
> >>> privileges though)
> >>>
> >>> To be on the safe side, i did both on all my nodes
> >>>
> >>> Jody
> >>>
> >>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
> >>>> Hi Jody,
> >>>>
> >>>> Thanks for your help, it really is the case that neither in PATH nor in
> >>>> LD_LIBRARY_PATH is the path to the libs set correctly. I'll try it out,
> >>>> hope it works.
> >>>>
> >>>> jody wrote:
> >>>>> Hi Dino
> >>>>>
> >>>>> Try
> >>>>> ssh saturn printenv | grep PATH
> >>>>>
> >>>>> from your host sun to see what your environment variables are when
> >>>>> ssh is run without a shell.
> >>>>>
> >>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have a problem running a simple program, mpihello.cpp.
> >>>>>>
> >>>>>> Here is an excerpt of the error and the command:
> >>>>>> root@sun:~# mpirun -H sun,saturn main
> >>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >>>>>> base/pls_base_orted_cmds.c at line 275
> >>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
> >>>>>> at line 1164
> >>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
> >>>>>> line 90
> >>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
> >>>>>> [sun:25213] ERROR: There may be more information available from
> >>>>>> [sun:25213] ERROR: the remote shell (see above).
> >>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
> >>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> >>>>>> base/pls_base_orted_cmds.c at line 188
> >>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
> >>>>>> at line 1196
> >>>>>> --------------------------------------------------------------------------
> >>>>>> mpirun was unable to cleanly terminate the daemons for this job.
> >>>>>> Returned value Timeout instead of ORTE_SUCCESS.
> >>>>>>
> >>>>>> --------------------------------------------------------------------------
> >>>>>>
> >>>>>> The program is runnable from each node alone (mpirun -np 2 main)
> >>>>>>
> >>>>>> My path variables:
> >>>>>> $PATH
> >>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> >>>>>> echo $LD_LIBRARY_PATH
> >>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> >>>>>>
> >>>>>> Passwordless ssh is up 'n running
> >>>>>>
> >>>>>> I walked through the FAQ and Mailing Lists but couldn't find any
> >>>>>> solution for my problem.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Dino R.
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> users_at_[hidden]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users