Open MPI User's Mailing List Archives

From: Dino Rossegger (dino.rossegger_at_[hidden])
Date: 2007-10-02 09:11:25


Here are the syntax and output of the command:
root@sun:~# mpirun --hostfile hostfile saturn
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:28777] ERROR: A daemon on node saturn failed to start as expected.
[sun:28777] ERROR: There may be more information available from
[sun:28777] ERROR: the remote shell (see above).
[sun:28777] ERROR: The daemon exited unexpectedly with status 255.
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

I'm using version 1.2.3, which I got from open-mpi.org. I'm using the same
version of Open MPI on all nodes.

Thanks
dino
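
One way to double-check the "same version on all nodes" claim is to ask each node over a non-interactive ssh session, which is the kind of session the rsh launcher uses to start orted. The following is only a sketch under assumptions: `probe_node` and the `REMOTE_SH` override are hypothetical names made up for illustration, and it assumes that `ompi_info` is installed next to `orted` and prints a line containing "Open MPI:" followed by the version.

```shell
#!/bin/sh
# Sketch: show which orted and which Open MPI version a node sees when it is
# reached through a non-interactive remote shell.
# REMOTE_SH is a hypothetical override (defaults to ssh) so the probe can be
# dry-run without a network.
REMOTE_SH="${REMOTE_SH:-ssh}"

probe_node() {
    # $1 = host name as it appears in the hostfile
    host="$1"

    # Path of orted as resolved by the remote, non-interactive PATH.
    orted_path=$($REMOTE_SH "$host" 'command -v orted' 2>/dev/null) || orted_path=""

    # Assumption: ompi_info prints a line such as "Open MPI: 1.2.3".
    version=$($REMOTE_SH "$host" 'ompi_info 2>/dev/null | grep "Open MPI:"' 2>/dev/null) || version=""

    printf '%s: orted=%s version=%s\n' \
        "$host" "${orted_path:-NOT FOUND}" "${version:-unknown}"
}

# Example (not run here): for h in sun saturn; do probe_node "$h"; done
```

If the two nodes report different orted paths or different version lines, that would point toward mixed installations rather than an ssh problem.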

Tim Prins wrote:
> This is very odd. The daemon is being launched properly, but then things
> get strange. It looks like mpirun is sending a message to kill
> application processes on saturn.
>
> What version of Open MPI are you using?
>
> Are you sure that the same version of Open MPI is being used everywhere?
>
> Can you try:
> mpirun --hostfile hostfile hostname
>
> Thanks,
>
> Tim
>
> Dino Rossegger wrote:
>> Hi again,
>>
>> Tim Prins wrote:
>>> Hi,
>>>
>>> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>>>> Hi again,
>>>>
>>>> Yes the error output is the same:
>>>> root@sun:~# mpirun --hostfile hostfile main
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:23748] ERROR: There may be more information available from
>>>> [sun:23748] ERROR: the remote shell (see above).
>>>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>
>>>> --------------------------------------------------------------------------
>>> Can you try:
>>> mpirun --debug-daemons --hostfile hostfile main
>>>
>> I did, but it doesn't give me any special output (as far as I can see).
>> Here's the output:
>> root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
>> Daemon [0,0,1] checking in as pid 27168 on host sun
>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:27167] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:27167] ERROR: There may be more information available from
>> [sun:27167] ERROR: the remote shell (see above).
>> [sun:27167] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [sun:27168] [0,0,1] orted_recv_pls: received exit
>>
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
>>
>>> This may give more output about the error. Also, try
>>> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>> Here's the output, but I can't decipher it ^^
>> root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>> [sun:27175] pls:rsh: local csh: 0, local sh: 1
>> [sun:27175] pls:rsh: assuming same remote shell as local shell
>> [sun:27175] pls:rsh: remote csh: 0, remote sh: 1
>> [sun:27175] pls:rsh: final template argv:
>> [sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>> [sun:27175] pls:rsh: launching on node sun
>> [sun:27175] pls:rsh: sun is a LOCAL node
>> [sun:27175] pls:rsh: changing to directory /root
>> [sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>> [sun:27175] pls:rsh: launching on node saturn
>> [sun:27175] pls:rsh: saturn is a REMOTE node
>> [sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:27175] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:27175] ERROR: There may be more information available from
>> [sun:27175] ERROR: the remote shell (see above).
>> [sun:27175] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
>>
>>> This will print out the exact command that is used to launch the orted.
>>>
>>> Also, I would highly recommend not running Open MPI as root. It is just a bad
>>> idea.
>> Yes I know, I'm doing it just now for testing.
>>>> I wrote the following to my .ssh/environment (on all machines):
>>>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>>>
>>>> PATH=$PATH:/usr/local/lib;
>>>>
>>>> export LD_LIBRARY_PATH;
>>>> export PATH;
>>>>
>>>> and added the statement you told me to the sshd_config (on all machines):
>>>> PermitUserEnvironment yes
>>>>
>>>> And it seems to me that the paths are correct now.
>>>>
>>>> My shell is bash (/bin/bash).
>>>>
>>>> When running locate orted (to find out where exactly my Open MPI
>>>> installation is; I used the compilation defaults), I saw that on sun
>>>> there was a /usr/bin/orted while there wasn't one on saturn.
>>>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>>>> /usr/local/ (which seems to be my installation directory), but it
>>>> didn't work (same error).
>>> Is it possible that you are mixing 2 different installations of Open MPI? You
>>> may consider installing OpenMPI to a NFS drive to make these things a bit
>>> easier.
>>>> Is there a script or anything like that with which I can uninstall
>>>> Open MPI? I might try a new compilation to /opt/openmpi, since
>>>> it doesn't look like I will be able to solve the problem.
>>> If you still have the tree around that you used to 'make' Open MPI, you can
>>> just go into that tree and type 'make uninstall'.
>>>
>>> Hope this helps,
>>>
>>> Tim
>>>
>>>> jody wrote:
>>>>> Now that the PATHs seem to be set correctly for
>>>>> ssh, I don't know what the problem could be.
>>>>>
>>>>> Is the error message still the same as in the first mail?
>>>>> Did you make the environment/sshd_config changes on both machines?
>>>>> What shell are you using?
>>>>>
>>>>> One other test you could make is to start your application
>>>>> with the --prefix option:
>>>>>
>>>>> $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>>>
>>>>> (assuming your Open MPI installation lies in /opt/openmpi
>>>>> on both machines)
>>>>>
>>>>>
>>>>> Jody
>>>>>
>>>>> On 10/1/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>> Hi Jody,
>>>>>> I did the steps as you said, but it didn't work for me.
>>>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>>>> made the changes to sshd_config.
>>>>>>
>>>>>> But none of this solved my problem, although the paths seem to be
>>>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I also
>>>>>> restarted the ssh server; the error is the same.
>>>>>>
>>>>>> Hope you can help me out here and thanks for your help so far
>>>>>> dino
>>>>>>
>>>>>> jody wrote:
>>>>>>> Dino -
>>>>>>> I had a similar problem.
>>>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>>>> in the file ~/.ssh/environment on the client and setting
>>>>>>> PermitUserEnvironment yes
>>>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>>>> privileges, though).
>>>>>>>
>>>>>>> To be on the safe side, I did both on all my nodes.
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>> Hi Jody,
>>>>>>>>
>>>>>>>> Thanks for your help. It really is the case that neither PATH nor
>>>>>>>> LD_LIBRARY_PATH contains the correct path to the libs. I'll try it out
>>>>>>>> and hope it works.
>>>>>>>>
>>>>>>>> jody wrote:
>>>>>>>>> Hi Dino
>>>>>>>>>
>>>>>>>>> Try
>>>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>>>
>>>>>>>>> from your host sun to see what your environment variables are when
>>>>>>>>> ssh is run without a shell.
>>>>>>>>>
>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>>>
>>>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>>>> root@sun:~# mpirun -H sun,saturn main
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>>>
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main).
>>>>>>>>>>
>>>>>>>>>> My path variables:
>>>>>>>>>> $PATH
>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>
>>>>>>>>>> Passwordless ssh is up 'n running
>>>>>>>>>>
>>>>>>>>>> I walked through the FAQ and Mailing Lists but couldn't find any
>>>>>>>>>> solution for my problem.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dino R.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
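
A side note on the ~/.ssh/environment contents quoted in this thread: according to the sshd(8) manual page, that file may contain only empty lines, comment lines starting with '#', and assignment lines of the form name=value. It is not run through a shell, so `export` statements, `$PATH` references and trailing semicolons are not interpreted the way they would be in a shell startup file. The following is a minimal sketch of a checker for such a file; `check_ssh_env_file` is a hypothetical helper name, and the check on variable names is stricter than what sshd itself enforces.

```shell
#!/bin/sh
# Sketch: flag lines in an sshd user-environment file that look like shell
# syntax rather than plain NAME=value assignments.
check_ssh_env_file() {
    # $1 = path to the file, for example "$HOME/.ssh/environment"
    file="$1"
    rc=0
    lineno=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case "$line" in
            ''|'#'*) continue ;;    # empty lines and comments are allowed
        esac
        case "$line" in
            *=*) name=${line%%=*} ;;
            *)  printf '%s\n' "line $lineno: no '=' found: $line"
                rc=1
                continue ;;
        esac
        # Stricter than sshd: expect a conventional variable name.
        case "$name" in
            ''|[0-9]*|*[!A-Za-z0-9_]*)
                printf '%s\n' "line $lineno: unusual variable name: $name"
                rc=1
                continue ;;
        esac
        case "$line" in
            *'$'*) printf '%s\n' "line $lineno: '\$' is not expanded: $line"; rc=1 ;;
        esac
        case "$line" in
            *';') printf '%s\n' "line $lineno: trailing ';' becomes part of the value: $line"; rc=1 ;;
        esac
    done < "$file"
    return "$rc"
}
```

Running `check_ssh_env_file ~/.ssh/environment` on each node prints the suspicious lines and returns non-zero if it finds any.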