Open MPI User's Mailing List Archives

From: Tim Prins (tprins_at_[hidden])
Date: 2007-10-02 08:22:37


This is very odd. The daemon is being launched properly, but then things
get strange. It looks like mpirun is sending a message to kill
application processes on saturn.

What version of Open MPI are you using?

Are you sure that the same version of Open MPI is being used everywhere?

Can you try:
mpirun --hostfile hostfile hostname
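(For reference, the hostfile is just a plain-text file listing one node
per line. Assuming your two machines, it might look like:

sun slots=1
saturn slots=1

If plain 'hostname' runs on both nodes, the ssh launch path is fine and
we can rule that out.)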

Thanks,

Tim

Dino Rossegger wrote:
> Hi again,
>
> Tim Prins schrieb:
>> Hi,
>>
>> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>>> Hi again,
>>>
>>> Yes the error output is the same:
>>> root_at_sun:~# mpirun --hostfile hostfile main
>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>> base/pls_base_orted_cmds.c at line 275
>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>> line 1164
>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>>> [sun:23748] ERROR: There may be more information available from
>>> [sun:23748] ERROR: the remote shell (see above).
>>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>> base/pls_base_orted_cmds.c at line 188
>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>> line 1196
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons for this job.
>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>
>>> --------------------------------------------------------------------------
>> Can you try:
>> mpirun --debug-daemons --hostfile hostfile main
>>
> I did, but it doesn't give me any special output (as far as I can see).
> Here's the output:
> root_at_sun:~# mpirun --debug-daemons --hostfile hostfile ./main
> Daemon [0,0,1] checking in as pid 27168 on host sun
> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1164
> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [sun:27167] ERROR: A daemon on node saturn failed to start as expected.
> [sun:27167] ERROR: There may be more information available from
> [sun:27167] ERROR: the remote shell (see above).
> [sun:27167] ERROR: The daemon exited unexpectedly with status 255.
> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [sun:27168] [0,0,1] orted_recv_pls: received exit
>
> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
>> This may give more output about the error. Also, try
>> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>
> Here's the output, but I can't decipher it ^^
> root_at_sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
> [sun:27175] pls:rsh: local csh: 0, local sh: 1
> [sun:27175] pls:rsh: assuming same remote shell as local shell
> [sun:27175] pls:rsh: remote csh: 0, remote sh: 1
> [sun:27175] pls:rsh: final template argv:
> [sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1
> --name <template> --num_procs 3 --vpid_start 0 --nodename <template>
> --universe root_at_sun:default-universe-27175
> --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> [sun:27175] pls:rsh: launching on node sun
> [sun:27175] pls:rsh: sun is a LOCAL node
> [sun:27175] pls:rsh: changing to directory /root
> [sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted
> --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun
> --universe root_at_sun:default-universe-27175
> --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash
> SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root
> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root
> PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib
> PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root
> SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun
> OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals
> OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
> [sun:27175] pls:rsh: launching on node saturn
> [sun:27175] pls:rsh: saturn is a REMOTE node
> [sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn
> orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
> --nodename saturn --universe root_at_sun:default-universe-27175
> --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
> [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash
> SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root
> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
> SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root
> PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib
> PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root
> SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun
> OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals
> OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1164
> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> [sun:27175] ERROR: A daemon on node saturn failed to start as expected.
> [sun:27175] ERROR: There may be more information available from
> [sun:27175] ERROR: the remote shell (see above).
> [sun:27175] ERROR: The daemon exited unexpectedly with status 255.
> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
> line 1196
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
>
> --------------------------------------------------------------------------
>
>> This will print out the exact command that is used to launch the orted.
>>
>> Also, I would highly recommend not running Open MPI as root. It is just a bad
>> idea.
>
> Yes, I know. I'm only doing it for testing right now.
>>> I wrote the following to my .ssh/environment (on all machines):
>>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>>
>>> PATH=$PATH:/usr/local/lib;
>>>
>>> export LD_LIBRARY_PATH;
>>> export PATH;
>>>
>>> and added the statement you told me to sshd_config (on all machines):
>>> PermitUserEnvironment yes
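>>> (I also restarted sshd afterwards so the change takes effect; on a
>>> Debian-style system that is something like: /etc/init.d/ssh restart)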
>>>
>>> And it seems to me that the paths are correct now.
>>>
>>> My shell is bash (/bin/bash)
>>>
>>> When running locate orted (to find out where exactly my Open MPI
>>> installation is; I compiled with the defaults), I saw that on sun there
>>> was a /usr/bin/orted while there wasn't one on saturn.
>>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>>> /usr/local/ (which seems to be my installation directory), but it
>>> didn't work (same error).
>> Is it possible that you are mixing 2 different installations of Open MPI? You
>> may consider installing Open MPI on an NFS share to make these things a bit
>> easier.
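>> A quick way to check for a mix-up is to compare what each node actually
>> resolves (saturn here stands for any of your remote nodes):
>>
>> which mpirun orted
>> ompi_info | grep "Open MPI:"
>> ssh saturn which orted
>> ssh saturn 'ompi_info | grep "Open MPI:"'
>>
>> Both nodes should report the same paths and the same version.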
>>> Is there a script or anything like that with which I can uninstall
>>> Open MPI? I might try a fresh build in /opt/openmpi, since it
>>> doesn't look like I'll be able to solve the problem otherwise.
>> If you still have the tree around that you used to 'make' Open MPI, you can
>> just go into that tree and type 'make uninstall'.
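>> For example (the build-tree path below is just a placeholder for
>> wherever you unpacked and built the source):
>>
>> cd ~/openmpi-build
>> make uninstall
>>
>> and for a fresh build into /opt/openmpi:
>>
>> ./configure --prefix=/opt/openmpi
>> make all install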
>>
>> Hope this helps,
>>
>> Tim
>>
>>> jody schrieb:
>>>> Now that the PATHs seem to be set correctly for
>>>> ssh, I don't know what the problem could be.
>>>>
>>>> Is the error message still the same as in the first mail?
>>>> Did you do the environment/sshd_config changes on both machines?
>>>> What shell are you using?
>>>>
>>>> One other test you could make is to start your application
>>>> with the --prefix option:
>>>>
>>>> $mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>>
>>>> (assuming your Open MPI installation lies in /opt/openmpi
>>>> on both machines)
>>>>
>>>>
>>>> Jody
>>>>
>>>> On 10/1/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>> Hi Jody,
>>>>> I did the steps as you said, but it didn't work for me.
>>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>>> made the changes to sshd_config.
>>>>>
>>>>> But all this didn't solve my problem, although the paths seem to be
>>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I
>>>>> also restarted the ssh server; the error is the same.
>>>>>
>>>>> Hope you can help me out here and thanks for your help so far
>>>>> dino
>>>>>
>>>>> jody schrieb:
>>>>>> Dino -
>>>>>> I had a similar problem.
>>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>>> in the file ~/.ssh/environment on the client and setting
>>>>>> PermitUserEnvironment yes
>>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>>> privileges, though).
>>>>>>
>>>>>> To be on the safe side, I did both on all my nodes.
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>> Hi Jody,
>>>>>>>
>>>>>>> Thanks for your help. It really is the case that neither PATH nor
>>>>>>> LD_LIBRARY_PATH has the path to the libs set correctly. I'll try it
>>>>>>> out and hope it works.
>>>>>>>
>>>>>>> jody schrieb:
>>>>>>>> Hi Dino
>>>>>>>>
>>>>>>>> Try
>>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>> from your host sun to see what your environment variables are when
>>>>>>>> ssh is run without a shell.
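>>>>>>>> For example, quoting so the variables get expanded by the remote
>>>>>>>> shell rather than your local one:
>>>>>>>>
>>>>>>>> ssh saturn 'echo $PATH'
>>>>>>>> ssh saturn 'echo $LD_LIBRARY_PATH'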
>>>>>>>>
>>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a problem running a simple programm mpihello.cpp.
>>>>>>>>>
>>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>>> root_at_sun:~# mpirun -H sun,saturn main
>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>> base/pls_base_orted_cmds.c at line 275
>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>>>>>>>>> at line 1164
>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
>>>>>>>>> line 90
>>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>> base/pls_base_orted_cmds.c at line 188
>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>>>>>>>>> at line 1196
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main)
>>>>>>>>>
>>>>>>>>> My path variables:
>>>>>>>>> echo $PATH
>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>
>>>>>>>>> Passwordless ssh is up 'n running
>>>>>>>>>
>>>>>>>>> I walked through the FAQ and Mailing Lists but couldn't find any
>>>>>>>>> solution for my problem.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dino R.
>>>>>>>>>
>>>>>>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users