Open MPI User's Mailing List Archives

From: Dino Rossegger (dino.rossegger_at_[hidden])
Date: 2007-10-02 03:29:34


Hi again,

Tim Prins wrote:
> Hi,
>
> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>> Hi again,
>>
>> Yes the error output is the same:
>> root_at_sun:~# mpirun --hostfile hostfile main
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:23748] ERROR: There may be more information available from
>> [sun:23748] ERROR: the remote shell (see above).
>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
> Can you try:
> mpirun --debug-daemons --hostfile hostfile main
>
I did, but it doesn't give me any additional output (as far as I can see).
Here's the output:
root_at_sun:~# mpirun --debug-daemons --hostfile hostfile ./main
Daemon [0,0,1] checking in as pid 27168 on host sun
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27167] ERROR: A daemon on node saturn failed to start as expected.
[sun:27167] ERROR: There may be more information available from
[sun:27167] ERROR: the remote shell (see above).
[sun:27167] ERROR: The daemon exited unexpectedly with status 255.
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received exit

[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

> This may give more output about the error. Also, try
> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main

Here's the output, but I can't decipher it ^^
root_at_sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
[sun:27175] pls:rsh: local csh: 0, local sh: 1
[sun:27175] pls:rsh: assuming same remote shell as local shell
[sun:27175] pls:rsh: remote csh: 0, remote sh: 1
[sun:27175] pls:rsh: final template argv:
[sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root_at_sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
[sun:27175] pls:rsh: launching on node sun
[sun:27175] pls:rsh: sun is a LOCAL node
[sun:27175] pls:rsh: changing to directory /root
[sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root_at_sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] pls:rsh: launching on node saturn
[sun:27175] pls:rsh: saturn is a REMOTE node
[sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root_at_sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27175] ERROR: A daemon on node saturn failed to start as expected.
[sun:27175] ERROR: There may be more information available from
[sun:27175] ERROR: the remote shell (see above).
[sun:27175] ERROR: The daemon exited unexpectedly with status 255.
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.

--------------------------------------------------------------------------

> This will print out the exact command that is used to launch the orted.
>
> Also, I would highly recommend not running Open MPI as root. It is just a bad
> idea.

Yes I know, I'm doing it just now for testing.
>> I wrote the following to my .ssh/environment (on all machines)
>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>
>> PATH=$PATH:/usr/local/lib;
>>
>> export LD_LIBRARY_PATH;
>> export PATH;
>>
>> and added the statement you told me to the sshd_config (on all machines):
>> PermitUserEnvironment yes
>>
>> And it seems to me that the paths are correct now.
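
A note on the file format quoted above: as far as I know, sshd reads ~/.ssh/environment as plain NAME=value lines (plus blank lines and '#' comments) and does not pass it through a shell, so `export` statements, trailing semicolons, and `$PATH` references are not interpreted. The following is only a minimal sketch of a checker under that assumption; the function name `check_ssh_environment` is made up for illustration and is not part of OpenSSH or Open MPI.

```python
import re

# Assumption: sshd treats ~/.ssh/environment as plain NAME=value lines
# (blank lines and lines starting with '#' are ignored) and performs no
# shell expansion. check_ssh_environment is a hypothetical helper name.
_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=(.*)$")


def check_ssh_environment(text):
    """Return a list of (line_number, problem) for suspicious lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _ASSIGNMENT.match(line)
        if match is None:
            problems.append((number, "not a NAME=value line"))
            continue
        value = match.group(1)
        if "$" in value:
            problems.append((number, "'$' is kept literally, not expanded"))
        if value.endswith(";"):
            problems.append((number, "trailing ';' becomes part of the value"))
    return problems


# Shortened sample in the style of the file quoted above.
sample = """LD_LIBRARY_PATH=/usr/local/lib;
PATH=$PATH:/usr/local/lib;
export PATH;
"""
for number, problem in check_ssh_environment(sample):
    print(f"line {number}: {problem}")
```

If the assumption holds, a file in the style quoted above would leave a stray ';' on the last LD_LIBRARY_PATH entry and set PATH to a literal "$PATH:..." string, which could be worth ruling out before looking further.
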
>>
>> My shell is bash (/bin/bash)
>>
>> When running locate orted (to find out where exactly my Open MPI
>> installation is; I used the compilation defaults) I saw that on sun
>> there was a /usr/bin/orted, while there wasn't one on saturn.
>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>> /usr/local/ (which seems to be my installation directory), but it
>> didn't work (same error).
> Is it possible that you are mixing 2 different installations of Open MPI? You
> may consider installing OpenMPI to a NFS drive to make these things a bit
> easier.
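
One way to look for mixed installations is to list every `orted` reachable through PATH on each node. The following is a rough sketch only: `find_executables` is a hypothetical helper, it inspects just the machine it runs on, and it would have to be run separately on sun and saturn (ideally with the PATH that a non-interactive ssh session sees).

```python
import os


def find_executables(name, path_value):
    """Return every executable called `name` in a PATH-style string.

    find_executables is a hypothetical helper for illustration; results
    are listed in PATH order, so the first entry is the one that would
    be picked up when `name` is run without an absolute path.
    """
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in hits:
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    copies = find_executables("orted", os.environ.get("PATH", ""))
    for copy in copies:
        print(copy)
    if len(copies) > 1:
        print("more than one orted found; installations may be mixed")
```

More than one hit on a node, or different first hits on different nodes, would be a hint that two builds are being mixed.
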
>> Is there a script or anything like that with which I can uninstall
>> Open MPI? I might try a new compilation to /opt/openmpi, since
>> it doesn't look like I will be able to solve the problem.
> If you still have the tree around that you used to 'make' Open MPI, you can
> just go into that tree and type 'make uninstall'.
>
> Hope this helps,
>
> Tim
>
>> jody wrote:
>>> Now that the PATHs seem to be set correctly for
>>> ssh, I don't know what the problem could be.
>>>
>>> Is the error message still the same as in the first mail?
>>> Did you do the environment/sshd_config changes on both machines?
>>> What shell are you using?
>>>
>>> One other test you could try is to start your application
>>> with the --prefix option:
>>>
>>> $mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>
>>> (assuming your Open MPI installation lies in /opt/openmpi
>>> on both machines)
>>>
>>>
>>> Jody
>>>
>>> On 10/1/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>> Hi Jody,
>>>> I did the steps as you said, but it didn't work for me.
>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>> made the changes to sshd_config.
>>>>
>>>> But all this didn't solve my problem, although the paths seemed to be
>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I also
>>>> restarted the ssh server; the error is the same.
>>>>
>>>> Hope you can help me out here and thanks for your help so far
>>>> dino
>>>>
>>>> jody wrote:
>>>>> Dino -
>>>>> I had a similar problem.
>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>> in the file ~/.ssh/environment on the client and setting
>>>>> PermitUserEnvironment yes
>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>> privileges though)
>>>>>
>>>>> To be on the safe side, I did both on all my nodes
>>>>>
>>>>> Jody
>>>>>
>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>> Hi Jody,
>>>>>>
>>>>>> Thanks for your help, it really is the case that neither in PATH nor in
>>>>>> LD_LIBRARY_PATH is the path to the libs set correctly. I'll try it out,
>>>>>> hope it works.
>>>>>>
>>>>>> jody wrote:
>>>>>>> Hi Dino
>>>>>>>
>>>>>>> Try
>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>
>>>>>>> from your host sun to see what your environment variables are when
>>>>>>> ssh is run without a shell.
>>>>>>>
>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>
>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>> root_at_sun:~# mpirun -H sun,saturn main
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> The program runs fine on each node alone (mpirun -np 2 main)
>>>>>>>>
>>>>>>>> My PathVariables:
>>>>>>>> $PATH
>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>
>>>>>>>> Passwordless ssh is up and running.
>>>>>>>>
>>>>>>>> I went through the FAQ and the mailing lists but couldn't find any
>>>>>>>> solution for my problem.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dino R.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users