Open MPI User's Mailing List Archives

From: Dino Rossegger (dino.rossegger_at_[hidden])
Date: 2007-10-04 02:02:45


I'll try reinstalling Open MPI on an NFS device; maybe it will work then.

Thanks for your help
dino

Tim Prins wrote:
> Unfortunately, I am out of ideas on this one. It is very strange. Maybe
> someone else has an idea.
>
> I would recommend trying to install Open MPI again. First be sure to get
> rid of all of the old installs of Open MPI from all your nodes, then
> reinstall and try again.
>
> Tim
>
> Dino Rossegger wrote:
>> Here are the syntax and output of the command:
>> root@sun:~# mpirun --hostfile hostfile saturn
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>> line 1164
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:28777] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:28777] ERROR: There may be more information available from
>> [sun:28777] ERROR: the remote shell (see above).
>> [sun:28777] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 188
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>> line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
>>
>> I'm using version 1.2.3, got it from openmpi.org. I'm using the same
>> version of openmpi on all nodes.
>>
>> Thanks
>> dino
>>
>> Tim Prins wrote:
>>> This is very odd. The daemon is being launched properly, but then things
>>> get strange. It looks like mpirun is sending a message to kill
>>> application processes on saturn.
>>>
>>> What version of Open MPI are you using?
>>>
>>> Are you sure that the same version of Open MPI is being used everywhere?
>>>
>>> Can you try:
>>> mpirun --hostfile hostfile hostname
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>> Dino Rossegger wrote:
>>>> Hi again,
>>>>
>>>> Tim Prins wrote:
>>>>> Hi,
>>>>>
>>>>> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> Yes the error output is the same:
>>>>>> root@sun:~# mpirun --hostfile hostfile main
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>> base/pls_base_orted_cmds.c at line 275
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>>>>> line 1164
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>>>>>> [sun:23748] ERROR: There may be more information available from
>>>>>> [sun:23748] ERROR: the remote shell (see above).
>>>>>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>> base/pls_base_orted_cmds.c at line 188
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>>>>> line 1196
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>> Can you try:
>>>>> mpirun --debug-daemons --hostfile hostfile main
>>>>>
>>>> I did, but it doesn't give me any special output (as far as I can see).
>>>> Here's the output:
>>>> root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
>>>> Daemon [0,0,1] checking in as pid 27168 on host sun
>>>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>> [sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>> [sun:27167] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:27167] ERROR: There may be more information available from
>>>> [sun:27167] ERROR: the remote shell (see above).
>>>> [sun:27167] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>> [sun:27168] [0,0,1] orted_recv_pls: received exit
>>>>
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>>> This may give more output about the error. Also, try
>>>>> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>>>> Here's the output, but I can't decipher it ^^
>>>> root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>>>> [sun:27175] pls:rsh: local csh: 0, local sh: 1
>>>> [sun:27175] pls:rsh: assuming same remote shell as local shell
>>>> [sun:27175] pls:rsh: remote csh: 0, remote sh: 1
>>>> [sun:27175] pls:rsh: final template argv:
>>>> [sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>>>> [sun:27175] pls:rsh: launching on node sun
>>>> [sun:27175] pls:rsh: sun is a LOCAL node
>>>> [sun:27175] pls:rsh: changing to directory /root
>>>> [sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>>>> [sun:27175] pls:rsh: launching on node saturn
>>>> [sun:27175] pls:rsh: saturn is a REMOTE node
>>>> [sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 275
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>>> line 1164
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>> [sun:27175] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:27175] ERROR: There may be more information available from
>>>> [sun:27175] ERROR: the remote shell (see above).
>>>> [sun:27175] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 188
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
>>>> line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>>> This will print out the exact command that is used to launch the orted.
>>>>>
>>>>> Also, I would highly recommend not running Open MPI as root. It is just a bad
>>>>> idea.
>>>> Yes I know, I'm doing it just now for testing.
>>>>>> I wrote the following to my .ssh/environment (on all machines):
>>>>>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>>>>>
>>>>>> PATH=$PATH:/usr/local/lib;
>>>>>>
>>>>>> export LD_LIBRARY_PATH;
>>>>>> export PATH;
>>>>>>
>>>>>> and added the statement you told me to the sshd_config (on all machines):
>>>>>> PermitUserEnvironment yes
>>>>>>
>>>>>> And it seems to me that the paths are correct now.
>>>>>>
>>>>>> My shell is bash (/bin/bash)
>>>>>>
>>>>>> When running locate orted (to find out where exactly my Open MPI
>>>>>> installation is, since I compiled with the defaults), I saw that on sun there was a
>>>>>> /usr/bin/orted while there wasn't one on saturn.
>>>>>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>>>>>> /usr/local/ (which seems to be my installation directory), but it
>>>>>> didn't work (same error).
>>>>> Is it possible that you are mixing 2 different installations of Open MPI? You
>>>>> may consider installing Open MPI to an NFS drive to make these things a bit
>>>>> easier.
>>>>>> Is there a script or anything like that with which I can uninstall
>>>>>> Open MPI? I might try a new build installed to /opt/openmpi, since
>>>>>> it doesn't look like I will be able to solve the problem otherwise.
>>>>> If you still have the tree around that you used to 'make' Open MPI, you can
>>>>> just go into that tree and type 'make uninstall'.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Tim
>>>>>
>>>>>> jody wrote:
>>>>>>> Now that the PATHs seem to be set correctly for
>>>>>>> ssh, I don't know what the problem could be.
>>>>>>>
>>>>>>> Is the error message still the same as in the first mail?
>>>>>>> Did you make the environment/sshd_config changes on both machines?
>>>>>>> What shell are you using?
>>>>>>>
>>>>>>> One other test you could run is to start your application
>>>>>>> with the --prefix option:
>>>>>>>
>>>>>>> $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>>>>>
>>>>>>> (assuming your Open MPI installation lies in /opt/openmpi
>>>>>>> on both machines)
>>>>>>>
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On 10/1/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>> Hi Jody,
>>>>>>>> I followed the steps as you described, but it didn't work for me.
>>>>>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>>>>>> made the changes to sshd_config.
>>>>>>>>
>>>>>>>> But none of this solved my problem, although the paths seemed to be
>>>>>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I also
>>>>>>>> restarted the ssh server; the error is the same.
>>>>>>>>
>>>>>>>> Hope you can help me out here and thanks for your help so far
>>>>>>>> dino
>>>>>>>>
>>>>>>>> jody wrote:
>>>>>>>>> Dino -
>>>>>>>>> I had a similar problem.
>>>>>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>>>>>> in the file ~/.ssh/environment on the client and setting
>>>>>>>>> PermitUserEnvironment yes
>>>>>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>>>>>> privileges, though).
>>>>>>>>>
>>>>>>>>> To be on the safe side, I did both on all my nodes.
>>>>>>>>>
>>>>>>>>> Jody
>>>>>>>>>
>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>>>> Hi Jody,
>>>>>>>>>>
>>>>>>>>>> Thanks for your help. It really is the case that neither PATH nor
>>>>>>>>>> LD_LIBRARY_PATH has the path to the libs set correctly. I'll try it out;
>>>>>>>>>> I hope it works.
>>>>>>>>>>
>>>>>>>>>> jody wrote:
>>>>>>>>>>> Hi Dino
>>>>>>>>>>>
>>>>>>>>>>> Try
>>>>>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>>>>>
>>>>>>>>>>> from your host sun to see what your environment variables are when
>>>>>>>>>>> ssh is run without a shell.
>>>>>>>>>>>
>>>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rossegger_at_[hidden]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is an excerpt of the command and the error:
>>>>>>>>>>>> root@sun:~# mpirun -H sun,saturn main
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main).
>>>>>>>>>>>>
>>>>>>>>>>>> My path variables:
>>>>>>>>>>>> echo $PATH
>>>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>>>
>>>>>>>>>>>> Passwordless ssh is up 'n running
>>>>>>>>>>>>
>>>>>>>>>>>> I walked through the FAQ and Mailing Lists but couldn't find any
>>>>>>>>>>>> solution for my problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dino R.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
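
A note on the ~/.ssh/environment file quoted in this thread: it uses shell syntax (export statements, a $PATH reference, trailing semicolons). As far as the sshd(8) manual describes it, that file may contain only empty lines, comment lines starting with '#', and assignments of the form name=value, and it is not run through a shell, so lines like those probably do not have the intended effect. Whether that contributed to the timeout reported here is not established by the thread. The following is a minimal sketch of a checker for such a file; the function name check_ssh_env is hypothetical, and the checks are heuristics rather than a reimplementation of sshd's parser.

```shell
#!/bin/sh
# check_ssh_env FILE (hypothetical helper, not part of OpenSSH or Open MPI):
# print a message for each line that does not look like a plain NAME=value
# assignment and return non-zero if any such line was found.
check_ssh_env() {
    status=0
    lineno=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        case "$line" in
            ''|'#'*)
                continue ;;   # empty lines and comments are allowed
            export\ *)
                echo "line $lineno: 'export' is shell syntax, not NAME=value"
                status=1 ;;
            *'$'*)
                echo "line $lineno: variable references are not expanded"
                status=1 ;;
            *';')
                echo "line $lineno: trailing ';' would become part of the value"
                status=1 ;;
            [A-Za-z_]*=*)
                ;;            # looks like a plain assignment
            *)
                echo "line $lineno: not a NAME=value assignment"
                status=1 ;;
        esac
    done < "$1"
    return $status
}
```

For example, running check_ssh_env ~/.ssh/environment on each node prints one message per suspicious line and returns non-zero if it found any.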