Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-07-27 13:13:46


Are you not using the built-in OMPI support for Torque? The ssh keys
should be irrelevant if using the TM API in Torque (i.e., OMPI won't
be using ssh to launch remote processes; we use the internal TM API
in Torque).
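
A quick way to check whether an Open MPI installation was built with TM support is to look for "tm" components in the ompi_info output. The sketch below assumes ompi_info is on your PATH. The helper name has_tm_component is made up for illustration, and the framework that lists tm (pls and ras in the 1.2 series) varies by Open MPI version, so the pattern matches any framework name.

```shell
# Succeeds if ompi_info-style text on stdin lists a "tm" MCA component.
# (Hypothetical helper; the framework name is deliberately left open.)
has_tm_component() {
    grep -Eq 'MCA [a-z]+: tm '
}

# Only attempt the check where Open MPI is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_component; then
        echo "TM components found"
    else
        echo "no TM components found; mpirun will use rsh/ssh instead"
    fi
fi
```

The pls_rsh_module.c lines in the quoted error output below suggest that the rsh/ssh launcher was selected for that run rather than TM.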

On Jul 27, 2007, at 11:38 AM, Adams, Samuel D Contr AFRL/HEDR wrote:

> I deleted all of the entries out of the known_hosts file, but that
> didn't seem to help. I can run jobs just fine without torque on
> multiple nodes. I can also ssh to all nodes without using passwords,
> so I am not sure what the deal is.
>
> ...
>
> Okay, I found the problem. The keys that I had in known_hosts were
> only for the short hostnames, e.g. prodnode2, whereas torque was
> using fully qualified names, e.g. prodnode2.brooks.af.mil, and no
> keys existed for the fully qualified names.
>
> Thanks for the help.
>
> Sam Adams
> General Dynamics Information Technology
> Phone: 210.536.5945
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org]
> On Behalf Of George Bosilca
> Sent: Friday, July 27, 2007 10:13 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] torque and openmpi
>
> The key is in the first line of the provided output. One of the
> connections failed because of a wrong ssh key. Clean your
> .ssh/known_hosts and the problem will vanish.
>
> Thanks,
> george.
>
> On Jul 27, 2007, at 11:01 AM, Adams, Samuel D Contr AFRL/HEDR wrote:
>
>> When I run jobs with torque, I get this error message. Any ideas?
>>
>> [sam_at_prodnode1 all]$ cat script.sh.err
>> Host key verification failed.
>> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [prodnode3.brooks.af.mil:03321] ERROR: A daemon on node prodnode2.brooks.af.mil failed to start as expected.
>> [prodnode3.brooks.af.mil:03321] ERROR: There may be more information available from
>> [prodnode3.brooks.af.mil:03321] ERROR: the remote shell (see above).
>> [prodnode3.brooks.af.mil:03321] ERROR: The daemon exited unexpectedly with status 255.
>> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [prodnode3.brooks.af.mil:03321] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
>>
>> Sam Adams
>> General Dynamics Information Technology
>> Phone: 210.536.5945
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
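
If you do stay with the rsh/ssh launcher, the fix described above comes down to having known_hosts entries for both the short and the fully qualified name of every node. Here is a rough sketch, not something taken from this thread: the function names are hypothetical, and it assumes OpenSSH's ssh-keygen and ssh-keyscan are installed.

```shell
# Print the short and the fully qualified form of each node name.
# $1 is the DNS domain; the remaining arguments are short hostnames.
node_name_forms() {
    domain=$1; shift
    for host in "$@"; do
        printf '%s\n%s.%s\n' "$host" "$host" "$domain"
    done
}

# Print every name that has no entry in the given known_hosts file.
# ssh-keygen -F prints the matching entries, so empty output means "missing";
# it also copes with hashed known_hosts entries.
missing_host_keys() {
    known_hosts=$1; shift
    for name in "$@"; do
        found=$(ssh-keygen -F "$name" -f "$known_hosts" 2>/dev/null)
        [ -n "$found" ] || echo "$name"
    done
}

# Example (not run here): list the missing names, then fetch their keys.
#   names=$(node_name_forms brooks.af.mil prodnode2 prodnode3)
#   missing_host_keys ~/.ssh/known_hosts $names
#   ssh-keyscan $names >> ~/.ssh/known_hosts
```

Keys collected with ssh-keyscan are not authenticated, so it is worth comparing them against the nodes' real host keys before trusting them.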

-- 
Jeff Squyres
Cisco Systems