Open MPI User's Mailing List Archives

From: Tim Campbell (tim.campbell_at_[hidden])
Date: 2007-09-13 13:24:43


Thanks.

I think I figured out the problem. My .ssh/known_hosts contained
several "bad" keys associated with some of the machines in the
gridengine pool. My hypothesis is that when mpirun was establishing the
connection topology of the processes, some process pairs failed to
complete the connection because of the bad ssh keys. I don't have
explicit evidence for this, since no ssh error output was generated.
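
One way to gather more direct evidence about a refused connection is to
decode the errno and then probe a TCP port between the two nodes. The
following is only a minimal sketch: the helper names describe_errno and
probe are made up for illustration, and the port to test is a
placeholder, since this thread does not say which port a given Open MPI
process listens on.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return (symbolic name, message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def probe(host, port, timeout=3.0):
    """Attempt a TCP connect; return 0 on success, else the errno.

    Name-resolution failures are not covered here: they raise
    socket.gaierror instead of returning an errno.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port))
```

On Linux, describe_errno(111) should report ECONNREFUSED, which matches
the "Connection refused" message from the perl one-liner quoted below.
Running probe("sr70", port) from the node that logged the error would
show whether that particular port refuses connections, independently of
whether ssh to the same host works.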

I generated new keys for all the amd64 machines in the gridengine
pool for which there was an offending key. Now my job runs with a
set of machines that includes ones that had previously failed. I
will assume for now that the problem is fixed.
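
For anyone repeating this cleanup, the stale entries can be found by
comparing known_hosts against keys freshly collected from the machines.
This is a sketch under assumptions, not the procedure actually used
above: it assumes plain (unhashed) known_hosts lines, treats a "bad" key
as one that differs from the key the host currently presents, and the
function names parse_entries and stale_hosts are hypothetical.

```python
def parse_entries(lines):
    """Map (host, key type) -> key blob for known_hosts-style lines."""
    entries = {}
    for raw in lines:
        fields = raw.split()
        # Skip blank or malformed lines, comments (#), marker lines (@)
        # and hashed host names (|), which cannot be read back.
        if len(fields) < 3 or fields[0][0] in "#@|":
            continue
        names, key_type, blob = fields[:3]
        for host in names.split(","):
            entries[(host, key_type)] = blob
    return entries


def stale_hosts(known_lines, scanned_lines):
    """Return hosts whose recorded key differs from the scanned key."""
    known = parse_entries(known_lines)
    current = parse_entries(scanned_lines)
    return sorted({
        host
        for (host, key_type), blob in known.items()
        if (host, key_type) in current and current[(host, key_type)] != blob
    })
```

The scanned lines could come from the output of ssh-keyscan run against
the pool machines, and each host reported by stale_hosts could then be
dropped from known_hosts with ssh-keygen -R <host>. Keys obtained by
scanning are only as trustworthy as the network they were fetched over.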

~Tim

On Sep 13, 2007, at 12:06 PM, Pak Lui wrote:

> Hi Tim,
>
> You could try setting -mca pls_gridengine_verbose 1 to show whether SGE
> is able to start the ORTE daemons on the remote nodes successfully.
>
> It seems you are having the same problem another user asked about
> previously. You may want to follow this thread and check your ifconfig
> settings to see if anything looks suspicious:
> http://www.open-mpi.org/community/lists/users/2007/02/2669.php
>
> My 2 cents...
>
> Tim Campbell wrote:
>> Greetings,
>>
>> I am using OpenMPI v1.2.3 via SGE on a network of amd64
>> workstations. When mpirun tries to start the processes on certain
>> nodes I get the following error output.
>>
>> [sr70][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
>> [sr71][0,1,3][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
>>
>> Using perl -e 'die$!=111' I see that the error message is "Connection
>> refused". I am able to connect to both nodes in question via ssh
>> and/or rsh. I changed btl_base_debug to 2, but that did not provide
>> additional information.
>>
>> What are some possible issues that might be causing this? What can I
>> do to get more information?
>>
>> Thanks,
>> ~Tim
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
>
> - Pak Lui
> pak.lui_at_[hidden]