Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-28 08:45:37


On Apr 28, 2011, at 6:04 AM, Jeff Squyres wrote:

> I do note that you are using an ancient version of Open MPI (1.2.8).

I don't think that is accurate - at least, the output doesn't match that old a version. The process name format is indicative of something 1.3 or more recent.

What led you to conclude 1.2.8?
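
For anyone who wants to check a log mechanically, here is a minimal sketch of the distinction I mean. It is not part of Open MPI: the function name and the two regular expressions are made up for illustration, and the flat-triple pattern is my assumption about how the 1.2 series printed process names (something like [0,1,0]). The nested pattern matches the names in the output quoted below.

```python
import re

# Nested form, as seen in the quoted log: [[61748,0],32]
NESTED_NAME = re.compile(r"\[\[(\d+),(\d+)\],(\d+)\]")

# Flat triple form, e.g. [0,1,0]. ASSUMPTION: this is how older
# (1.2-era) releases printed process names.
FLAT_NAME = re.compile(r"\[(\d+),(\d+),(\d+)\]")


def classify_name_format(line):
    """Return 'nested', 'flat', or 'unknown' for one line of log output."""
    if NESTED_NAME.search(line):
        return "nested"
    if FLAT_NAME.search(line):
        return "flat"
    return "unknown"


if __name__ == "__main__":
    sample = ("[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: "
              "Communication retries exceeded. Can not communicate with peer")
    print(classify_name_format(sample))
```

A "nested" result on the lines from the original report is what suggests something other than 1.2.8 actually produced the output.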

> Is there any way you can upgrade to a (much) later version, such as 1.4.3? That might improve your TCP connectivity -- we made improvements in those portions of the code over the years.
>
> On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:
>
>>
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>>
>>> No, we do not have a firewall turned on. I can run smaller 96-slave cases
>>> with ln10 and ln13 included on the slavelist.
>>>
>>> Could there be another reason for this to fail?
>>
>> What is in "procgroup"? Is it a single application?
>>
>> Offhand, there is nothing in OMPI that would explain the problem. The only possibility I can think of would be if your "procgroup" file contains more than 128 applications in it.
>>
>>>
>>>
>>> Sincerely,
>>>
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>>
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>>> Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>>
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>>
>>>
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>>
>>>> Hi,
>>>> I am getting an "oob-tcp: Communication retries exceeded" error
>>>> message when I run a 238-slave MPI code:
>>>>
>>>>
>>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to start the specified application as it encountered
>>>> an error:
>>>>
>>>> Error name: Unknown error: 1
>>>> Node: ln10
>>>>
>>>> when attempting to start process rank 234.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>
>>>> Any help would be greatly appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Waris Sindhi
>>>> High Performance Computing, TechApps
>>>> Pratt & Whitney, UTC
>>>> (860)-565-8486
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>