Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
From: Sindhi, Waris PW (Waris.Sindhi_at_[hidden])
Date: 2011-04-27 15:31:33


No, we do not have a firewall turned on. I can run smaller 96-slave cases
with ln10 and ln13 both included in the slavelist.

Could there be another reason for this to fail?

Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph Castain
Sent: Wednesday, April 27, 2011 2:18 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded

Perhaps a firewall? All it is telling you is that mpirun couldn't
establish TCP communications with the daemon on ln10.
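
One way to test the diagnosis above independently of Open MPI is to try a plain TCP connection from the node running mpirun to the node in question. The following is a minimal sketch, not part of Open MPI: the helper name can_connect is hypothetical, and port 22 is only an assumed stand-in for "some port known to be listening", since the out-of-band ports Open MPI uses are normally chosen at run time.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host names come from the thread; port 22 (ssh) is an assumption.
    for host in ("ln10", "ln13"):
        state = "reachable" if can_connect(host, 22) else "unreachable"
        print(f"{host}: {state}")
```

A successful connection on one port does not prove that every port is open, so this only rules out the coarsest failures (host down, name not resolving, all traffic blocked).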

On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:

> Hi,
> I am getting an "oob-tcp: Communication retries exceeded" error
> message when I run a code with 238 MPI slaves:
>
>
> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered
> an error:
>
> Error name: Unknown error: 1
> Node: ln10
>
> when attempting to start process rank 234.
>
> --------------------------------------------------------------------------
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>
> Any help would be greatly appreciated.
>
> Sincerely,
>
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
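
The oob-tcp lines quoted above all share one shape, so it can help to pull out which peer is failing and how often. The sketch below is an illustration only: parse_oob_failure and OobFailure are hypothetical names, and the reading of the fields (a "[host:pid]" prefix, followed by two process names whose last number identifies the local and the peer daemon) is an assumption based on how the lines look, not something confirmed in the thread.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class OobFailure(NamedTuple):
    host: str   # host that printed the message (assumed meaning)
    pid: int    # process id on that host (assumed meaning)
    local: int  # last number of the first process name
    peer: int   # last number of the second process name


# \s+ between words tolerates the line wrapping added by mail clients.
_PATTERN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[\d+,\d+\],(?P<local>\d+)\]-\[\[\d+,\d+\],(?P<peer>\d+)\]\s+"
    r"oob-tcp: Communication\s+retries\s+exceeded"
)


def parse_oob_failure(line: str) -> Optional[OobFailure]:
    """Return the parsed fields, or None if the line is not an oob-tcp retry error."""
    m = _PATTERN.search(line)
    if m is None:
        return None
    return OobFailure(m["host"], int(m["pid"]), int(m["local"]), int(m["peer"]))


if __name__ == "__main__":
    log = [
        "[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication "
        "retries exceeded. Can not communicate with peer",
        "[ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file "
        "orted/orted_comm.c at line 130",
    ]
    failures = [f for f in map(parse_oob_failure, log) if f is not None]
    # Count how many times each peer shows up as unreachable.
    print(Counter(f.peer for f in failures))
```

In the log quoted in this thread every failure names the same peer (32), which is consistent with a problem reaching one specific daemon rather than a general network fault.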