Yes, the procgroup file has more than 128 applications in it.
% wc -l procgroup
239 procgroup
Is 128 the maximum number of applications that can be in a procgroup file?
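Note that `wc -l` counts every line, including blank lines and comments. As a rough cross-check, here is a minimal Python sketch that counts only the lines that look like application contexts. It assumes one context per non-blank line and `#` as the comment character; the sample contents and the names `count_app_contexts` and `SUSPECTED_LIMIT` are made up for illustration, and 128 is only the figure suggested in this thread, not a confirmed limit.

```python
# Figure suggested in this thread; not a confirmed Open MPI limit.
SUSPECTED_LIMIT = 128


def count_app_contexts(appfile_text):
    """Count lines that look like application contexts.

    Assumption: one context per non-blank line, '#' starts a comment.
    """
    count = 0
    for line in appfile_text.splitlines():
        # Drop anything after a '#', then ignore lines that end up empty.
        stripped = line.split("#", 1)[0].strip()
        if stripped:
            count += 1
    return count


# Hypothetical appfile contents, for illustration only.
sample = """\
# master process
-np 1 ./master
-np 1 ./slave

-np 1 ./slave  # trailing comment
"""

n = count_app_contexts(sample)
print(n, "application contexts; above suspected limit:", n > SUSPECTED_LIMIT)
```

To run it against the real file, pass `open("procgroup").read()` instead of `sample`.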
Sincerely,
Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
Sent: Wednesday, April 27, 2011 8:09 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
> No, we do not have a firewall turned on. I can run smaller 96-slave
> cases with ln10 and ln13 included in the slavelist.
>
> Could there be another reason for this to fail?
What is in "procgroup"? Is it a single application?
Offhand, there is nothing in OMPI that would explain the problem. The
only possibility I can think of would be if your "procgroup" file
contains more than 128 applications in it.
>
>
> Sincerely,
>
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 2:18 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>
> Perhaps a firewall? All it is telling you is that mpirun couldn't
> establish TCP communications with the daemon on ln10.
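One way to rule out basic connectivity problems is to try opening a TCP connection to the node by hand. The following is a minimal Python sketch, not an Open MPI tool; the helper name `can_connect` is made up for illustration. It only shows whether a given host and port accept TCP connections. The out-of-band ports used by mpirun and its daemons are typically chosen at run time, so a successful check against another port (for example, the ssh port) does not prove those ports are reachable.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        # create_connection resolves the host name and connects;
        # the socket is closed again as soon as the 'with' block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Example (assumes the node name from this thread and the default ssh port):
# print(can_connect("ln10", 22))
```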
>
>
> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>
>> Hi,
>> I am getting an "oob-tcp: Communication retries exceeded" error
>> message when I run a 238-slave MPI code:
>>
>>
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered
>> an error:
>>
>> Error name: Unknown error: 1
>> Node: ln10
>>
>> when attempting to start process rank 234.
>>
>> --------------------------------------------------------------------------
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
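With output this repetitive, it can help to tally which peers the failures actually point at. Below is a minimal Python sketch that counts the oob-tcp failure messages per pair of bracketed names. The pattern is based only on the lines shown above, and the helper name `failing_peers` is made up for illustration. Reading the first name as the reporting process and the second as the peer it could not reach is an assumption drawn from the message text.

```python
import re
from collections import Counter

# Matches e.g. "[[61748,0],0]-[[61748,0],32] oob-tcp:"
PEER_RE = re.compile(
    r"\[\[(\d+),(\d+)\],(\d+)\]-\[\[(\d+),(\d+)\],(\d+)\]\s+oob-tcp:"
)


def failing_peers(log_text):
    """Count oob-tcp failure messages per (reporter, peer) name pair."""
    counts = Counter()
    for m in PEER_RE.finditer(log_text):
        reporter = "[[%s,%s],%s]" % m.group(1, 2, 3)
        peer = "[[%s,%s],%s]" % m.group(4, 5, 6)
        counts[(reporter, peer)] += 1
    return counts


# Two of the lines from the output above, plus one line that should not match.
sample_log = (
    "[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication "
    "retries exceeded. Can not communicate with peer\n"
    "[ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file "
    "orted/orted_comm.c at line 130\n"
    "[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication "
    "retries exceeded. Can not communicate with peer\n"
)

for (reporter, peer), count in failing_peers(sample_log).items():
    print(reporter, "could not reach", peer, "x", count)
```

In the output above, every failure names the same pair, which suggests a single unreachable peer rather than a cluster-wide problem.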
>>
>> Any help would be greatly appreciated.
>>
>> Sincerely,
>>
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>