Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-12-15 12:45:32


On Dec 15, 2010, at 10:14 AM, Gilbert Grosdidier wrote:

> Hello Ralph,
>
> Thanks for taking time to help me.
>
> On Dec 15, 2010, at 16:27, Ralph Castain wrote:
>
>> It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on the same socket.
>
> gg= At first glance, this could be possible, although I found no evidence
> of it when looking for ghost processes of mine on the relevant nodes.
>
>>
>> Are you by chance using static ports for the job?
>
> gg= How could I know that?
> Is there an easy way to work around these static ports?
> Would it prevent the job from colliding with ghost jobs/processes as suggested below, please?
> I did not spot any info about static ports in the ompi_info output ... ;-)

It wouldn't happen by default - you would have had to tell us to use static ports by specifying an OOB port range. If you didn't do that (and remember, it could be in a default mca param file!), then the ports are dynamically assigned.
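
For reference, one quick way to check is from the command line. This is just a sketch - the exact oob_tcp parameter names differ between Open MPI versions, and the install prefix below is a placeholder for wherever your 1.4.3 lives:

  # List the oob/tcp MCA parameters and look for anything port-related
  ompi_info --param oob tcp | grep -i port

  # Check the default MCA parameter files, which are read even when
  # nothing is passed on the mpirun command line
  # (replace /path/to/openmpi-1.4.3 with your actual install prefix)
  grep -i port $HOME/.openmpi/mca-params.conf \
       /path/to/openmpi-1.4.3/etc/openmpi-mca-params.conf 2>/dev/null

If neither turns up a port setting, then the OOB ports are being assigned dynamically.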

>
>> Did you run another job just before this one that might have left a daemon somewhere?
>
> gg= Again, it is possible that, with my many jobs crashing on the cluster,
> PBS was unable to clean up the nodes in time before starting a new one.
> But I have no evidence.
>
> The exact full error message was like this:
> [r36i3n15:18992] [[1468,0],254]-[[1468,0],14] mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[1468,1],1643]
>
> From some debug info I got, process 1468 seems to relate to node rank 0 (r33i0n0),
> while process 1643 seems to originate from node r36i0n14.

The "1468,1" is an arbitrary identifier for the overall job. The "1643" indicates that it is an MPI process (rank=1643) within that job that provided the bad identifier.

The "1468,0" identifiers in the early part of the message indicate that the error occurred on a port being used by two ORTE daemons for communication. Somehow, an MPI process (rank=1643) injected a message into that link.

It looks like all the messages are flowing within a single job (all three processes mentioned in the error have the same identifier). The only possibility I can think of is that somehow you are reusing ports - is it possible your system doesn't have enough ports to support all the procs?
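
If you want to sanity-check the port question, something like this on a compute node would show the ephemeral port range and how many TCP connections are actually in use at startup (a rough sketch, assuming Linux nodes - adjust to taste):

  # Ephemeral port range available for dynamically-assigned sockets
  cat /proc/sys/net/ipv4/ip_local_port_range

  # Count TCP connections currently established on this node
  netstat -tan | grep -c ESTABLISHED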

I confess I'm a little at a loss - I've never seen this problem before, and we run on very large clusters.
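
On the leftover-daemon theory, one cheap check is to look for stray orted processes on the allocated nodes right before mpirun starts. Just a sketch, assuming passwordless ssh to the nodes and that $PBS_NODEFILE is available inside your job script:

  # Look for any orted daemons left over from a previous job
  for node in $(sort -u $PBS_NODEFILE); do
      echo "== $node =="
      ssh $node pgrep -fl orted
  done

If anything shows up there before your job launches, that would point back at incomplete cleanup by PBS.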

>
> But, indeed, none of r33i0n0, r36i0n14 or r36i3n15 shows any process like 1468 or 1643,
> while process 18992 is indeed the master one on r36i3n15.
>
> Thanks, Best, G.
>
>
>>
>>
>> On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
>>
>>> Hello,
>>>
>>> Running with Open MPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
>>> this error message right at startup:
>>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209]
>>>
>>> and the whole job then spins for an indefinite period, without crashing or aborting.
>>>
>>> What could be the culprit, please?
>>> Is there a workaround?
>>> Which parameter needs to be tuned?
>>>
>>> Thanks in advance for any help, Best, G.
>>>