Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix
From: Gilbert Grosdidier (Gilbert.Grosdidier_at_[hidden])
Date: 2010-12-15 12:14:24


Hello Ralph,

  Thanks for taking the time to help me.

On 15 Dec 2010, at 16:27, Ralph Castain wrote:

> It would appear that there is something trying to talk to a socket
> opened by one of your daemons. At a guess, I would bet the problem
> is that a prior job left a daemon alive that is talking on the same
> socket.

gg= At first glance, this could be possible, although I found no evidence
of it when looking for ghost processes of mine on the relevant nodes.

>
> Are you by chance using static ports for the job?

gg= How can I tell?
Is there an easy way to work around these static ports?
Would that keep my jobs from colliding with ghost jobs/processes, as
suggested below?
I did not spot any information about static ports in the ompi_info
output ... ;-)
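
One way to check whether any port-related MCA parameter was set without my noticing is to scan the environment, since Open MPI picks up MCA parameters from variables prefixed with `OMPI_MCA_`. The sketch below is only a partial check: parameters can also come from the mpirun command line or from parameter files such as `$HOME/.openmpi/mca-params.conf`, which it does not read. The function name `mca_port_settings` is made up for illustration, and it deliberately matches any parameter name containing "port" rather than assuming a specific parameter name.

```python
import os


def mca_port_settings(environ=None):
    """Return OMPI_MCA_* variables whose name mentions 'port'.

    `environ` defaults to the current process environment; pass a dict
    to inspect a captured environment instead.
    """
    env = os.environ if environ is None else environ
    prefix = "OMPI_MCA_"
    return {
        key[len(prefix):]: value          # strip the prefix for readability
        for key, value in env.items()
        if key.startswith(prefix) and "port" in key.lower()
    }


if __name__ == "__main__":
    settings = mca_port_settings()
    if settings:
        for name, value in sorted(settings.items()):
            print(f"{name} = {value}")
    else:
        print("No port-related MCA parameter found in the environment.")
```

An empty result only says that nothing was set through the environment; it does not rule out a setting made elsewhere.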

> Did you run another job just before this one that might have left a
> daemon somewhere?

gg= Again, with so many of my jobs crashing across the cluster, it is
possible that PBS was unable to clean up the nodes in time before
starting a new one.
But I have no evidence.

  The full error message was:
[r36i3n15:18992] [[1468,0],254]-[[1468,0],14]
mca_oob_tcp_peer_recv_connect_ack: received unexpected process
identifier [[1468,1],1643]
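
To make the identifiers in such a message easier to compare, here is a minimal Python sketch that pulls them apart. It assumes the usual ORTE naming layout, `[[job family, local job id], vpid]`, with local job 0 denoting the daemons; that reading is an assumption and is not confirmed anywhere in this thread. The function name `parse_names` and the dictionary keys are illustrative only.

```python
import re

# Matches names printed as [[job_family,local_job],vpid].
# The leading "[host:pid]" tag uses single brackets and is not matched.
NAME_RE = re.compile(r"\[\[(\d+),(\d+)\],(\d+)\]")


def parse_names(message):
    """Return every process name found in `message` as a dict of three ints."""
    return [
        {"job_family": int(family), "local_job": int(job), "vpid": int(vpid)}
        for family, job, vpid in NAME_RE.findall(message)
    ]


if __name__ == "__main__":
    line = (
        "[r36i3n15:18992] [[1468,0],254]-[[1468,0],14] "
        "mca_oob_tcp_peer_recv_connect_ack: received unexpected process "
        "identifier [[1468,1],1643]"
    )
    for name in parse_names(line):
        # Assumption: local job 0 is the daemon job, anything else an application job.
        kind = "daemon" if name["local_job"] == 0 else "application process"
        print(name, kind)
```

Under that assumption, all three names in the message share job family 1468, and the unexpected one differs from the other two only in its local job id and vpid.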

  From some debug info I got, process 1468 seems to relate to node
rank 0 (r33i0n0),
while process 1643 seems to originate from node r36i0n14.

  But none of r33i0n0, r36i0n14 or r36i3n15 shows any
process like 1468 or 1643,
while process 18992 is indeed the master one on r36i3n15.
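
The search for ghost processes can be sketched as a small Python helper that filters `ps` output for leftover Open MPI launcher or daemon processes (`mpirun`, `orted`). It only inspects the node it runs on, so it would have to be run on each suspect node, for example through the batch system or ssh. The function names are hypothetical, and the `ps -eo pid,user,comm` column selection assumes a Linux-style `ps`.

```python
import subprocess


def find_daemons(ps_output, names=("orted", "mpirun")):
    """Return (pid, user, command) tuples for the listed command names.

    `ps_output` is the text produced by `ps -eo pid,user,comm`,
    including its header line.
    """
    hits = []
    for line in ps_output.splitlines()[1:]:      # skip the header line
        fields = line.split(None, 2)             # pid, user, command
        if len(fields) == 3 and fields[2].strip() in names:
            hits.append((int(fields[0]), fields[1], fields[2].strip()))
    return hits


def scan_local_node():
    """Run ps on the local node and report matching processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,user,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_daemons(result.stdout)


if __name__ == "__main__":
    for pid, user, command in scan_local_node():
        print(f"{pid}\t{user}\t{command}")
```

Any `orted` that shows up while no job of mine is supposed to be running on that node would be a candidate for the leftover daemon Ralph suspects.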

  Thanks, Best, G.

>
>
> On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
>
>> Hello,
>>
>> Running with Open MPI 1.4.3 on an SGI Altix cluster with 4096 cores,
>> I got
>> this error message right at startup:
>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process
>> identifier [[13816,0],209]
>>
>> and the whole job then spins for an indefinite period, without
>> crashing/aborting.
>>
>> What could be the culprit, please?
>> Is there a workaround?
>> Which parameter should be tuned?
>>
>> Thanks in advance for any help, Best, G.
>>