On Dec 15, 2010, at 4:27 PM, Ralph Castain wrote:
It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on the same socket.
gg= At first glance, this could be possible, although I found no evidence
of it when looking for ghost processes of mine on the relevant nodes.
Are you by chance using static ports for the job?
gg= How could I know that?
Is there an easy way to work around these static ports?
Would it prevent the jobs from colliding with ghost jobs/processes as suggested below, please?
I did not spot any info about static ports in the ompi_info output ... ;-)
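One way to check would be to query the OOB TCP parameters directly. A minimal sketch, assuming Open MPI 1.4.x with ompi_info in the PATH (exact parameter names differ between versions, and the config-file path below is a typical default, not necessarily this cluster's):

```shell
# List the OOB TCP parameters; a static-port setting, if any, shows up here.
ompi_info --param oob tcp | grep -i port

# Static ports may also be set in a user-level MCA config file
# (path is a common default, adjust for your install):
grep -i port ~/.openmpi/mca-params.conf 2>/dev/null
```

If nothing port-related is set anywhere, the daemons are using dynamically assigned ports.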
Did you run another job just before this one that might have left a daemon somewhere?
gg= Again, it could be possible that, with my many jobs crashing over the cluster,
PBS was unable to clean up the nodes in time before starting a new one.
But I have no evidence.
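One way to gather evidence would be to sweep the allocation for leftover daemons right after a crash. A sketch, assuming pdsh is available and that $PBS_NODEFILE still points at the job's node list (both assumptions about this PBS setup):

```shell
# Look for orphaned Open MPI daemons left over from a previous job:
pdsh -w ^"$PBS_NODEFILE" 'pgrep -l orted'

# If any show up, kill them before resubmitting:
pdsh -w ^"$PBS_NODEFILE" 'pkill -u "$USER" orted'
```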
The exact full error message was like this:
[r36i3n15:18992] [[1468,0],254]-[[1468,0],14] mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[1468,1],1643]
From some debug info I got, process 1468 seems to relate to node rank 0 (r33i0n0),
while process 1643 seems to originate from node r36i0n14.
But, indeed, none of r33i0n0, r36i0n14 or r36i3n15 exhibits any process like 1468 or 1643,
while process 18992 is indeed the master one on r36i3n15.
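For what it's worth, if these identifiers follow ORTE's [[job family, local jobid], vpid] naming (my understanding for 1.4.x, not something confirmed in this thread), then 1468 would be the job family shared by every entry rather than a process ID, 254 and 14 would be daemon ranks in job 0, and 1643 the rank of an MPI process in job 1. A small sketch that splits such a name into its three fields:

```shell
# Split an ORTE process name of the form [[jobfam,jobid],vpid]
# into its three numeric fields, separated by spaces.
parse_orte_name() {
    echo "$1" | sed -E 's/\[\[([0-9]+),([0-9]+)\],([0-9]+)\]/\1 \2 \3/'
}

parse_orte_name "[[1468,1],1643]"   # -> 1468 1 1643
```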
Thanks, Best, G.
On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
Running with Open MPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
this error message right at startup:
mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209]
and the whole job then spins for an indefinite period, without crashing or aborting.
What could be the culprit, please?
Is there a workaround?
Which parameter should be tuned?
Thanks in advance for any help, Best, G.