Hello Ralph,

Thanks for taking the time to help me.

On 15 Dec 2010, at 16:27, Ralph Castain wrote:

It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on the same socket.

gg= At first glance, this could be possible, although I found no evidence of it when looking for ghost processes of mine on the relevant nodes.

Are you by chance using static ports for the job?

gg= How could I know that? Is there an easy way to work around these static ports? Would that prevent jobs from colliding with ghost jobs/processes, as suggested below? I did not spot any info about static ports in the ompi_info output ... ;-)

Did you run another job just before this one that might have left a daemon somewhere?

gg= Again, it could be possible that, with my many jobs crashing over the cluster, PBS was unable to clean up the nodes in time before starting a new one. But I have no evidence.

The exact full error message was:

[r36i3n15:18992] [[1468,0],254]-[[1468,0],14] mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[1468,1],1643]

From some debug info I got, process 1468 seems to relate to node rank 0 (r33i0n0), while process 1643 seems to originate from node r36i0n14. But, indeed, none of r33i0n0, r36i0n14 or r36i3n15 shows any process like 1468 or 1643, while process 18992 is indeed the master one on r36i3n15.

Thanks,
Best, G.

On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:

Hello,

Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got this error message right at startup:

mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209]

and the whole job then spins for an indefinite period, without crashing or aborting.

What could be the culprit, please? Is there a workaround? Which parameter should be tuned?

Thanks in advance for any help,
Best, G.
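
The identifiers quoted in this thread all follow the pattern [[a,b],c]. When many such messages pile up across nodes, it can help to pull the identifiers out of each log line and compare them. Below is a minimal sketch in Python, assuming only that textual pattern; the helper name parse_names is hypothetical and not part of Open MPI, and the sketch deliberately does not interpret what the three integers mean.

```python
import re

# Matches identifiers such as [[1468,0],254]. The meaning of the three
# integers is not interpreted here; they are returned as a plain tuple.
NAME_RE = re.compile(r"\[\[(\d+),(\d+)\],(\d+)\]")


def parse_names(line):
    """Return every [[a,b],c] identifier found in a log line as int tuples."""
    return [tuple(int(g) for g in m.groups()) for m in NAME_RE.finditer(line)]


# The full error line quoted in this thread.
line = (
    "[r36i3n15:18992] [[1468,0],254]-[[1468,0],14] "
    "mca_oob_tcp_peer_recv_connect_ack: received unexpected "
    "process identifier [[1468,1],1643]"
)

# In this message the identifiers appear in the order:
# reporting side, its peer, and the unexpected one.
local, peer, unexpected = parse_names(line)
print(local, peer, unexpected)
```

For the line above, all three identifiers share the first integer (1468); the unexpected one differs from the other two in its second integer (1 rather than 0).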