It looks like something is trying to talk to a socket opened by one of your daemons. My guess is that a prior job left a daemon alive, and that it is still talking on the same socket.
Are you by chance using static ports for the job? Did you run another job just before this one that might have left a daemon somewhere?
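If you are using static ports, one quick way to test the leftover-daemon theory is to check, before launching, whether anything is already listening on those ports on a given node. Below is a minimal sketch in Python, not anything that ships with Open MPI; the helper name port_in_use and the port numbers in the example are hypothetical, so substitute whatever ports your job is actually configured to use.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so a 0 here means a listener (possibly a stale daemon) is present.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Hypothetical static ports; replace with the ones your job uses.
    for port in (10000, 10001):
        state = "in use" if port_in_use(port) else "free"
        print(f"port {port}: {state}")
```

A port reported as in use before the job starts would be consistent with a daemon left over from an earlier run, although it could equally belong to an unrelated service, so check which process owns it before killing anything.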
On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
> Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
> this error message, right at startup :
> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209]
> and the whole job is going to spin for an undefined period, without crashing/aborting.
> What could be the culprit please ?
> Is there a workaround ?
> Which parameter is to be tuned ?
> Thanks in advance for any help, Best, G.