On Dec 15, 2010, at 10:14 AM, Gilbert Grosdidier wrote:

Hello Ralph,

 Thanks for taking the time to help me.

On Dec 15, 2010, at 4:27 PM, Ralph Castain wrote:

It would appear that there is something trying to talk to a socket opened by one of your daemons. At a guess, I would bet the problem is that a prior job left a daemon alive that is talking on the same socket.

gg= At first glance, this could be possible, although I found no evidence
of it when looking for ghost processes of mine on the relevant nodes.


Are you by chance using static ports for the job?

gg= How could I know that?
Is there an easy way to work around these static ports?
Would that prevent jobs from colliding with ghost jobs/processes as suggested below, please?
I did not spot any info about static ports in the ompi_info output ... ;-)

It wouldn't happen by default - you would have had to tell us to use static ports by specifying an OOB port range. If you didn't do that (and remember, it could be in a default MCA param file!), then the ports are dynamically assigned.
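
If you want to double-check, something along these lines should show whether any port-related OOB setting is in effect (just a sketch: the exact parameter names vary across versions, so I'm grepping rather than guessing them, and <openmpi-prefix> is a placeholder for your install prefix):

  # list the OOB/TCP MCA parameters currently in effect and look for port settings
  ompi_info --param oob tcp | grep -i port

  # check the default MCA param files for anything port-related
  grep -i port $HOME/.openmpi/mca-params.conf 2>/dev/null
  grep -i port <openmpi-prefix>/etc/openmpi-mca-params.conf 2>/dev/null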


Did you run another job just before this one that might have left a daemon somewhere?

gg= Again, it is possible that, with my many jobs crashing across the cluster,
PBS was unable to clean up the nodes in time before starting a new one.
But I have no evidence of that.
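
Something along these lines, run on the nodes named in the log, should reveal any stale daemons (just a plain ssh sketch; the node name is the one from the error message, and a PBS epilogue could do the same cluster-wide):

  # look for leftover Open MPI daemons or stranded MPI processes on a suspect node
  ssh r36i3n15 'ps -ef | egrep "orted|mpirun" | grep -v grep'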

 The exact, full error message was:
[r36i3n15:18992] [[1468,0],254]-[[1468,0],14] mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[1468,1],1643]

 From some debug info I gathered, process 1468 seems to relate to node rank 0 (r33i0n0),
while process 1643 seems to originate from node r36i0n14.

The "1468,1" is an arbitrary identifier for the overall job. The "1643" indicates that it is an MPI process (rank=1643) within that job that provided the bad identifier.

The "1468,0" identifiers in the early part of the message indicate that the error occurred on a port being used by two ORTE daemons for communication. Somehow, an MPI process (rank=1643) injected a message into that link.

It looks like all the messages are flowing within a single job (all three processes mentioned in the error share the same job family identifier, 1468). The only possibility I can think of is that somehow you are reusing ports - is it possible your system doesn't have enough ports to support all the procs?
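
A quick sanity check on one of the nodes (assuming Linux, which the Altix should be running) would be something like:

  # ephemeral port range available for dynamically assigned sockets
  cat /proc/sys/net/ipv4/ip_local_port_range

  # rough count of TCP sockets already in use on the node
  netstat -tan | wc -l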

I confess I'm a little at a loss - I've never seen this problem before, and we run on very large clusters.



 But, indeed, none of r33i0n0, r36i0n14, or r36i3n15 shows any process like 1468 or 1643,
while process 18992 is indeed the master one on r36i3n15.

 Thanks,   Best,    G.




On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:

Hello,

Running with Open MPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
this error message right at startup:
mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209]

and the whole job then spins for an indefinite period, without crashing or aborting.

What could be the culprit, please?
Is there a workaround?
Which parameter should be tuned?

Thanks in advance for any help,    Best,    G.






_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users