Sigh - too early in the morning for this old brain, I fear...
You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd by someone other than the process that was supposed to recv the message. This should cause us to abort, but we were just talking on the phone about how the abort procedure may not be working correctly. Or it could be (as Jeff suggests) that the version mismatch is also preventing us from aborting properly.
So I fear we are back to trying to find these other versions on your nodes...
It's the "local rank", if that makes any difference...
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be the same MPI rank. That definitely won't work.
Any thoughts on this output?
[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
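For anyone sifting through a lot of these messages, here is a rough sketch of how the line above could be picked apart to see which process reported the problem and which identifier it received instead. It only handles this one message shape, the function and field names are made up for illustration, and treating the last number in each bracketed name as the rank is an assumption about the name format rather than something taken from the OMPI sources.

```python
import re

# Matches only the "received unexpected process identifier" message quoted above.
# Assumption: process names are printed as [[x,y],rank].
PATTERN = re.compile(
    r"^\[(?P<host>[^\]]+)\]"                          # [xserve03.local]
    r"\[\[(?P<rep_job>\d+,\d+)\],(?P<rep_rank>\d+)\]"  # [[61029,1],4]
    r"\[(?P<where>[^\]]+)\]\s+"                       # [btl_tcp_endpoint.c:486:...]
    r"received unexpected process identifier\s+"
    r"\[\[(?P<got_job>\d+,\d+)\],(?P<got_rank>\d+)\]$"  # [[61029,1],3]
)


def parse_unexpected_peer(line):
    """Return a dict describing the mismatch, or None if the line does not match."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "where": m.group("where"),
        "reporter_job": m.group("rep_job"),
        "reporter_rank": int(m.group("rep_rank")),  # process that logged the message
        "claimed_job": m.group("got_job"),
        "claimed_rank": int(m.group("got_rank")),   # identifier it received in the ack
    }


line = (
    "[xserve03.local][[61029,1],4]"
    "[btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] "
    "received unexpected process identifier [[61029,1],3]"
)
info = parse_unexpected_peer(line)
print(info["host"], info["reporter_rank"], info["claimed_rank"])
```

Read this way, the process named [[61029,1],4] on xserve03.local logged the message after a connect ack arrived carrying the name [[61029,1],3].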
Just to be clear, this is not going through torque at this point. It's just vanilla ssh, for which this code worked with 1.1.5.
The question is why is that happening? We use Torque all the time, so we know that the basic support is correct. It -could- be related to lib confusion, but I can't tell for sure.
Working on this...
Can you rebuild OMPI with --enable-debug, and rerun the job with the following added to your cmd line?
-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5