On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
> The reason your job is hanging is sitting in the orte-ps output. You
> have multiple processes declaring themselves to be the same MPI
> rank. That definitely won't work.
It's the "local rank", if that makes any difference...
Any thoughts on this output?
[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
> The question is why is that happening? We use Torque all the time,
> so we know that the basic support is correct. It -could- be related
> to lib confusion, but I can't tell for sure.
Just to be clear, this is not going through Torque at this point. It's
just vanilla ssh, for which this code worked with 1.1.5.
> Can you rebuild OMPI with --enable-debug, and rerun the job with the
> following added to your cmd line?
>
> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
Working on this...
Thanks, Jody