Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] torque pbs behaviour...
From: Klymak Jody (jklymak_at_[hidden])
Date: 2009-08-11 10:23:33


On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:

> Sigh - too early in the morning for this old brain, I fear...
>
> You are right - the ranks are fine, and local rank doesn't matter.
> It sounds like a problem where the TCP messaging is getting a
> message ack'd from someone other than the process that was supposed
> to recv the message. This should cause us to abort, but we were just
> talking on the phone that the abort procedure may not be working
> correctly. Or it could be (as Jeff suggests) that the version
> mismatch is also preventing us from properly aborting too.
>
> So I fear we are back to trying to find these other versions on your
> nodes...

Well, the old version is still on the nodes (in /usr/lib as default
for OS X)...

I can try to clean those all out by hand, but I'm still confused why
the old version would be used - how does Open MPI find the right library?

Note again that I get these MCA warnings on the server when just
running ompi_info, and I *have* cleaned out /usr/lib on the server. So
I really don't understand how I can still have a library issue on the
server. Is there a way to trace at runtime which library an executable
is dynamically linking to? Can I rebuild Open MPI statically?

Thanks, Jody

>
>
> On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody <jklymak_at_[hidden]> wrote:
>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
> The reason your job is hanging is sitting in the orte-ps output. You
> have multiple processes declaring themselves to be the same MPI
> rank. That definitely won't work.
>
> It's the "local rank" if that makes any difference...
>
> Any thoughts on this output?
>
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
>
> The question is why is that happening? We use Torque all the time,
> so we know that the basic support is correct. It -could- be related
> to lib confusion, but I can't tell for sure.
>
> Just to be clear, this is not going through torque at this point.
> It's just vanilla ssh, for which this code worked with 1.1.5.
>
>
>
> Can you rebuild OMPI with --enable-debug, and rerun the job with the
> following added to your cmd line?
>
> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>
> Working on this...
>
> Thanks, Jody
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users