Subject: Re: [OMPI users] torque pbs behaviour...
From: Klymak Jody (jklymak_at_[hidden])
Date: 2009-08-11 10:23:33


On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:

> Sigh - too early in the morning for this old brain, I fear...
>
> You are right - the ranks are fine, and local rank doesn't matter.
> It sounds like a problem where the TCP messaging is getting a
> message ack'd from someone other than the process that was supposed
> to recv the message. This should cause us to abort, but we were just
> talking on the phone that the abort procedure may not be working
> correctly. Or it could be (as Jeff suggests) that the version
> mismatch is also preventing us from aborting properly.
>
> So I fear we are back to trying to find these other versions on your
> nodes...

Well, the old version is still on the nodes (in /usr/lib, the default
location for OS X)...

I can try to clean those out by hand, but I'm still confused about why
the old version would be used - how does Open MPI find the right library?
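
For what it's worth, my understanding is that the OS X dynamic linker
first tries the install name recorded in the binary at link time, and
that DYLD_LIBRARY_PATH / DYLD_FALLBACK_LIBRARY_PATH can override that
lookup - and the fallback path includes /usr/lib by default, so if an
old libmpi is still sitting there it could well win. A quick sanity
check (assuming the library is named libmpi.dylib, as usual):

  echo $DYLD_LIBRARY_PATH $DYLD_FALLBACK_LIBRARY_PATH
  ls -l /usr/lib/libmpi*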

Note again that I get these MCA warnings on the server when just
running ompi_info, and I *have* cleaned out /usr/lib on the server. So
I really don't understand how I can still have a library issue there.
Is there a way to trace at runtime which libraries an executable is
dynamically linking against? Can I rebuild Open MPI statically?
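
For reference, a couple of things I could try (assuming the Xcode
command-line tools are installed on these machines):

  # show which dylibs the executable was linked against
  otool -L `which ompi_info`

  # have the dynamic linker print each library as it loads it
  DYLD_PRINT_LIBRARIES=1 ompi_info > /dev/null

And for a static build, I believe the configure flags would be
something like:

  ./configure --prefix=/opt/openmpi --enable-static --disable-shared
  make all install

(the /opt/openmpi prefix is just an example.)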

Thanks, Jody

>
>
> On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody <jklymak_at_[hidden]> wrote:
>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
> The reason your job is hanging is sitting in the orte-ps output. You
> have multiple processes declaring themselves to be the same MPI
> rank. That definitely won't work.
>
> It's the "local rank", if that makes any difference...
>
> Any thoughts on this output?
>
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
>
> The question is why is that happening? We use Torque all the time,
> so we know that the basic support is correct. It -could- be related
> to lib confusion, but I can't tell for sure.
>
> Just to be clear, this is not going through Torque at this point.
> It's just vanilla ssh, for which this code worked with 1.1.5.
>
>
>
> Can you rebuild OMPI with --enable-debug, and rerun the job with the
> following added to your cmd line?
>
> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>
> Working on this...
>
> Thanks, Jody
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users