Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Hang in collectives involving shared memory
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-06-10 13:50:58


I appreciate the input and have captured it in the ticket. Since this
appears to be a NUMA-related issue, the lack of NUMA support in your setup
makes the test difficult to interpret.

I agree, though, that this is likely something peculiar to our particular
setup. Of primary concern is that it might be related to the relatively old
kernel (2.6.18) on these machines. There has been a lot of change since that
kernel was released, and some of those changes may be relevant to this
problem.

Unfortunately, upgrading the kernel will take a persuasive argument. We are
going to try to run the reproducers on machines with more modern kernels to
see if we get different behavior.

Please feel free to follow this further on the ticket.

Thanks again!
Ralph

On Wed, Jun 10, 2009 at 11:29 AM, Bogdan Costescu <
Bogdan.Costescu_at_[hidden]> wrote:

> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
>> Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
>> you might take a glance at that and offer some thoughts?
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1944
>>
>
> I wasn't able to reproduce this. I have run with the following setup:
> - OS is Scientific Linux 5.1 with a custom compiled kernel based on
> 2.6.22.19, but (due to circumstances that I can't control):
>
> checking if MCA component maffinity:libnuma can compile... no
>
> - Intel compiler 10.1
> - OpenMPI 1.3.2
> - nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX IB
> DDR
>
> I've used the platform file that you have provided, but took out the
> references to PanFS and fixed the paths. I've also used the MCA file that
> you have provided.
>
> I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished
> successfully with m=50 several times. This, together with the earlier post
> also describing a negative result, points to a problem related to your
> particular setup...
>
> --
> Bogdan Costescu
>
> IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
> Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
> E-mail: bogdan.costescu_at_[hidden]