Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in openmpi 1.5.4 in paffinity
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2011-09-06 10:34:56


I think we'll have problems on all machines that have Magny-Cours *and*
cpuset/cgroups restricting the number of available processors. I'm not
sure how common that combination is.

I just checked the hwloc v1.2 branch changelog. Nothing really matters
for OMPI except the patch I sent below (commit v1.2@3767). The first
patch I sent about this bug report (commit v1.2@3772) is not strictly
needed because r3767 hides it.

All other v1.2 commits after 1.2.1 fix things outside of the core hwloc
lib, or bugs that won't happen in OMPI. So doing a 1.2.2 just for OMPI
may be overkill. I think you could just apply r3767 (and maybe r3772)
on top of 1.2.1 in OMPI. I don't know whether that really deserves
an emergency OMPI 1.5.x release.

I was about to write a mail to hwloc-devel about the roadmap. Stay
tuned, we may change our minds about 1.2.2 :)

Brice

On 06/09/2011 16:17, Jeff Squyres wrote:
> Brice --
>
> Should I apply that patch to the OMPI 1.5 series, or should we do a hwloc 1.2.2 release? I.e., is this broken on all AMD/Magny-Cours machines?
>
> Should I also do an emergency OMPI 1.5.x release with (essentially) just this fix? (OMPI 1.5.x currently contains hwloc 1.2.0)
>
>
> On Sep 6, 2011, at 1:43 AM, Brice Goglin wrote:
>
>> On 05/09/2011 21:29, Brice Goglin wrote:
>>> Dear Ake,
>>> Could you try the attached patch? It's not optimized, but it's probably
>>> going in the right direction.
>>> (And don't forget to undo the comment-out mentioned above if you tried it.)
>> Actually, now that I've seen your entire topology, I found out that the
>> real fix is the attached patch. This is actually a Magny-Cours specific
>> problem (having 2 NUMA nodes inside each socket is quite unusual). I've
>> already committed this patch to hwloc trunk and backported to the v1.2
>> branch. It could be applied in OMPI 1.5.5.
>>
>> The patch that I sent earlier is not needed as long as cgroups don't
>> reduce the available memory (your cgroups don't). I'll fix this other
>> bug properly soon.
>>
>> Brice
>>
>> <fix-cgroup-vs-magnycours.patch>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>