Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] wrong core binding by openmpi-1.5.5
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-04-11 07:55:40


Ouch - finally figured out what happened. Jeff and I did indeed address this problem a few weeks ago. There were some changes required in a couple of places to make it all work, so we did the work in a Mercurial branch Jeff set up.

Unfortunately, I think he got distracted by the MPI Forum shortly thereafter, and then got engulfed by other things. The work appears complete, but I can't find a record of it actually being committed to the 1.5 branch. Could be he intended it for 1.6.

I'll have to bug him when he gets back next week to find out what happened and what his plans are. Sorry for the mixup.
Ralph

On Apr 11, 2012, at 3:15 AM, Brice Goglin wrote:

> Here's a better patch. Still only compile tested :)
> Brice
>
>
> On 11/04/2012 10:36, Brice Goglin wrote:
>>
>> A quick look at the code seems to confirm my feeling. get/set_module()
>> callbacks manipulate arrays of logical indexes, and they do not convert
>> them back to physical indexes before binding.
>>
>> Here's a quick patch that may help. Only compile tested...
>>
>> Brice
>>
>>
>>
>> On 11/04/2012 09:49, Brice Goglin wrote:
>>> On 11/04/2012 09:06, tmishima_at_[hidden] wrote:
>>>> Hi, Brice.
>>>>
>>>> I installed the latest hwloc-1.4.1.
>>>> Here is the output of lstopo -p.
>>>>
>>>> [root_at_node03 bin]# ./lstopo -p
>>>> Machine (126GB)
>>>>   Socket P#0 (32GB)
>>>>     NUMANode P#0 (16GB) + L3 (5118KB)
>>>>       L2 (512KB) + L1 (64KB) + Core P#0 + PU P#0
>>>>       L2 (512KB) + L1 (64KB) + Core P#1 + PU P#4
>>>>       L2 (512KB) + L1 (64KB) + Core P#2 + PU P#8
>>>>       L2 (512KB) + L1 (64KB) + Core P#3 + PU P#12
>>> OK, then the cpuset of this NUMA node is 1111 (hexadecimal: bits 0, 4, 8 and 12, matching PU P#0, P#4, P#8 and P#12).
>>>
>>>>> [node03.cluster:21706] [[55518,0],0] odls:default:fork binding child
>>>>> [[55518,1],0] to cpus 1111
>>> So openmpi 1.5.4 is correct.
>>>
>>>>> [node03.cluster:04706] [[40566,0],0] odls:default:fork binding child
>>>>> [[40566,1],0] to cpus 000f
>>> And openmpi 1.5.5 is indeed wrong.
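
The comparison above can be checked by hand. The following is a minimal sketch, assuming the masks in the two log lines are hexadecimal bitmasks in which bit N stands for physical PU N (which is how the thread reads them). The helper name `mask_to_indexes` is hypothetical and is not part of hwloc or Open MPI.

```python
def mask_to_indexes(hex_mask):
    """Return the bit positions that are set in a hexadecimal cpuset mask."""
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Physical PU numbers of NUMANode P#0, copied from the lstopo -p output.
numa0_pus = [0, 4, 8, 12]

# Masks copied from the two odls:default:fork log lines.
for version, mask in (("1.5.4", "1111"), ("1.5.5", "000f")):
    bound = mask_to_indexes(mask)
    verdict = "matches" if bound == numa0_pus else "does not match"
    print(f"openmpi {version}: mask {mask} -> PUs {bound} ({verdict} NUMANode P#0)")
```

Under that reading, 1111 selects PUs 0, 4, 8 and 12, while 000f selects PUs 0, 1, 2 and 3, which are not the PUs listed for NUMANode P#0.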
>>>
>>> Random guess: 000f is the bitmask made of hwloc *logical* indexes. hwloc
>>> cpusets (used for binding) are internally made of hwloc *physical*
>>> indexes (1111 here).
>>>
>>> Jeff, Ralph:
>>> How does OMPI 1.5.5 build hwloc cpusets for binding? Are you doing
>>> bitmap operations on hwloc object cpusets?
>>> If yes, I don't know what's going wrong here.
>>> If no, are you building hwloc cpusets manually by setting individual
>>> bits from object indexes? If yes, you must use *physical* indexes to do so.
>>>
>>> Brice
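
The logical-versus-physical distinction can be illustrated without hwloc. This is only a sketch under stated assumptions: the four PUs of NUMANode P#0 are assumed to have logical indexes 0 to 3 (the lstopo -p output shows only physical numbers), and `LOGICAL_TO_PHYSICAL` and `mask_from_indexes` are hypothetical names, not hwloc or Open MPI APIs. In hwloc itself the two numberings correspond to an object's `logical_index` and `os_index` fields.

```python
# Assumed logical index -> physical PU number for NUMANode P#0.
# The physical numbers come from the lstopo -p output; the logical
# numbering 0..3 is an assumption made for this illustration.
LOGICAL_TO_PHYSICAL = {0: 0, 1: 4, 2: 8, 3: 12}


def mask_from_indexes(indexes):
    """Build an integer bitmask with one bit set per index."""
    mask = 0
    for index in indexes:
        mask |= 1 << index
    return mask


logical_indexes = [0, 1, 2, 3]

# Setting bits directly from logical indexes gives the wrong mask.
wrong = mask_from_indexes(logical_indexes)

# Converting each logical index to its physical number first gives
# the mask that actually covers the NUMA node.
right = mask_from_indexes(LOGICAL_TO_PHYSICAL[i] for i in logical_indexes)

print(f"from logical indexes:  {wrong:04x}")
print(f"from physical indexes: {right:04x}")
```

With these inputs the first mask is 000f and the second is 1111, the same pair of values seen in the 1.5.5 and 1.5.4 log lines.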
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> <try2.patch>