Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
From: Camille Coti (coti_at_[hidden])
Date: 2008-08-22 10:00:53


Actually, I have tried several versions since you started working on
the affinity code. I tried revision 19103 a couple of weeks ago, and
the problem was already there.

Part of /proc/cpuinfo is below:
processor : 0
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 900.000000
itc MHz : 900.000000
BogoMIPS : 1325.40
siblings : 1

The machine is a 60-way Altix, so this information is repeated 60
times in /proc/cpuinfo (yes, 60, not 64).
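
In case it helps, a quick way to see whether the configured and online
processor counts disagree - which would point to exactly the kind of
"holes" discussed below - is a tiny standalone check like this one
(just a sketch using glibc's sysconf() extensions, nothing from the
OMPI tree):

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* processors the kernel has configured vs. processors currently online */
      long conf = sysconf(_SC_NPROCESSORS_CONF);
      long onln = sysconf(_SC_NPROCESSORS_ONLN);

      printf("configured: %ld, online: %ld\n", conf, onln);
      if (conf != onln)
          printf("gap: some processor IDs have no processor behind them\n");
      return 0;
  }

A mismatch such as "configured: 64, online: 60" would be consistent
with the sparse numbering described below.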

Camille

Ralph Castain wrote:
> I believe I have found the problem, and it may indeed relate to the
> change in paffinity. By any chance, do you have unfilled sockets on that
> machine? Could you provide the output from something like "cat
> /proc/cpuinfo" (or the equiv for your system) so we could see what
> physical processors and sockets are present?
>
> If I'm correct about the problem, here is the issue. OMPI has (until
> now) always assumed that the number of logical processors (or sockets,
> or cores) is the same as the number of physical ones. As a result,
> several key subsystems were written without making any distinction as
> to which (logical vs physical) they were referring to. This was no
> problem until we recently encountered systems with "holes" in their
> topology - a processor turned off, an unpopulated socket, etc.
>
> On such a system, the logical processor id no longer matches the
> physical processor id (ditto for sockets and cores). We adjusted the
> paffinity subsystem to deal with it - it took much more effort than we
> would have liked, and exposed lots of inconsistencies in how the
> underlying operating systems handle such situations.
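>
> As a concrete illustration (a minimal standalone sketch, not OMPI
> code, and processor 63 below is just a made-up example of a "hole"):
> asking the kernel to bind to a processor id that is configured but
> not actually present simply fails, which is why the translation from
> logical to physical ids matters.
>
>   #define _GNU_SOURCE
>   #include <sched.h>
>   #include <errno.h>
>   #include <stdio.h>
>   #include <string.h>
>
>   int main(void)
>   {
>       cpu_set_t mask;
>       CPU_ZERO(&mask);
>       CPU_SET(63, &mask);  /* hypothetical hole: an id with no processor behind it */
>
>       /* binding the calling process to a non-existent processor fails (EINVAL) */
>       if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
>           printf("bind to processor 63 failed: %s\n", strerror(errno));
>       else
>           printf("bound to processor 63\n");
>       return 0;
>   }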
>
> Unfortunately, having gotten that straightened out, it is possible that
> you have uncovered a similar inconsistency in logical vs physical in
> another subsystem. I have asked better eyes than mine to take a look at
> that now to confirm - if so, it could take us a little while to fix.
>
> My request for info was aimed at helping us determine why your system
> is seeing this problem when our tests didn't. We have tested the revised
> paffinity on both completely filled and on at least one system with
> "holes", but differences in OS levels, processor types, etc could have
> caused our tests to pass while your system fails. I'm particularly
> suspicious of the old kernel you are running and how our revised code
> will handle it.
>
> For now, I would suggest you work with revisions lower than r19391 -
> could you please confirm that r19390 or earlier works?
>
> Thanks
> Ralph
>
> On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:
>
>>
>> OK, thank you!
>>
>> Camille
>>
>> Ralph Castain wrote:
>>> Okay, I'll look into it. I suspect the problem is due to the
>>> redefinition of the paffinity API to clarify physical vs logical
>>> processors - more than likely, the maffinity interface suffers from
>>> the same problem we had to correct over there.
>>> We'll report back later with an estimate of how quickly this can be
>>> fixed.
>>> Thanks
>>> Ralph
>>> On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:
>>>>
>>>> Ralph,
>>>>
>>>> I compiled a clean checkout of the trunk (r19392), and the problem
>>>> is still the same.
>>>>
>>>> Camille
>>>>
>>>>
>>>> Ralph Castain wrote:
>>>>> Hi Camille
>>>>> What OMPI version are you using? We just changed the paffinity
>>>>> module last night, but did nothing to maffinity. However, it is
>>>>> possible that the maffinity framework makes some calls into
>>>>> paffinity that need to be adjusted.
>>>>> So version number would help a great deal in this case.
>>>>> Thanks
>>>>> Ralph
>>>>> On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to run applications on a shared-memory machine. For
>>>>>> the moment I am just trying to run tests on point-to-point
>>>>>> communications (a trivial token ring) and collective operations
>>>>>> (from the SkaMPI test suite).
>>>>>>
>>>>>> It runs smoothly if mpi_paffinity_alone is set to 0. For more than
>>>>>> about 10 processes, collective communications just don't seem
>>>>>> possible; point-to-point communications seem to be OK.
>>>>>>
>>>>>> But when I specify --mca mpi_paffinity_alone 1 in my command
>>>>>> line, I get the following error:
>>>>>>
>>>>>> mbind: Invalid argument
>>>>>>
>>>>>> I looked into the code of maffinity/libnuma, and found out the
>>>>>> error comes from
>>>>>>
>>>>>> numa_setlocal_memory(segments[i].mbs_start_addr,
>>>>>> segments[i].mbs_len);
>>>>>>
>>>>>> in maffinity_libnuma_module.c.
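>>>>>>
>>>>>> One way to check whether this is specific to Open MPI might be a
>>>>>> small standalone test that calls the same libnuma entry point
>>>>>> directly (this is only a sketch, and the buffer size is arbitrary):
>>>>>>
>>>>>>   #include <numa.h>
>>>>>>   #include <stdio.h>
>>>>>>   #include <stdlib.h>
>>>>>>   #include <unistd.h>
>>>>>>
>>>>>>   int main(void)
>>>>>>   {
>>>>>>       if (numa_available() < 0) {
>>>>>>           fprintf(stderr, "libnuma: NUMA not available here\n");
>>>>>>           return 1;
>>>>>>       }
>>>>>>
>>>>>>       long page = sysconf(_SC_PAGESIZE);
>>>>>>       size_t len = 4 * (size_t)page;
>>>>>>       void *buf = NULL;
>>>>>>
>>>>>>       /* mbind() only accepts page-aligned ranges, so align the buffer */
>>>>>>       if (posix_memalign(&buf, (size_t)page, len) != 0) {
>>>>>>           perror("posix_memalign");
>>>>>>           return 1;
>>>>>>       }
>>>>>>
>>>>>>       /* same call as in maffinity_libnuma_module.c; if it fails,
>>>>>>          libnuma's default error handler prints something like
>>>>>>          "mbind: Invalid argument" */
>>>>>>       numa_setlocal_memory(buf, len);
>>>>>>
>>>>>>       printf("numa_setlocal_memory() completed\n");
>>>>>>       free(buf);
>>>>>>       return 0;
>>>>>>   }
>>>>>>
>>>>>> (built with "gcc test.c -lnuma")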
>>>>>>
>>>>>> The machine I am using is a Linux box running a 2.6.5-7 kernel.
>>>>>>
>>>>>> Has anyone experienced a similar problem?
>>>>>>
>>>>>> Camille
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users