
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem when mpi_paffinity_alone is set to 1
From: Camille Coti (coti_at_[hidden])
Date: 2008-08-22 11:47:04


inria@behemoth:~$ uname -a
Linux behemoth 2.6.5-7.283-sn2 #1 SMP Wed Nov 29 16:55:53 UTC 2006 ia64
ia64 ia64 GNU/Linux

I am not sure the output of plpa-info --topo gives good news...

inria@behemoth:~$ plpa-info --topo
Kernel affinity support: yes
Kernel topology support: no
Number of processor sockets: unknown
Kernel topology not supported -- cannot show topology information
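
For reference, the "Kernel affinity support: yes" line reflects whether
sched_getaffinity()-style calls work at all on this kernel. A minimal
sketch of that kind of probe (an illustration only, not PLPA's actual
implementation):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int i, count = 0;

    CPU_ZERO(&mask);
    /* Ask the kernel for the calling process's affinity mask. */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    /* Count the processors present in the mask. */
    for (i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &mask))
            count++;
    }
    printf("kernel affinity support: yes (%d CPUs in mask)\n", count);
    return 0;
}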

Camille

Jeff Squyres wrote:
> Camille --
>
> Can you also send the output of "uname -a"?
>
> Also, just to be absolutely sure, let's check that PLPA is doing the
> Right thing here (we don't think this is the problem, but it's worth
> checking). Grab the latest beta:
>
> http://www.open-mpi.org/software/plpa/v1.2/
>
> It's a very small package and easy to install under your $HOME (or
> whatever).
>
> Can you send the output of "plpa-info --topo"?
>
>
>
> On Aug 22, 2008, at 7:00 AM, Camille Coti wrote:
>
>>
>> Actually, I have tried with several versions, since you were working
>> on the affinity thing. I tried revision 19103 a couple of weeks ago,
>> and the problem was already there.
>>
>> Part of /proc/cpuinfo is below:
>> processor : 0
>> vendor : GenuineIntel
>> arch : IA-64
>> family : Itanium 2
>> model : 0
>> revision : 7
>> archrev : 0
>> features : branchlong
>> cpu number : 0
>> cpu regs : 4
>> cpu MHz : 900.000000
>> itc MHz : 900.000000
>> BogoMIPS : 1325.40
>> siblings : 1
>>
>> The machine is a 60-way Altix, so this block is repeated 60 times in
>> /proc/cpuinfo (yes, 60, not 64).
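
A quick way to confirm that count without scrolling through the file; a
sketch only, assuming the Linux cpuinfo format shown above:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int count = 0;

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    /* Each CPU contributes one "processor : N" line. */
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "processor", 9) == 0)
            count++;
    }
    fclose(f);
    printf("processor entries: %d\n", count);
    return 0;
}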
>>
>> Camille
>>
>>
>>
>> Ralph Castain wrote:
>>> I believe I have found the problem, and it may indeed relate to the
>>> change in paffinity. By any chance, do you have unfilled sockets on
>>> that machine? Could you provide the output from something like "cat
>>> /proc/cpuinfo" (or the equiv for your system) so we could see what
>>> physical processors and sockets are present?
>>> If I'm correct as to the problem, here is the issue. OMPI has (until
>>> now) always assumed that the number of logical processors (or
>>> sockets, or cores) was the same as the number of physical ones. As a
>>> result, several key subsystems were written without making any
>>> distinction as to which (logical vs. physical) they were referring
>>> to. This was no problem until we recently encountered systems with
>>> "holes" in their topology - a processor turned "off", a socket left
>>> unpopulated, etc.
>>> In this case, the logical processor id no longer matches the
>>> physical processor id (ditto for sockets and cores). We adjusted the
>>> paffinity subsystem to deal with this - it took much more effort
>>> than we would have liked, and it exposed many inconsistencies in how
>>> the underlying operating systems handle such situations.
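
To make that distinction concrete: on a machine with "holes", the Nth
usable processor is generally not physical processor N. A sketch
(illustration only, not the actual OMPI/PLPA code) that builds such a
logical-to-physical map from the process's affinity mask:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int phys_of_logical[CPU_SETSIZE];
    int p, i, n = 0;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    /* Walk the mask; each set bit is a usable physical processor.
     * Logical index n maps to physical id p; on a machine with
     * holes, n and p differ. */
    for (p = 0; p < CPU_SETSIZE; p++) {
        if (CPU_ISSET(p, &mask))
            phys_of_logical[n++] = p;
    }
    for (i = 0; i < n; i++)
        printf("logical %d -> physical %d\n", i, phys_of_logical[i]);
    return 0;
}
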
>>> Unfortunately, having gotten that straightened out, it is possible
>>> that you have uncovered a similar inconsistency in logical vs
>>> physical in another subsystem. I have asked better eyes than mine to
>>> take a look at that now to confirm - if so, it could take us a little
>>> while to fix.
>>> My request for info was aimed at helping us determine why your
>>> system sees this problem when our tests did not. We have tested the
>>> revised paffinity both on completely filled systems and on at least
>>> one system with "holes", but differences in OS levels, processor
>>> types, etc. could have caused our tests to pass while your system
>>> fails. I'm particularly suspicious of the old kernel you are
>>> running and of how our revised code will handle it.
>>> For now, I would suggest you work with revisions lower than r19391 -
>>> could you please confirm that r19390 or earlier works?
>>> Thanks
>>> Ralph
>>> On Aug 22, 2008, at 7:21 AM, Camille Coti wrote:
>>>>
>>>> OK, thank you!
>>>>
>>>> Camille
>>>>
>>>>> Ralph Castain wrote:
>>>>> Okay, I'll look into it. I suspect the problem is due to the
>>>>> redefinition of the paffinity API to clarify physical vs logical
>>>>> processors - more than likely, the maffinity interface suffers from
>>>>> the same problem we had to correct over there.
>>>>> We'll report back later with an estimate of how quickly this can be
>>>>> fixed.
>>>>> Thanks
>>>>> Ralph
>>>>> On Aug 22, 2008, at 7:03 AM, Camille Coti wrote:
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> I compiled a clean checkout from the trunk (r19392), and the
>>>>>> problem is still the same.
>>>>>>
>>>>>> Camille
>>>>>>
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Hi Camille
>>>>>>> What OMPI version are you using? We just changed the paffinity
>>>>>>> module last night, but did nothing to maffinity. However, it is
>>>>>>> possible that the maffinity framework makes some calls into
>>>>>>> paffinity that need to be adjusted.
>>>>>>> So the version number would help a great deal in this case.
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>> On Aug 22, 2008, at 5:23 AM, Camille Coti wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I am trying to run applications on a shared-memory machine. For
>>>>>>>> the moment I am just trying to run tests on point-to-point
>>>>>>>> communications (a trivial token ring) and collective operations
>>>>>>>> (from the SkaMPI test suite).
>>>>>>>>
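
As an aside, the kind of trivial token ring described above might look
like the following sketch (not the actual test code): each rank
receives an integer token from its left neighbor and passes it to its
right neighbor.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;  /* arbitrary token value */
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                 MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token made it around the ring: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                 MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with something like "mpirun --mca mpi_paffinity_alone 1 -np 16
./ring" to exercise the affinity path.
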
>>>>>>>> It runs smoothly if mpi_paffinity_alone is set to 0. For more
>>>>>>>> than about 10 processes, global communications just don't seem
>>>>>>>> possible. Point-to-point communications seem to be OK.
>>>>>>>>
>>>>>>>> But when I specify --mca mpi_paffinity_alone 1 in my command
>>>>>>>> line, I get the following error:
>>>>>>>>
>>>>>>>> mbind: Invalid argument
>>>>>>>>
>>>>>>>> I looked into the code of maffinity/libnuma, and found that
>>>>>>>> the error comes from
>>>>>>>>
>>>>>>>> numa_setlocal_memory(segments[i].mbs_start_addr,
>>>>>>>> segments[i].mbs_len);
>>>>>>>>
>>>>>>>> in maffinity_libnuma_module.c.
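
A minimal standalone way to exercise that call outside of OMPI,
sketched assuming libnuma and its headers are installed (link with
-lnuma); if it prints the same "mbind: Invalid argument", the problem
sits between libnuma and this kernel rather than in OMPI:

#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1024 * 1024;
    void *buf;

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }
    /* Use mmap so the region is page-aligned, as OMPI's shared
     * memory segments are. */
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* The call used in maffinity_libnuma_module.c; libnuma itself
     * prints "mbind: ..." if the underlying mbind() fails. */
    numa_setlocal_memory(buf, len);
    printf("numa_setlocal_memory() completed\n");
    munmap(buf, len);
    return 0;
}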
>>>>>>>>
>>>>>>>> The machine I am using is a Linux box running a 2.6.5-7 kernel.
>>>>>>>>
>>>>>>>> Has anyone experienced a similar problem?
>>>>>>>>
>>>>>>>> Camille