Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-04-23 10:23:21


On 23/04/2012 16:13, Vlad wrote:
> This one seems fine, too.
>
> Note that it should always be possible to read at least the current
> thread's /proc data.

This code also works when the task reading the cpubinding/location is
not part of the process it looks at.

Brice

> In my workaround, should I run out of retries I default to
> hwloc_get_last_cpu_location(... HWLOC_CPUBIND_THREAD) -- since
> presumably that can't fail and the result is technically valid given
> hwloc_get_last_cpu_location() semantics (it reads state that's
> inherently transient).
>
> On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:
>
>> On 21/04/2012 23:36, Vlad wrote:
>>>
>>>
>>> On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
>>>
>>>> On 21/04/2012 23:08, Vlad wrote:
>>>>> Greetings,
>>>>>
>>>>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible
>>>>> concurrency issue not covered by the "Thread Safety" guidelines:
>>>>>
>>>>> - I start a small number (4) of threads, each of which does some
>>>>> work and periodically executes hwloc_get_last_cpu_location() with
>>>>> HWLOC_CPUBIND_PROCESS
>>>>> - occasionally, one or two of those threads will see the call fail
>>>>> with ENOSYS (even though the same call has already executed
>>>>> successfully a number of times)
>>>>>
>>>>> These errors are transient and seem to occur only when some of the
>>>>> threads in the group are terminating. I've skimmed through the
>>>>> implementation in topology-linux.c and it seems plausible to me
>>>>> that the errors could be caused by failure to read /proc state
>>>>> "atomically" in the presence of concurrent thread starts/exits.
>>>>>
>>>>> Of course, the latter is hard (impossible?) to do because the
>>>>> state always changes and a snapshot can only be obtained with a
>>>>> single read() (which in turn would require knowing how many thread
>>>>> entries to expect in advance). However, returning ENOSYS in such
>>>>> cases does not seem intended; it looks more like a flaw in the
>>>>> retry logic. Similar issues may be present with other API methods
>>>>> that rely on hwloc_linux_foreach_proc_tid() or
>>>>> hwloc_linux_get_proc_tids().
>>>>
>>>> Can you try the attached patch? It doesn't abort the loop
>>>> immediately on per-tid errors anymore. This may work better when
>>>> threads disappear. I don't remember if the retry logic was written
>>>> while thinking about adding threads only or about adding and
>>>> removing threads.
>>>>
>>>> If the patch doesn't help, can you send your code to help debug things?
>>>
>>> Will try this within a day or two. At the moment I am simply using a
>>> retry loop on ENOSYS and usually no more than one retry is needed.
>>>
>>
>> Here's a possibly better patch. It lets the retry logic happen before
>> checking whether we should return ENOSYS and friends.
>>
>> Brice
>>
>> <fix_tids.patch>
>> _______________________________________________
>> hwloc-users mailing list
>> hwloc-users_at_[hidden] <mailto:hwloc-users_at_[hidden]>
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
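
For illustration only, the workaround Vlad describes in the thread (retry the process-wide query while it fails with ENOSYS, then fall back to the per-thread query once the retries run out) might be sketched roughly as below. This is not code from the thread or from hwloc. The callback type, the enum, and all function names are hypothetical, and the two stub queries only simulate the behaviour so the sketch can run without hwloc installed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*location_query_fn)(void *ctx);

/* Hypothetical result codes for the sketch. */
typedef enum {
    QUERY_PROCESS_OK,       /* process-wide query succeeded */
    QUERY_THREAD_FALLBACK,  /* retries exhausted, per-thread query used */
    QUERY_FAILED            /* non-ENOSYS error, or the fallback failed too */
} query_result;

/*
 * Try the process-wide query up to max_retries + 1 times, retrying only
 * on ENOSYS (the transient error reported in the thread). If every
 * attempt fails with ENOSYS, fall back to the per-thread query.
 */
static query_result query_with_retry(location_query_fn process_query,
                                     location_query_fn thread_query,
                                     void *ctx, int max_retries)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        errno = 0;
        if (process_query(ctx) == 0)
            return QUERY_PROCESS_OK;
        if (errno != ENOSYS)
            return QUERY_FAILED;  /* not the transient case: do not retry */
    }
    return thread_query(ctx) == 0 ? QUERY_THREAD_FALLBACK : QUERY_FAILED;
}

/* Simulated process-wide query: fails with ENOSYS while *ctx > 0. */
static int flaky_process_query(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}

/* Simulated per-thread query that always succeeds. */
static int thread_query_ok(void *ctx)
{
    (void)ctx;
    return 0;
}
```

In real use, the two callbacks would presumably wrap hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS and HWLOC_CPUBIND_THREAD respectively, with the topology and output set passed through ctx. The retry count is a free choice; the thread only reports that one retry is usually enough.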