Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-04-23 07:53:00


On 21/04/2012 23:36, Vlad wrote:
>
>
> On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
>
>> On 21/04/2012 23:08, Vlad wrote:
>>> Greetings,
>>>
>>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible
>>> concurrency issue not covered by the "Thread Safety" guidelines:
>>>
>>> - I start a small number (4) of threads, each of which does some
>>> work and periodically executes hwloc_get_last_cpu_location() with
>>> HWLOC_CPUBIND_PROCESS
>>> - occasionally, one or two of those threads will see the call fail
>>> with ENOSYS (even though the same call has already executed
>>> successfully a number of times)
>>>
>>> These errors are transient and seem to occur only when some of the
>>> threads in the group are terminating. I've skimmed through the
>>> implementation in topology-linux.c and it seems plausible to me that
>>> the errors could be caused by failure to read /proc state
>>> "atomically" in the presence of concurrent thread starts/exits.
>>>
>>> Of course, the latter is hard (impossible ?) to do because the state
>>> always changes and a snapshot can only be obtained with a single
>>> read() (which in turn would require knowing how many thread entries
>>> to expect in advance). However, returning ENOSYS in such cases does
>>> not seem intended but rather a flaw in the retry logic. Similar issues
>>> may be present with other API methods that rely on
>>> hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
>>
>> Can you try the attached patch? It doesn't abort the loop immediately
>> on per-tid errors anymore. This may work better when threads
>> disappear. I don't remember if the retry logic was written while
>> thinking about adding threads only or about adding and removing threads.
>>
>> If the patch doesn't help, can you send your code to help debug things?
>
> Will try this within a day or two. At the moment I am simply using a
> retry loop on ENOSYS and usually no more than one retry is needed.
>
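
The caller-side workaround mentioned above (retrying when the call fails
with ENOSYS) could look roughly like the sketch below. It is not Vlad's
actual code, which was not posted. The names retry_on_enosys, query_fn,
flaky_state and flaky_query are hypothetical; flaky_query only simulates
a call that fails transiently, so the sketch runs without hwloc.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type: returns 0 on success, or -1 with errno set. */
typedef int (*query_fn)(void *arg);

/* Call fn up to max_tries times, retrying only while it fails with ENOSYS.
 * Returns the result of the last call; errno is left as set by that call. */
int retry_on_enosys(query_fn fn, void *arg, int max_tries)
{
    int rc = -1;

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        errno = 0;
        rc = fn(arg);
        if (rc == 0 || errno != ENOSYS)
            break;              /* success, or an error we should not retry */
    }
    return rc;
}

/* Hypothetical stand-in for a query that fails transiently. */
struct flaky_state {
    int failures_left;          /* how many more calls will fail with ENOSYS */
    int calls;                  /* total number of calls made so far */
};

int flaky_query(void *arg)
{
    struct flaky_state *st = arg;

    st->calls++;
    if (st->failures_left > 0) {
        st->failures_left--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}
```

In a real program the callback would wrap
hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_PROCESS). A
small retry bound seems sufficient here, since usually no more than one
retry is needed.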

Here's a possibly better patch. It lets the retry logic happen before
checking whether we should return ENOSYS and friends.

Brice
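
For readers without the attachment: the general idea of retrying until
the thread list stops changing can be sketched as below. This is not the
patch itself and not hwloc's implementation. The names read_tids,
stable_tids and MAX_TIDS are hypothetical, and the sketch assumes Linux
with /proc mounted.

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TIDS 1024           /* arbitrary bound chosen for this sketch */

/* Read the thread IDs currently listed in /proc/self/task.
 * Returns the number of tids stored (at most max), or -1 on error. */
int read_tids(pid_t *tids, int max)
{
    DIR *dir = opendir("/proc/self/task");
    struct dirent *ent;
    int n = 0;

    if (!dir)
        return -1;
    while (n < max && (ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        tids[n++] = (pid_t) atoi(ent->d_name);
    }
    closedir(dir);
    return n;
}

static int cmp_tid(const void *a, const void *b)
{
    pid_t x = *(const pid_t *) a, y = *(const pid_t *) b;
    return (x > y) - (x < y);
}

/* Re-read the tid list until two consecutive sorted snapshots are
 * identical, or give up after max_tries comparisons.
 * Returns the number of tids in the stable snapshot, or -1. */
int stable_tids(pid_t *tids, int max, int max_tries)
{
    pid_t again[MAX_TIDS];
    int n, m;

    assert(max > 0 && max <= MAX_TIDS && max_tries > 0);
    n = read_tids(tids, max);
    while (n >= 0 && max_tries-- > 0) {
        qsort(tids, (size_t) n, sizeof(*tids), cmp_tid);
        /* Per-thread work would go here. A failure on a tid that has
         * just exited should lead to another pass, not an early error. */
        m = read_tids(again, max);
        if (m < 0)
            return -1;
        qsort(again, (size_t) m, sizeof(*again), cmp_tid);
        if (m == n && memcmp(tids, again, (size_t) n * sizeof(*tids)) == 0)
            return n;           /* the list did not change between reads */
        memcpy(tids, again, (size_t) m * sizeof(*again));
        n = m;
    }
    return -1;                  /* never saw two matching snapshots */
}
```

The point of the ordering is that an error is reported only after the
retry budget is exhausted, so a thread that exits between two reads
costs one extra pass instead of a spurious failure.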