Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Vlad (vlad_at_[hidden])
Date: 2012-04-23 09:56:42


Ah. I've just tested the first patch. It worked.

On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:

> On 21/04/2012 23:36, Vlad wrote:
>>
>> On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
>>
>>> On 21/04/2012 23:08, Vlad wrote:
>>>>
>>>> Greetings,
>>>>
>>>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency issue not covered by the "Thread Safety" guidelines:
>>>>
>>>> - I start a small number (4) of threads, each of which does some work and periodically executes hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
>>>> - occasionally, one or two of those threads will see the call fail with ENOSYS (even though the same call has already executed successfully a number of times)
>>>>
>>>> These errors are transient and seem to occur only when some of the threads in the group are terminating. I've skimmed through the implementation in topology-linux.c and it seems plausible to me that the errors could be caused by failure to read /proc state "atomically" in the presence of concurrent thread starts/exits.
>>>>
>>>> Of course, the latter is hard (impossible?) to do because the state is always changing and a snapshot can only be obtained with a single read() (which in turn would require knowing in advance how many thread entries to expect). However, returning ENOSYS in such cases does not seem intended; it looks more like a flaw in the retry logic. Similar issues may be present in other API functions that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
>>>
>>> Can you try the attached patch? It doesn't abort the loop immediately on per-tid errors anymore. This may work better when threads disappear. I don't remember if the retry logic was written while thinking about adding threads only or about adding and removing threads.
>>>
>>> If the patch doesn't help, can you send your code to help debug things?
>>
>> Will try this within a day or two. At the moment I am simply using a retry loop on ENOSYS and usually no more than one retry is needed.
>>
>
> Here's a possibly better patch. It lets the retry logic happen before checking whether we should return ENOSYS and friends.
>
> Brice
>
> <fix_tids.patch>
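
For reference, the caller-side workaround mentioned in the thread (a retry loop on ENOSYS) could look roughly like the sketch below. It is not code from the thread or from hwloc: `query_fn`, `retry_on_enosys` and `flaky_query` are hypothetical names, and `flaky_query` only simulates a call that fails transiently so that the example stands alone without linking against hwloc.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*query_fn)(void *ctx);

/*
 * Call op(ctx) up to max_attempts times, retrying only while it fails
 * with ENOSYS. Any other error is returned immediately. The return
 * value and errno are those of the last attempt.
 */
static int retry_on_enosys(query_fn op, void *ctx, int max_attempts)
{
    int rc = -1;

    assert(op != NULL);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        errno = 0;
        rc = op(ctx);
        if (rc == 0 || errno != ENOSYS)
            break; /* success, or an error that is not the transient one */
    }
    return rc;
}

/*
 * Stand-in for the real query (assumption, for demonstration only):
 * fails with ENOSYS while *remaining_failures > 0, then succeeds.
 */
static int flaky_query(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}
```

In a real program the callback would wrap hwloc_get_last_cpu_location(topology, set, HWLOC_CPUBIND_PROCESS) and pass the topology and cpuset through `ctx`. Bounding the number of attempts keeps a genuinely unsupported call from looping forever; the thread reports that one retry was usually enough.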