This one seems fine, too.
Note that it should always be possible to read at least the current thread's /proc data. In my workaround, if I run out of retries I fall back to hwloc_get_last_cpu_location(... HWLOC_CPUBIND_THREAD), since presumably that can't fail and the result is still technically valid given hwloc_get_last_cpu_location() semantics (it reads state that's inherently transient).
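
For reference, a minimal sketch of that retry-then-fallback workaround. It is written against a hypothetical callback type rather than hwloc itself, so it compiles on its own. In real use the primary callback would presumably wrap hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS and the fallback would wrap the same call with HWLOC_CPUBIND_THREAD. The names location_query_fn, query_with_retry, flaky_query and always_ok are made up for illustration and are not part of the hwloc API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type (not an hwloc type): returns 0 on success,
 * or -1 with errno set on failure, like hwloc_get_last_cpu_location(). */
typedef int (*location_query_fn)(void *ctx);

/* Call `primary`, retrying while it fails with ENOSYS, for at most
 * `max_retries` retries after the first attempt. If every attempt fails
 * with ENOSYS, call `fallback` once and return its result.
 * Any other error from `primary` is returned immediately, not masked.
 * If `used_fallback` is non-NULL it is set to 1 when the fallback ran. */
static int query_with_retry(location_query_fn primary,
                            location_query_fn fallback,
                            void *ctx, int max_retries,
                            int *used_fallback)
{
    if (used_fallback != NULL)
        *used_fallback = 0;

    for (int attempt = 0; attempt <= max_retries; attempt++) {
        errno = 0;
        if (primary(ctx) == 0)
            return 0;
        if (errno != ENOSYS)
            return -1;
    }

    if (used_fallback != NULL)
        *used_fallback = 1;
    return fallback(ctx);
}

/* Simulated query, for demonstration only: fails with ENOSYS until the
 * counter pointed to by ctx reaches zero, then succeeds. */
static int flaky_query(void *ctx)
{
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        errno = ENOSYS;
        return -1;
    }
    return 0;
}

/* Simulated fallback that always succeeds (assumption: the per-thread
 * query cannot fail, as argued above). */
static int always_ok(void *ctx)
{
    (void)ctx;
    return 0;
}
```

With a counter of 1 and max_retries of 3, the primary query succeeds on the second attempt and the fallback is never used. With a counter larger than 4, all four primary attempts fail with ENOSYS and the fallback result is returned instead.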
On Apr 23, 2012, at 7:53 AM, Brice Goglin wrote:
On 21/04/2012 23:36, Vlad wrote:
On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:
On 21/04/2012 23:08,
I use hwloc-1.4.1 stable on Red Hat 5 and
am seeing a possible concurrency issue not covered by
the "Thread Safety" guidelines:
- I start a small number (4) of threads, each of
which does some work and periodically executes
hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
- occasionally, one or two of those threads will see
the call fail with ENOSYS (even though the same call has
already executed successfully a number of times)
These errors are transient and seem to occur only
when some of the threads in the group are terminating.
I've skimmed through the implementation in
topology-linux.c and it seems plausible to me that the
errors could be caused by failure to read /proc state
"atomically" in the presence of concurrent thread
Of course, the latter is hard (impossible ?) to do
because the state always changes and a snapshot can only
be obtained with a single read() (which in turn would
require knowing how many thread entries to expect in
advance). However, returning ENOSYS in such cases does
not seem intended; it looks more like a flaw in the retry logic.
Similar issues may be present with other API methods
that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
Can you try the attached patch? It doesn't abort the loop
immediately on per-tid errors anymore. This may work better
when threads disappear. I don't remember if the retry logic
was written while thinking about adding threads only or
about adding and removing threads.
If the patch doesn't help, can you send your code to help
Will try this within a day or two. At the moment I am
simply using a retry loop on ENOSYS and usually no more than
one retry is needed.
Here's a possibly better patch. It lets the retry logic happen
before checking whether we should return ENOSYS and friends.