On 21/04/2012
23:08, Vlad wrote:
Greetings,
I use
hwloc-1.4.1 stable on Red Hat 5 and am seeing
a possible concurrency issue not covered by
the "Thread Safety" guidelines:
- I start a small number (4) of
threads, each of which does some work and
periodically executes
hwloc_get_last_cpu_location() with
HWLOC_CPUBIND_PROCESS
- occasionally, one or two of those threads
will see the call fail with ENOSYS (even
though the same call has already executed
successfully a number of times)
These errors are transient and seem to
occur only when some of the threads in the
group are terminating. I've skimmed through
the implementation in topology-linux.c and it
seems plausible to me that the errors could be
caused by failure to read /proc state
"atomically" in the presence of concurrent
thread starts/exits.
Of course, the latter is hard (impossible
?) to do because the state always changes and
a snapshot can only be obtained with a single
read() (which in turn would require knowing
how many thread entries to expect in advance).
However, returning ENOSYS in such cases does
not seems intended but rather a flaw in retry
logic. Similar issues may be present with
other API methods that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
Can you try the attached patch? It doesn't abort
the loop immediately on per-tid errors anymore.
This may work better when threads disappear. I
don't remember if the retry logic was written
while thinking about adding threads only or about
adding and removing threads.
If the patch doesn't help, can you send your code
to help debug things?