Hardware Locality Users' Mailing List Archives

Subject: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Vlad (vlad_at_[hidden])
Date: 2012-04-21 17:08:44


Greetings,

        I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency issue not covered by the "Thread Safety" guidelines:

- I start a small number (4) of threads, each of which does some work and periodically executes hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
- occasionally, one or two of those threads will see the call fail with ENOSYS (even though the same call has already executed successfully a number of times)

These errors are transient and seem to occur only when some of the threads in the group are terminating. I've skimmed through the implementation in topology-linux.c and it seems plausible to me that the errors could be caused by failure to read /proc state "atomically" in the presence of concurrent thread starts/exits.

Of course, the latter is hard (impossible?) to do because the state is always changing and a snapshot can only be obtained with a single read() (which in turn would require knowing in advance how many thread entries to expect). However, returning ENOSYS in such cases does not seem intended; it looks more like a flaw in the retry logic. Similar issues may be present in other API methods that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
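
To make the snapshot problem concrete, here is a minimal sketch of one way to get a "stable enough" list of thread IDs: rescan the task directory until two consecutive scans agree, and report a retryable error if they never do. This is not hwloc's code. The names read_tids() and stable_tids(), the max_tries parameter and the choice of EAGAIN as the final error are all assumptions made for illustration.

```c
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: collect the numeric entries of a task directory
 * (e.g. /proc/self/task) into a malloc'd array. Returns the number of
 * entries, or -1 with errno set. */
static int read_tids(const char *path, pid_t **out)
{
    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    size_t cap = 16, n = 0;
    pid_t *tids = malloc(cap * sizeof(*tids));
    if (!tids) {
        closedir(dir);
        errno = ENOMEM;
        return -1;
    }

    for (;;) {
        errno = 0;                      /* lets us tell EOF from error */
        struct dirent *ent = readdir(dir);
        if (!ent)
            break;
        char *end;
        long tid = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue;                   /* skips "." and ".." */
        if (n == cap) {
            pid_t *tmp = realloc(tids, 2 * cap * sizeof(*tids));
            if (!tmp) {
                free(tids);
                closedir(dir);
                errno = ENOMEM;
                return -1;
            }
            tids = tmp;
            cap *= 2;
        }
        tids[n++] = (pid_t)tid;
    }

    int err = errno;                    /* nonzero only if readdir failed */
    closedir(dir);
    if (err) {
        free(tids);
        errno = err;
        return -1;
    }
    *out = tids;
    return (int)n;
}

static int cmp_tid(const void *a, const void *b)
{
    pid_t x = *(const pid_t *)a, y = *(const pid_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical: rescan until two consecutive scans list the same TIDs.
 * On success returns the count and stores a malloc'd sorted array in *out.
 * If the list keeps changing, fails with EAGAIN instead of ENOSYS. */
int stable_tids(const char *path, int max_tries, pid_t **out)
{
    pid_t *prev = NULL;
    int nprev = read_tids(path, &prev);
    if (nprev < 0)
        return -1;
    qsort(prev, (size_t)nprev, sizeof(*prev), cmp_tid);

    for (int i = 0; i < max_tries; i++) {
        pid_t *cur = NULL;
        int ncur = read_tids(path, &cur);
        if (ncur < 0) {
            free(prev);
            return -1;
        }
        qsort(cur, (size_t)ncur, sizeof(*cur), cmp_tid);
        int same = ncur == nprev &&
                   memcmp(cur, prev, (size_t)ncur * sizeof(*cur)) == 0;
        free(prev);
        prev = cur;
        nprev = ncur;
        if (same) {
            *out = prev;
            return nprev;
        }
    }
    free(prev);
    errno = EAGAIN;
    return -1;
}
```

Even two matching scans are not a true atomic snapshot, since a thread can still start or exit right after the second scan. The point of the sketch is only that a list which keeps changing can be reported as a temporary condition rather than as "not supported".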

An alternative explanation could be that the retry logic is correct but the implementation relies on readdir(), which is documented to not be thread-safe: http://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
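
For the readdir() theory, the usual distinction is between threads sharing one DIR stream and threads each using their own. The sketch below shows the second pattern: every thread calls opendir() itself and only ever touches its private stream. It assumes, as glibc's documentation is generally understood to say, that concurrent readdir() calls on different streams are safe; whether hwloc 1.4.1 shares a stream between threads is not something this example establishes. The names count_job, count_entries() and count_in_threads() are hypothetical.

```c
#include <dirent.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

struct count_job {
    const char *path;   /* directory to scan */
    int count;          /* number of entries seen, or -1 on failure */
};

/* Thread body: open a private DIR stream, count its entries, close it.
 * No DIR pointer is ever shared between threads. */
static void *count_entries(void *arg)
{
    struct count_job *job = arg;
    job->count = -1;

    DIR *dir = opendir(job->path);
    if (!dir)
        return NULL;

    int n = 0;
    while (readdir(dir) != NULL)
        n++;                /* includes "." and ".." */
    closedir(dir);

    job->count = n;
    return NULL;
}

/* Hypothetical driver: scan the same directory from nthreads threads at
 * once and store each thread's entry count in counts[]. Returns 0 if all
 * threads were started, -1 otherwise. */
int count_in_threads(const char *path, int nthreads, int *counts)
{
    pthread_t th[MAX_THREADS];
    struct count_job jobs[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;

    int started = 0;
    for (; started < nthreads; started++) {
        jobs[started].path = path;
        jobs[started].count = -1;
        if (pthread_create(&th[started], NULL, count_entries,
                           &jobs[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(th[i], NULL);
        counts[i] = jobs[i].count;
    }
    return started == nthreads ? 0 : -1;
}
```

On a directory whose contents do not change during the scan, every thread should report the same count. Differing counts on /proc/<pid>/task would point at the directory changing underneath the scan rather than at readdir() itself.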

Regards,
Vlad