Hardware Locality Users' Mailing List Archives

Subject: Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-04-21 17:26:07

On 21/04/2012 23:08, Vlad wrote:
> Greetings,
> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible
> concurrency issue not covered by the "Thread Safety" guidelines:
> - I start a small number (4) of threads, each of which does some work
> and periodically executes hwloc_get_last_cpu_location() with
> - occasionally, one or two of those threads will see the call fail
> with ENOSYS (even though the same call has already executed
> successfully a number of times)
> These errors are transient and seem to occur only when some of the
> threads in the group are terminating. I've skimmed through the
> implementation in topology-linux.c and it seems plausible to me that
> the errors could be caused by failure to read /proc state "atomically"
> in the presence of concurrent thread starts/exits.
> Of course, the latter is hard (impossible?) to do because the state
> always changes and a snapshot can only be obtained with a single
> read() (which in turn would require knowing how many thread entries to
> expect in advance). However, returning ENOSYS in such cases does not
> seem intended but rather a flaw in the retry logic. Similar issues may be
> present with other API methods that rely on
> hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().

Can you try the attached patch? It doesn't abort the loop immediately on
per-tid errors anymore. This may work better when threads disappear. I
don't remember if the retry logic was written while thinking about
adding threads only or about adding and removing threads.
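
To illustrate the idea (this is not the attached patch and not hwloc's actual code, only a minimal sketch under stated assumptions): the loop below walks /proc/self/task and keeps going when a per-tid step fails, since the thread may simply have exited between readdir() and the read. The names foreach_tid_tolerant(), last_cpu_of_tid() and vanished_tid() are hypothetical. The sketch assumes Linux, and that field 39 of the per-thread stat file is the CPU the thread last ran on, as documented in proc(5).

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-tid callback: 0 on success, -1 with errno set on failure. */
typedef int (*tid_cb_t)(pid_t tid, void *data);

/* Visit every thread listed in /proc/self/task. A failure on one tid is
 * counted in *failed but does not stop the loop. Returns the number of tids
 * handled successfully, or -1 if the directory itself cannot be opened. */
static int foreach_tid_tolerant(tid_cb_t cb, void *data, int *failed)
{
    DIR *dir = opendir("/proc/self/task");
    struct dirent *de;
    int ok = 0;

    if (failed)
        *failed = 0;
    if (!dir)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(de->d_name, &end, 10);

        if (end == de->d_name || *end != '\0' || tid <= 0)
            continue;               /* skip ".", ".." and non-numeric names */
        if (cb((pid_t)tid, data) == 0)
            ok++;
        else if (failed)
            (*failed)++;            /* thread probably exited: keep going */
    }
    closedir(dir);
    return ok;
}

/* Example callback: store the last CPU of `tid` into *(int *)data. */
static int last_cpu_of_tid(pid_t tid, void *data)
{
    char path[64], buf[1024], *p;
    size_t n;
    int i;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/self/task/%ld/stat", (long)tid);
    f = fopen(path, "r");
    if (!f)
        return -1;                  /* typically ENOENT for a vanished thread */
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* The comm field may contain spaces, so start from the last ')'. */
    p = strrchr(buf, ')');
    for (i = 0; p && i < 37; i++)   /* 37 spaces later we are before field 39 */
        p = strchr(p + 1, ' ');
    if (!p) {
        errno = EINVAL;
        return -1;
    }
    *(int *)data = atoi(p + 1);
    return 0;
}

/* Example callback that simulates a thread disappearing mid-scan. */
static int vanished_tid(pid_t tid, void *data)
{
    (void)tid;
    (void)data;
    errno = ESRCH;
    return -1;
}
```

Calling foreach_tid_tolerant(last_cpu_of_tid, &cpu, &failed) from any thread should report at least one success. With vanished_tid every entry is counted as failed, but the scan still completes instead of returning an error to the caller. Whether a caller should then retry or report a partial result is a separate design choice that this sketch does not settle.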

If the patch doesn't help, can you send your code to help debug things?

> An alternative explanation could be that the retry logic is correct
> but the implementation relies on readdir(), which is documented to not
> be thread-safe:

I don't think this can happen. Your threads should not be accessing the same
DIR stream here.
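
For what it's worth, here is a small sketch of that pattern: each thread calls opendir() itself, so no DIR stream is ever shared between threads. It assumes Linux with glibc, whose readdir(3) man page states that concurrent calls on different directory streams are thread-safe. The names count_task_entries() and scan_concurrently() are made up for the example and are not hwloc functions.

```c
#include <dirent.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

/* Thread body: open a private DIR stream, count the tid entries, close it.
 * The result is written to the int that `arg` points to (-1 on error). */
static void *count_task_entries(void *arg)
{
    int *count = arg;
    DIR *dir = opendir("/proc/self/task");
    struct dirent *de;

    *count = -1;
    if (!dir)
        return NULL;

    *count = 0;
    while ((de = readdir(dir)) != NULL)
        if (de->d_name[0] != '.')   /* skip "." and ".." */
            (*count)++;
    closedir(dir);
    return NULL;
}

/* Run NTHREADS scanners concurrently, each with its own stream.
 * Returns the smallest count any thread saw, or -1 on failure. */
static int scan_concurrently(int counts[NTHREADS])
{
    pthread_t th[NTHREADS];
    int i, started = 0, min;

    for (i = 0; i < NTHREADS; i++) {
        counts[i] = -1;
        if (pthread_create(&th[i], NULL, count_task_entries, &counts[i]) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(th[i], NULL);
    if (started != NTHREADS)
        return -1;

    min = counts[0];
    for (i = 1; i < NTHREADS; i++)
        if (counts[i] < min)
            min = counts[i];
    return min;
}
```

The counts may differ from thread to thread because other threads start and exit during the scan. That is the /proc snapshot problem discussed above, not a readdir() race.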