Open MPI logo

Hardware Locality Users' Mailing List Archives

  |   Home   |   Support   |   FAQ   |  

This web mail archive is frozen.

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Click here to be taken to the new web archives of this list; it includes all the mails that are in this frozen archive plus all new mails that have been sent to the list since it was migrated to the new archives.

Subject: Re: [hwloc-users] possible concurrency issue with reading /proc data on Linux
From: Vlad (vlad_at_[hidden])
Date: 2012-04-21 17:36:29


On Apr 21, 2012, at 5:26 PM, Brice Goglin wrote:

> On 21/04/2012 23:08, Vlad wrote:
>>
>> Greetings,
>>
>> I use hwloc-1.4.1 stable on Red Hat 5 and am seeing a possible concurrency issue not covered by the "Thread Safety" guidelines:
>>
>> - I start a small number (4) of threads, each of which does some work and periodically executes hwloc_get_last_cpu_location() with HWLOC_CPUBIND_PROCESS
>> - occasionally, one or two of those threads will see the call fail with ENOSYS (even though the same call has already executed successfully a number of times)
>>
>> These errors are transient and seem to occur only when some of the threads in the group are terminating. I've skimmed through the implementation in topology-linux.c and it seems plausible to me that the errors could be caused by failure to read /proc state "atomically" in the presence of concurrent thread starts/exits.
>>
>> Of course, the latter is hard (impossible ?) to do because the state always changes and a snapshot can only be obtained with a single read() (which in turn would require knowing how many thread entries to expect in advance). However, returning ENOSYS in such cases does not seems intended but rather a flaw in retry logic. Similar issues may be present with other API methods that rely on hwloc_linux_foreach_proc_tid() or hwloc_linux_get_proc_tids().
>
> Can you try the attached patch? It doesn't abort the loop immediately on per-tid errors anymore. This may work better when threads disappear. I don't remember if the retry logic was written while thinking about adding threads only or about adding and removing threads.
>
> If the patch doesn't help, can you send your code to help debug things?

Will try this within a day or two. At the moment I am simply using a retry loop on ENOSYS and usually no more than one retry is needed.

>
>> An alternative explanation could be that the retry logic is correct but the implementation relies on readdir(), which is documented to not be thread-safe: http://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
>>
>
> I don't this can happen. Your threads should not be accessing the same DIR stream here.

You are probably correct. I was thinking of this code from https://svn.open-mpi.org/trac/hwloc/browser/trunk/src/topology-linux.c:

425 hwloc_linux_get_proc_tids(DIR *taskdir, unsigned *nr_tidsp, pid_t ** tidsp)
426 {
427 struct dirent *dirent;
428 unsigned nr_tids = 0;
429 unsigned max_tids = 32;
430 pid_t *tids;
431 struct stat sb;
432
433 /* take the number of links as a good estimate for the number of tids */
434 if (fstat(dirfd(taskdir), &sb) == 0)
435 max_tids = sb.st_nlink;
436
437 tids = malloc(max_tids*sizeof(pid_t));
438 if (!tids) {
439 errno = ENOMEM;
440 return -1;
441 }
442
443 rewinddir(taskdir);
444
445 while ((dirent = readdir(taskdir)) != NULL) {

"taskdir" here is /proc/<pid>/task, correct? In which case the threads will be doing readdir() on the same DIR stream...

>
> Thanks
> Brice
>
> <fix_tids.patch>_______________________________________________
> hwloc-users mailing list
> hwloc-users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users