
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] opal_os_dirpath_create: Error: Unable to create the, sub-directory
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-03 18:44:13


On Feb 3, 2014, at 2:01 PM, Eric Chamberland <Eric.Chamberland_at_[hidden]> wrote:

> Hi Ralph,
>
> On 02/03/2014 04:20 PM, Ralph Castain wrote:
>> On Feb 3, 2014, at 1:13 PM, Eric Chamberland <Eric.Chamberland_at_[hidden]> wrote:
>>
>>> On 02/03/2014 03:59 PM, Ralph Castain wrote:
>>>> Very strange - even if you kill the job with SIGTERM, or have processes that segfault, OMPI should clean itself up and remove those session directories. Granted, the 1.6 series isn't as good about doing so as the 1.7 series, but it at least to-date has done pretty well.
>>> OK, one more piece of information that may matter: all sequential tests are launched *without* mpiexec... I don't know whether the "cleanup" phase is done by mpiexec or by the binaries...
>> Ah, yes that would be a source of the problem! We can't guarantee cleanup if you just kill the procs or they segfault *unless* mpiexec is used to launch the job. What are you using to launch? Most resource managers provide an "epilog" capability for precisely this purpose as all MPIs would display the same issue.
> For the sequential jobs, we just launch the tests on the "command line"... no resource manager is ever used. For the jobs that require more than 1 process, we have "mpiexec -n ..." added to the command line...

Understood. FWIW, if those sequential jobs call "MPI_Init", then they will create a session directory tree. I've been removing that in the 1.7 series so it only gets created when needed, but not in the 1.6 series.
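For sequential runs launched without mpiexec, an epilog-style check could at least report leftovers. The following is a minimal sketch, not anything Open MPI ships: the function name `find_stray_files` is hypothetical, and it assumes that any non-directory entry at or directly below /tmp/openmpi-sessions-${USER}* is a candidate for manual cleanup. It only lists paths and deletes nothing.

```python
import getpass
import os


def find_stray_files(tmp_root="/tmp", user=None):
    """List non-directory entries at or directly below openmpi-sessions-<user>*.

    Assumption: only sub-directories are expected at these levels, so any
    other entry is reported as a candidate for manual inspection.
    """
    prefix = "openmpi-sessions-" + (user or getpass.getuser())
    stray = []
    for entry in sorted(os.scandir(tmp_root), key=lambda e: e.name):
        if not entry.name.startswith(prefix):
            continue
        if not entry.is_dir(follow_symlinks=False):
            # A file sitting where the session directory itself should be.
            stray.append(entry.path)
            continue
        try:
            children = sorted(os.scandir(entry.path), key=lambda e: e.name)
        except OSError:
            # Unreadable directory (for example, owned by another user): skip.
            continue
        for child in children:
            if not child.is_dir(follow_symlinks=False):
                stray.append(child.path)
    return stray
```

Calling `find_stray_files()` with no arguments scans /tmp for the current user; a test harness could run it after each sequential test and flag any path it returns.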

>
>>> which should delete files that shouldn't exist... ;-)
>>>
>>> But, IMHO, I still think OpenMPI should "choose" another directory name if it can't create it because a stray file with that name exists!
>> We could do that - but now we get into the bottomless pit of trying every possible combination of directory names, and ensuring that every process comes up with the same answer! Remember, the session dir is where the shared memory regions rendezvous, so every process on a node would have to find the same place.
> ok. Just for my knowledge: does that mean that if I launch 2 processes on a single node and they have to communicate, they will do it through the files in /tmp?

They won't communicate via the files - they just use the files as a rendezvous point to exchange shared memory region pointers.
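The rendezvous idea can be illustrated with a small sketch. This is not how Open MPI implements it (OMPI is written in C and uses its own shared memory components); it is a Python analogy under stated assumptions: the file name `rendezvous` and the helper names are hypothetical, and the file carries a shared memory region *name* rather than a pointer. The point is that the file is only read once to locate the region; the data itself never goes through /tmp.

```python
import os
import tempfile
from multiprocessing import shared_memory


def publish_region(session_dir, size=64):
    """Create a shared memory region and advertise its name in a file."""
    region = shared_memory.SharedMemory(create=True, size=size)
    with open(os.path.join(session_dir, "rendezvous"), "w") as f:
        f.write(region.name)
    return region


def attach_region(session_dir):
    """Read the advertised name and attach to the existing region."""
    with open(os.path.join(session_dir, "rendezvous")) as f:
        name = f.read().strip()
    return shared_memory.SharedMemory(name=name)


def demo():
    """Write through one handle and read through the other."""
    with tempfile.TemporaryDirectory() as session_dir:
        owner = publish_region(session_dir)
        peer = attach_region(session_dir)
        try:
            owner.buf[:5] = b"hello"      # data goes through shared memory
            return bytes(peer.buf[:5])    # the file was only used to find it
        finally:
            peer.close()
            owner.close()
            owner.unlink()
```

In a real job the two handles would live in different processes on the same node, which is why every process has to agree on the session directory location.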

>
>>> How can all users be aware that they have to cleanup such files?
>> Given how long 1.6.x has been out there, and that this is about the only time I've heard of a problem, I'm not sure this is a general enough issue to merit the concern
> Ok. I just verified on 8 other computers/architectures that run the same tests: only 1 of them has files at the directory level of /tmp/openmpi-sessions-${USER}*
> Since we have been doing this kind of testing for many years, I also agree it is not a widespread issue... But it just occurred 2 times in the last 3 days!!! :-/

Bummer :-(

>>
>>> Maybe a good compromise would be to have the error message say that a file exists with the same name as the chosen directory?
>> I can make that change - good suggestion.
> ok, thanks!
>
>>
>>> Or add a new entry to the FAQ to help users find the workaround you proposed... ;-)
>> we can try to do that too
>
> If I may suggest a test for the behavior of 1.7.x... what about this: have a test case that creates a bunch of files (from 0 to 65536) in /tmp/openmpi-sessions-${USER}... before launching an executable without mpirun... >:)

Ick - it will actually only conflict if/when the pids wrap, so it's a pretty rare issue.
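The conflict itself, a regular file occupying a name where a directory must be created, is easy to detect and report explicitly. Below is a minimal Python sketch of that kind of error message; it is not the OMPI code (opal_os_dirpath_create is C), and the names `SessionDirConflict` and `create_session_dir` are hypothetical.

```python
import os


class SessionDirConflict(OSError):
    """Raised when a non-directory blocks the path of a session directory."""


def find_blocker(path):
    """Return the first existing non-directory on the way up from path."""
    probe = path
    while True:
        if os.path.lexists(probe) and not os.path.isdir(probe):
            return probe
        parent = os.path.dirname(probe)
        if parent == probe:
            return None
        probe = parent


def create_session_dir(path):
    """Create path and its parents, naming the blocking file on a conflict."""
    try:
        os.makedirs(path, exist_ok=True)
    except (FileExistsError, NotADirectoryError) as exc:
        blocker = find_blocker(path) or path
        raise SessionDirConflict(
            f"cannot create directory {path!r}: "
            f"{blocker!r} exists and is not a directory"
        ) from exc
    return path
```

With a message like this, a user who hits the problem sees which stale file to remove instead of a bare "unable to create the sub-directory" error.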

>
> Anyway, thanks a lot!
>
> Eric
>