Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [sge] tight-integration openmpi and sge: opal_os_dirpath_create failure
From: Reuti (reuti_at_[hidden])
Date: 2009-11-10 12:48:19


On 10.11.2009, at 18:20, Eloi Gaudry wrote:

> Thanks for your help Reuti,
>
> I'm using an NFS-shared directory (/opt/sge/tmp), exported from the
> master node to all other computing nodes.

It's highly advisable to have the "tmpdir" local on each node. When
you use "cd $TMPDIR" in your jobscript, everything is done locally on
the node (provided your application just creates its scratch files in
the current working directory), which will speed up the computation
and decrease the network traffic. Computing in a shared /opt/sge/tmp
is like computing in each user's home directory.
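
A minimal sketch of the "cd $TMPDIR" idea, written as a Python job script purely for illustration. It assumes SGE has set TMPDIR for the job and falls back to the system temp directory otherwise, so it also runs outside SGE. The function name and the file name scratch.dat are hypothetical.

```python
import os
import tempfile


def enter_scratch_dir():
    """Change into $TMPDIR and return its path.

    Falls back to the system temp directory when TMPDIR is unset
    (an assumption, so the sketch also runs outside of SGE).
    """
    scratch = os.environ.get("TMPDIR") or tempfile.gettempdir()
    os.chdir(scratch)
    return scratch


if __name__ == "__main__":
    where = enter_scratch_dir()
    # Relative paths now resolve inside the scratch directory, so an
    # application writing to its working directory stays off the network
    # as long as that directory is on a local disk.
    with open("scratch.dat", "w") as handle:
        handle.write("intermediate results\n")
    print("scratch file written to", os.path.join(where, "scratch.dat"))
```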

To prevent any user from removing someone else's files, the "t" flag
is set, as it is for /tmp:

drwxrwxrwt 14 root root 4096 2009-11-10 18:35 /tmp/
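
As a rough illustration, the following Python sketch checks whether a directory has the same kind of mode as /tmp, i.e. writable by everyone with the sticky bit set (1777). The function name is hypothetical; the paths in the demo part are the ones mentioned in this thread and may not exist on your machine.

```python
import os
import stat


def is_world_writable_with_sticky(path):
    """Return True if path is a directory with mode bits like /tmp (1777)."""
    mode = os.stat(path).st_mode
    world_writable = (mode & 0o777) == 0o777
    sticky = bool(mode & stat.S_ISVTX)  # the "t" shown by ls -l
    return stat.S_ISDIR(mode) and world_writable and sticky


if __name__ == "__main__":
    for candidate in ("/tmp", "/opt/sge/tmp"):
        try:
            print(candidate, is_world_writable_with_sticky(candidate))
        except OSError as err:
            print(candidate, "could not be checked:", err)
```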

Nevertheless:

> with, for /etc/exports on the server (named moe.fft):
>   /opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)
> and for /etc/fstab on the client:
>   moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14, 0 0
> Actually, the /opt/sge/tmp directory is 777 across all machines,
> so every user should be able to create a directory inside.

All of these access checks will be applied:

- on the server: what does "ls -ld /opt/sge/tmp" show?
- in the export (this seems to be fine)
- on the node (i.e., how it's mounted: cat /etc/fstab)
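
For the export part of the checks above, here is a small Python sketch that splits an /etc/exports line into its options and names the identity mapping they imply. It is deliberately simplified (no comments, quoting or line continuations), and the function names are hypothetical. One fact it relies on comes from the Linux exports(5) man page rather than from this thread: root_squash is the default when no squash option is given, so root on a client is mapped to the anonymous account, which usually shows up as nobody/nogroup.

```python
def parse_export_line(line):
    """Split one /etc/exports line into (path, {client: [options]}).

    Simplified: ignores comments, quoting and line continuations.
    """
    fields = line.split()
    path, clients = fields[0], {}
    for spec in fields[1:]:
        client, _, rest = spec.partition("(")
        clients[client] = [opt for opt in rest.rstrip(")").split(",") if opt]
    return path, clients


def squash_mode(options):
    """Name the identity mapping implied by a list of export options."""
    if "all_squash" in options:
        return "all users mapped to the anonymous account"
    if "no_root_squash" in options:
        return "no mapping"
    # exports(5): root_squash is the default when nothing else is given.
    return "root mapped to the anonymous account"


if __name__ == "__main__":
    line = "/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)"
    path, clients = parse_export_line(line)
    for client, options in clients.items():
        print(path, client, "->", squash_mode(options))
```

With the export line quoted in this thread, the sketch reports that only root is mapped. That would matter if the directory is created by a process running as root on the node, but whether this is what happens here is a guess.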

> The issue seems somehow related to the session directory created
> inside /opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q for
> example for job 29 on queue smp8.q. This subdirectory of
> /opt/sge/tmp is created with nobody:nogroup drwxr-xr-x permissions...
> which in turn forbids

Did you try to run some simple jobs before the parallel ones, and do
they work? Were the daemons (qmaster and execd) started as root?

Is the user known on the file server, i.e. the machine hosting /opt/sge?

> OpenMPI to create its subtree inside (as OpenMPI won't use
> nobody:nogroup credentials).

In SGE, the master process (the one running the job script) will
create /opt/sge/tmp/29.1.smp8.q, and so will each qrsh started
inside SGE, all using the same name. What is the definition of the
PE you use in SGE?
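
To see which owner a freshly created directory ends up with on a given file system, something like the following Python sketch could be run as the job's user on a node. It only probes what happens when the calling user creates a directory; it does not reproduce how SGE itself creates the per-job directory. The function name and the "probe-" prefix are hypothetical.

```python
import os
import tempfile


def probe_directory_owner(parent):
    """Create a throwaway subdirectory in parent and report who owns it.

    Returns (owner_uid, caller_uid, caller_can_write).
    """
    probe = tempfile.mkdtemp(prefix="probe-", dir=parent)
    try:
        owner_uid = os.stat(probe).st_uid
        caller_can_write = os.access(probe, os.W_OK | os.X_OK)
    finally:
        os.rmdir(probe)  # leave nothing behind
    return owner_uid, os.geteuid(), caller_can_write


if __name__ == "__main__":
    parent = os.environ.get("TMPDIR") or tempfile.gettempdir()
    owner, caller, writable = probe_directory_owner(parent)
    print("parent:", parent)
    print("owner uid:", owner, "caller uid:", caller, "writable:", writable)
    if owner != caller:
        print("the new directory is not owned by the caller; check the NFS id mapping")
```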

-- Reuti

> As Ralph suggested, I checked the SGE configuration, but I haven't
> found anything related to nobody:nogroup configuration so far.
>
> Eloi
>
>
> Reuti wrote:
>> Hi,
>>
>> On 10.11.2009, at 17:55, Eloi Gaudry wrote:
>>
>>> Thanks for your help Ralph, I'll double check that.
>>>
>>> As for the error message received, there might be some
>>> inconsistency:
>>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0"
>>> is the
>>
>> Often /opt/sge is shared across the nodes, while the /tmp
>> (sometimes implemented as /scratch in a partition on its own)
>> should be local on each node.
>>
>> What is the setting of "tmpdir" in your queue definition?
>>
>> If you want to share /opt/sge/tmp, everyone must be able to write
>> into this location. As it's working fine for me (with the local
>> /tmp), I assume the nobody/nogroup comes from a squash setting in
>> the /etc/exports of your master node.
>>
>> -- Reuti
>>
>>
>>> parent directory and
>>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0/53199/0/0"
>>> is the subdirectory... not the other way
>>> around.
>>>
>>> Eloi
>>>
>>>
>>>
>>> Ralph Castain wrote:
>>>> Creating a directory with such credentials sounds like a bug in
>>>> SGE to me...perhaps an SGE config issue?
>>>>
>>>> Only thing you could do is tell OMPI to use some other directory
>>>> as the root for its session dir tree - check "mpirun -h", or
>>>> ompi_info for the required option.
>>>>
>>>> But I would first check your SGE config as that just doesn't
>>>> sound right.
>>>>
>>>> On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3
>>>>> (with gridengine component).
>>>>>
>>>>> During any job submission, SGE creates a session directory in
>>>>> $TMPDIR, named after the job id and the computing node name.
>>>>> This session directory is created using nobody/nogroup
>>>>> credentials.
>>>>>
>>>>> When using OpenMPI with tight-integration, opal creates
>>>>> different subdirectories in this session directory. The issue
>>>>> I'm facing now is that OpenMPI fails to create these
>>>>> subdirectories:
>>>>>
>>>>> [charlie:03882] opal_os_dirpath_create: Error: Unable to create the sub-directory (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0) of (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0
>>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
>>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
>>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at line 273
>>>>> --------------------------------------------------------------------------
>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>> likely to abort. There are many reasons that a parallel process can
>>>>> fail during orte_init; some of which are due to configuration or
>>>>> environment problems. This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>> orte_session_dir failed
>>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>>> --------------------------------------------------------------------------
>>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
>>>>> --------------------------------------------------------------------------
>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>> likely to abort. There are many reasons that a parallel process can
>>>>> fail during orte_init; some of which are due to configuration or
>>>>> environment problems. This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>> orte_ess_set_name failed
>>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>>> --------------------------------------------------------------------------
>>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473
>>>>>
>>>>> This seems very likely related to the permissions set on $TMPDIR.
>>>>>
>>>>> I'd like to know if someone might have experienced the same or
>>>>> a similar issue and if any solution was found.
>>>>>
>>>>> Thanks for your help,
>>>>> Eloi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Eloi Gaudry
>>>>>
>>>>> Free Field Technologies
>>>>> Axis Park Louvain-la-Neuve
>>>>> Rue Emile Francqui, 1
>>>>> B-1435 Mont-Saint Guibert
>>>>> BELGIUM
>>>>>
>>>>> Company Phone: +32 10 487 959
>>>>> Company Fax: +32 10 454 626
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users