Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [sge] tight-integration openmpi and sge: opal_os_dirpath_create failure
From: Eloi Gaudry (eg_at_[hidden])
Date: 2009-11-10 12:20:07


Thanks for your help Reuti,

I'm using an NFS-shared directory (/opt/sge/tmp), exported from the
master node to all the other compute nodes.
  /etc/exports on the server (named moe.fft):
    /opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)
  /etc/fstab on the clients:
    moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14, 0 0
The /opt/sge/tmp directory itself is mode 777 on all machines, so
every user should be able to create a directory inside it.
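
To check the squash hypothesis from Reuti's reply against the export
options above, here is a minimal Python sketch. It assumes the Linux
exports(5) rule that root_squash applies unless no_root_squash or
all_squash is given; the function name squash_mode is made up for
illustration.

```python
import re


def squash_mode(export_line):
    """Map each client spec in an exports(5) line to its effective squash mode."""
    result = {}
    # A client spec has the form host(opt1,opt2,...).
    for client, opts in re.findall(r"(\S+?)\(([^)]*)\)", export_line):
        options = {o.strip() for o in opts.split(",") if o.strip()}
        if "all_squash" in options:
            mode = "all_squash"
        elif "no_root_squash" in options:
            mode = "no_root_squash"
        else:
            # Assumption: root_squash is the default when nothing else is set.
            mode = "root_squash"
        result[client] = mode
    return result


line = "/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)"
print(squash_mode(line))
```

With root_squash in effect, anything created by root on a client is
mapped to the anonymous user on the server, which would be consistent
with a directory showing up as nobody:nogroup. Whether this is what
happens in this setup is not confirmed.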

The issue seems to be related to the session directory created inside
/opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q for example for
job 29 on queue smp8.q. This subdirectory of /opt/sge/tmp is created
with nobody:nogroup ownership and drwxr-xr-x permissions, which in turn
prevents OpenMPI from creating its subtree inside (as OpenMPI won't run
with nobody:nogroup credentials).
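
The effect of those permissions can be illustrated with a short Python
sketch. It only looks at the "other" permission bits, which is the
relevant case when the job owner is neither nobody nor in nogroup; the
helper names others_can_create and describe_dir are hypothetical.

```python
import os
import stat
import tempfile


def others_can_create(mode_bits):
    """True if users outside owner and group may create entries in the directory."""
    needed = stat.S_IWOTH | stat.S_IXOTH  # write plus search permission
    return (mode_bits & needed) == needed


def describe_dir(path):
    """Return owner ids, the mode string, and whether 'other' users can create entries."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),
        "others_can_create": others_can_create(st.st_mode),
    }


# Reproduce the two modes from this thread on a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    session = os.path.join(base, "29.1.smp8.q")
    os.mkdir(session)
    for mode in (0o755, 0o777):
        os.chmod(session, mode)  # chmod sets the mode regardless of umask
        print(describe_dir(session))
```

Running describe_dir on the real session directory on a compute node
should show the same drwxr-xr-x mode together with the uid and gid it
was created with.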

As Ralph suggested, I checked the SGE configuration, but I haven't found
anything related to a nobody:nogroup setting so far.

Eloi

Reuti wrote:
> Hi,
>
> Am 10.11.2009 um 17:55 schrieb Eloi Gaudry:
>
>> Thanks for your help Ralph, I'll double check that.
>>
>> As for the error message received, there might be some inconsistency:
>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0" is the
>
> often /opt/sge is shared across the nodes, while the /tmp (sometimes
> implemented as /scratch in a partition on its own) should be local on
> each node.
>
> What is the setting of "tmpdir" in your queue definition?
>
> If you want to share /opt/sge/tmp, everyone must be able to write into
> this location. As for me it's working fine (with the local /tmp), I
> assume the nobody/nogroup comes from any squash-setting in the
> /etc/exports of your master node.
>
> -- Reuti
>
>
>> parent directory and
>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0/53199/0/0" is
>> the subdirectory... not the other way around.
>>
>> Eloi
>>
>>
>>
>> Ralph Castain wrote:
>>> Creating a directory with such credentials sounds like a bug in SGE
>>> to me...perhaps an SGE config issue?
>>>
>>> Only thing you could do is tell OMPI to use some other directory as
>>> the root for its session dir tree - check "mpirun -h", or ompi_info
>>> for the required option.
>>>
>>> But I would first check your SGE config as that just doesn't sound
>>> right.
>>>
>>> On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with
>>>> gridengine component).
>>>>
>>>> During any job submission, SGE creates a session directory in
>>>> $TMPDIR, named after the job id and the computing node name. This
>>>> session directory is created using nobody/nogroup credentials.
>>>>
>>>> When using OpenMPI with tight-integration, opal creates different
>>>> subdirectories in this session directory. The issue I'm facing now
>>>> is that OpenMPI fails to create these subdirectories:
>>>>
>>>> [charlie:03882] opal_os_dirpath_create: Error: Unable to create the
>>>> sub-directory
>>>> (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0) of
>>>> (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at
>>>> line 273
>>>> --------------------------------------------------------------------------
>>>>
>>>> It looks like orte_init failed for some reason; your parallel
>>>> process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>> orte_session_dir failed
>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>>
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
>>>> --------------------------------------------------------------------------
>>>>
>>>> It looks like orte_init failed for some reason; your parallel
>>>> process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>> orte_ess_set_name failed
>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>>
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473
>>>>
>>>> This seems very likely related to the permissions set on $TMPDIR.
>>>>
>>>> I'd like to know if someone might have experienced the same or a
>>>> similar issue and if any solution was found.
>>>>
>>>> Thanks for your help,
>>>> Eloi
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> Eloi Gaudry
>>>>
>>>> Free Field Technologies
>>>> Axis Park Louvain-la-Neuve
>>>> Rue Emile Francqui, 1
>>>> B-1435 Mont-Saint Guibert
>>>> BELGIUM
>>>>
>>>> Company Phone: +32 10 487 959
>>>> Company Fax: +32 10 454 626
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626