Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] [sge] tight-integration openmpi and sge: opal_os_dirpath_create failure
From: Eloi Gaudry (eg_at_[hidden])
Date: 2009-11-10 12:20:07


Thanks for your help Reuti,

I'm using an NFS-shared directory (/opt/sge/tmp), exported from the
master node to all the other compute nodes, with:

  /etc/exports on the server (named moe.fft):
    /opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)

  /etc/fstab on the clients:
    moe.fft:/opt/sge /opt/sge nfs rw,bg,soft,timeo=14, 0 0

The /opt/sge/tmp directory has mode 777 on all machines, so
any user should be able to create a directory inside it.

The issue seems to be related to the session directory created inside
/opt/sge/tmp, let's say /opt/sge/tmp/29.1.smp8.q for
job 29 on queue smp8.q. This subdirectory of /opt/sge/tmp is created
with owner nobody:nogroup and permissions drwxr-xr-x, which in turn
prevents OpenMPI from creating its subtree inside it (as OpenMPI runs
with the submitting user's credentials, not nobody:nogroup).
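
To confirm this from inside a job, a small probe along the following
lines could be run on the compute node. It is only a sketch: the helper
names describe_dir and can_create_subdir are made up for illustration,
and it assumes that $TMPDIR points at the per-job directory. It attempts
a real mkdir rather than trusting os.access(), since the client-side
answer is not always reliable on NFS.

```python
import grp
import os
import pwd
import stat
import tempfile


def can_create_subdir(path):
    """Try to create (and then remove) a subdirectory inside path."""
    try:
        probe = tempfile.mkdtemp(prefix="probe-", dir=path)
    except OSError:
        return False
    os.rmdir(probe)
    return True


def describe_dir(path):
    """Report owner, group, mode string and whether we can mkdir inside path."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid has no passwd entry on this host
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid has no group entry on this host
    return {
        "owner": owner,
        "group": group,
        "mode": stat.filemode(st.st_mode),
        "can_create": can_create_subdir(path),
    }


if __name__ == "__main__":
    # Assumption: inside an SGE job, TMPDIR is the per-job directory,
    # e.g. /opt/sge/tmp/29.1.smp8.q in the case described above.
    job_tmp = os.environ.get("TMPDIR", tempfile.gettempdir())
    print(job_tmp, describe_dir(job_tmp))
```

If the job directory shows up as nobody:nogroup with drwxr-xr-x and
can_create is False for the submitting user, that matches the failure
reported by opal_os_dirpath_create.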

As Ralph suggested, I checked the SGE configuration, but so far I haven't
found any setting related to nobody:nogroup.
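
Reuti's hint about a squash setting can be checked mechanically. On a
Linux NFS server, root_squash is in effect unless no_root_squash is
given explicitly, so an export line without either option still maps
root on the clients to the anonymous user (commonly nobody:nogroup).
Whether that explains the ownership seen here depends on the job
directory being created by a root-owned process on the compute node,
which is an assumption and not something confirmed in this thread. The
helper name squash_report is hypothetical.

```python
import re

# One "client(options)" entry on an /etc/exports line.
EXPORT_CLIENT = re.compile(r"(?P<client>[^\s()]+)\((?P<opts>[^)]*)\)")


def squash_report(exports_line):
    """Map each client on one /etc/exports line to its effective squash mode."""
    line = exports_line.split("#", 1)[0].strip()  # drop trailing comments
    if not line:
        return {}
    parts = line.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    report = {}
    for m in EXPORT_CLIENT.finditer(rest):
        opts = {o.strip() for o in m.group("opts").split(",") if o.strip()}
        if "all_squash" in opts:
            mode = "all_squash"
        elif "no_root_squash" in opts:
            mode = "no_root_squash"
        else:
            mode = "root_squash"  # the default when nothing is specified
        report[m.group("client")] = mode
    return report


if __name__ == "__main__":
    line = "/opt/sge 192.168.0.0/255.255.255.0(rw,sync,no_subtree_check)"
    print(squash_report(line))
    # prints {'192.168.0.0/255.255.255.0': 'root_squash'}
```

For the export line above this reports root_squash, so anything root
creates under /opt/sge from a client would end up owned by the
anonymous user.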

Eloi

Reuti wrote:
> Hi,
>
> On 10.11.2009 at 17:55, Eloi Gaudry wrote:
>
>> Thanks for your help Ralph, I'll double check that.
>>
>> As for the error message received, there might be some inconsistency:
>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0" is the
>
> often /opt/sge is shared across the nodes, while the /tmp (sometimes
> implemented as /scratch in a partition on its own) should be local on
> each node.
>
> What is the setting of "tmpdir" in your queue definition?
>
> If you want to share /opt/sge/tmp, everyone must be able to write into
> this location. As it's working fine for me (with the local /tmp), I
> assume the nobody/nogroup comes from a squash setting in the
> /etc/exports of your master node.
>
> -- Reuti
>
>
>> parent directory and
>> "/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0/53199/0/0" is
>> the subdirectory... not the other way around.
>>
>> Eloi
>>
>>
>>
>> Ralph Castain wrote:
>>> Creating a directory with such credentials sounds like a bug in SGE
>>> to me...perhaps an SGE config issue?
>>>
>>> Only thing you could do is tell OMPI to use some other directory as
>>> the root for its session dir tree - check "mpirun -h", or ompi_info
>>> for the required option.
>>>
>>> But I would first check your SGE config as that just doesn't sound
>>> right.
>>>
>>> On Nov 10, 2009, at 9:40 AM, Eloi Gaudry wrote:
>>>
>>>> Hi there,
>>>>
>>>> I'm experiencing some issues using GE6.2U4 and OpenMPI-1.3.3 (with
>>>> gridengine component).
>>>>
>>>> During any job submission, SGE creates a session directory in
>>>> $TMPDIR, named after the job id and the computing node name. This
>>>> session directory is created using nobody/nogroup credentials.
>>>>
>>>> When using OpenMPI with tight-integration, opal creates different
>>>> subdirectories in this session directory. The issue I'm facing now
>>>> is that OpenMPI fails to create these subdirectories:
>>>>
>>>> [charlie:03882] opal_os_dirpath_create: Error: Unable to create the
>>>> sub-directory
>>>> (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0) of
>>>> (/opt/sge/tmp/25.1.smp8.q/openmpi-sessions-eg_at_charlie_0
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/util/session_dir.c at line 101
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/util/session_dir.c at line 425
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../../../../openmpi-1.3.3/orte/mca/ess/hnp/ess_hnp_module.c at
>>>> line 273
>>>> --------------------------------------------------------------------------
>>>>
>>>> It looks like orte_init failed for some reason; your parallel
>>>> process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>> orte_session_dir failed
>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>>
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../openmpi-1.3.3/orte/runtime/orte_init.c at line 132
>>>> --------------------------------------------------------------------------
>>>>
>>>> It looks like orte_init failed for some reason; your parallel
>>>> process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during orte_init; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>> orte_ess_set_name failed
>>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>>
>>>> [charlie:03882] [[53199,0],0] ORTE_ERROR_LOG: Error in file
>>>> ../../../../openmpi-1.3.3/orte/tools/orterun/orterun.c at line 473
>>>>
>>>> This seems very likely related to the permissions set on $TMPDIR.
>>>>
>>>> I'd like to know if someone might have experienced the same or a
>>>> similar issue and if any solution was found.
>>>>
>>>> Thanks for your help,
>>>> Eloi
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> Eloi Gaudry
>>>>
>>>> Free Field Technologies
>>>> Axis Park Louvain-la-Neuve
>>>> Rue Emile Francqui, 1
>>>> B-1435 Mont-Saint Guibert
>>>> BELGIUM
>>>>
>>>> Company Phone: +32 10 487 959
>>>> Company Fax: +32 10 454 626
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626