Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error: Unable to create the sub-directory (/tmp/openmpi etc...)
From: Reuti (reuti_at_[hidden])
Date: 2013-12-17 17:26:02


Hi,

On 17.12.2013 at 22:32, Brandon Turner wrote:

> I've been struggling with this problem for a few days now and am out of ideas. I am submitting a job using TORQUE on a Beowulf cluster. One step involves running mpiexec, and that is where this error occurs. I've found some similar queries from the past:
>
> http://www.open-mpi.org/community/lists/users/att-11378/attachment
>
> http://www.open-mpi.org/community/lists/users/2013/09/22608.php
>
> http://www.open-mpi.org/community/lists/users/2009/11/11129.php
>
> I'm new to using Open MPI, so much of this is unfamiliar to me. However, my /tmp folder does not seem to be full as far as I can tell. I've tried reassigning the temporary directory using the MCA parameter (i.e. mpiexec --mca orte_tmpdir_base /home/pathA/pathB process argument1 argument2 argument3), but that was unsuccessful as well. Similarly, if thousands of sub-directories are being created, I have no idea where those would be, in case this is some ext3 limit issue. It's worth noting that when I submit this job, it works on some occasions and not on others. I suspect it has something to do with the nodes that I am assigned and some property of certain nodes.
>
> It never used to have this problem until a few days ago, and now I mostly can't get it to work except on a few occasions, which makes me think that perhaps it is a node-specific issue. Any thoughts or suggestions would be much appreciated!

a) As it's not your personal /tmp but a machine-wide one, it might be full on this particular node.
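
To check a), one could run something like the following on the suspect node. This is only a minimal sketch, assuming Python is available there; the function name tmp_usage is made up for illustration. It reports free inodes as well as free bytes, since a file system can run out of either.

```python
import os

def tmp_usage(path="/tmp"):
    """Return the free bytes and free inodes available to a normal user on `path`."""
    st = os.statvfs(path)
    return {
        # Blocks available to unprivileged users times the fragment size.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users (may be 0 on some file systems).
        "free_inodes": st.f_favail,
    }

# Example use on the node in question:
#   print(tmp_usage("/tmp"))
```

If either number is at or near zero on the failing node but not on a working one, a full /tmp is the likely cause.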

b) Or the admin changed the permissions on /tmp so that only Torque can create temporary directories therein, and any additional one created by a batch job should go to $TMPDIR, which Torque creates and removes for your particular job. It might be that Open MPI is not tightly integrated into your Torque installation. Did you ever have the chance to peek on a node and check whether your MPI processes are children of pbs_mom rather than of an ssh connection?
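
To tell a) and b) apart, a small probe run from inside the batch job can try to create a directory under /tmp and under $TMPDIR and report why it failed. Again only a sketch under assumptions: the helper names and the "ompi-probe-" prefix are invented here, and it does not reproduce what Open MPI itself does when it sets up its session directory.

```python
import errno
import os
import tempfile

def can_create_subdir(base):
    """Try to create and then remove a scratch directory under `base`.

    Returns (True, None) on success, or (False, errno_name) on failure.
    """
    try:
        path = tempfile.mkdtemp(prefix="ompi-probe-", dir=base)
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    os.rmdir(path)
    return True, None

def probe(bases):
    """Probe several candidate base directories and return one result per directory."""
    return {base: can_create_subdir(base) for base in bases}

# Example use inside the job script (TMPDIR is assumed to be set by Torque):
#   print(probe(["/tmp", os.environ.get("TMPDIR", "/tmp")]))
```

A result of EACCES or EPERM for /tmp would point to changed permissions, ENOSPC to a full file system, and success under $TMPDIR would suggest pointing orte_tmpdir_base there.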

-- Reuti

> Thanks,
>
> Brandon
>
> PS I've copied the full error output below:
> [bc11bl08.deac.wfu.edu:31532] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-turnbe8_at_[hidden]_0) of (/tmp/openmpi-sessions-turnbe8_at_[hidden]_0/2243/0/7), mkdir failed [1]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 106
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 399
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 283
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/ess/tm/ess_tm_module.c at line 112
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../orte/runtime/orte_init.c at line 128
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to [[INVALID],INVALID]
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file ../../orte/util/show_help.c at line 627
> [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in file ../../orte/orted/orted_main.c at line 357
> =>> PBS: job killed: walltime 3626 exceeded limit 3600
> Terminated
> mpiexec: killing job...