Open MPI User's Mailing List Archives

From: Bill Johnstone (beejstone3_at_[hidden])
Date: 2007-07-18 14:19:33

--- Ralph Castain <rhc_at_[hidden]> wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so
> we can
> better understand why it is failing - will try to send it to you on
> Wed.

Thank you for the patch you sent.

I solved the problem. It was a head-slapper of an error. It turned out
that I had forgotten that the permissions on the root directory of a
mounted filesystem override the permissions of the mount point. As I
mentioned, these machines have an NFS root filesystem. In that
filesystem, /tmp has permissions 1777. However, when each node mounts
its local temp partition on /tmp, the mount point takes on the
permissions of the root directory of that partition.

In this case, I had forgotten to apply permissions 1777 to /tmp after
mounting on each machine. As a result, /tmp really did not have the
appropriate permissions for mpirun to write to it as necessary.
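
For anyone who wants to check for this condition after mounting, here is
a minimal sketch in C, assuming a POSIX system. The function name
dir_is_mode_1777 is made up for illustration and is not part of Open MPI.
It reports whether a directory has exactly the 1777 permissions (sticky
bit plus read/write/execute for everyone) that a shared /tmp normally
carries.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper (not Open MPI code).
 * Returns 1 if `path` is a directory whose permission bits are exactly
 * 01777, 0 if it exists but is not such a directory, and -1 if stat()
 * fails (the reason is printed to stderr). */
int dir_is_mode_1777(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    if (!S_ISDIR(st.st_mode)) {
        return 0;
    }
    /* Mask off the file-type bits and compare only the permission bits. */
    return (st.st_mode & 07777) == 01777 ? 1 : 0;
}
```

Calling dir_is_mode_1777("/tmp") on each node after the local partition
is mounted would have flagged the problem described above, because
stat() reports the permissions of the mounted filesystem's root rather
than those of the underlying mount point.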

Your patch helped me figure this out. Technically, I should have been
able to figure it out from the messages you'd already sent to the
mailing list, but it wasn't until I saw the line in session_dir.c where
the error was occurring that I realized it had to be some kind of
permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
It looks like orte_init failed for some reason; your parallel process
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

The statement starting at line 108 of session_dir.c is:

if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode)))
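
To illustrate how a directory-creation call like this can fail on a
permissions problem and still report a useful reason, here is a rough
sketch in C, assuming a POSIX system. make_dirpath is a hypothetical
stand-in written for this example. It is not the implementation of
opal_os_dirpath_create, which I have not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not Open MPI code): create every component of
 * `path`, much like `mkdir -p`. The requested `mode` is still subject to
 * the process umask. Returns 0 on success; on failure prints the failing
 * component and the OS error text to stderr and returns -1. */
int make_dirpath(const char *path, mode_t mode)
{
    char buf[4096];
    struct stat st;
    size_t len = strlen(path);

    if (len == 0 || len >= sizeof(buf)) {
        fprintf(stderr, "make_dirpath: bad path length\n");
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Walk the string; at each '/' (and at the end) create the prefix. */
    for (char *p = buf + 1; ; ++p) {
        if (*p != '/' && *p != '\0') {
            continue;
        }
        char saved = *p;
        *p = '\0';
        if (mkdir(buf, mode) != 0 && errno != EEXIST) {
            fprintf(stderr, "make_dirpath: cannot create %s: %s\n",
                    buf, strerror(errno));
            return -1;
        }
        *p = saved;
        if (saved == '\0') {
            break;
        }
    }

    /* EEXIST is also returned for non-directories, so verify the result. */
    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode)) {
        fprintf(stderr, "make_dirpath: %s is not a directory\n", path);
        return -1;
    }
    return 0;
}
```

With a /tmp that the user cannot write to, the mkdir() call in a helper
like this would fail with EACCES, and the message would name both the
directory and "Permission denied" instead of only a file and line
number.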

Three further points:

-Is there some reason ORTE can't bail out gracefully upon this error,
instead of hanging like it was doing for me?

-I think leaving in the extra debug logging code you sent me in the
patch for future Open MPI versions would be a good idea to help
troubleshoot problems like this.

-It would be nice to see "--debug-daemons" added to the Troubleshooting
section of the FAQ on the web site.

Thank you very, very much for your help, Ralph and everyone else who replied.
