Open MPI User's Mailing List Archives

From: Bill Johnstone (beejstone3_at_[hidden])
Date: 2007-07-18 14:19:33


--- Ralph Castain <rhc_at_[hidden]> wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so we
> can better understand why it is failing - will try to send it to you
> on Wed.

Thank you for the patch you sent.

I solved the problem. It was a head-slapper of an error. It turned out
that I had forgotten that the permissions on a mounted filesystem's root
directory override the permissions of the mount point. As I mentioned,
these machines have an NFS root filesystem. In that filesystem, /tmp has
permissions 1777. However, when each node mounts its local temp
partition on /tmp, the mount point takes on the permissions of that
partition's root directory.

In this case, I had forgotten to apply permissions 1777 to /tmp after
mounting on each machine. As a result, /tmp really did not have the
appropriate permissions for mpirun to write to it as necessary.
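A check like the following could catch this condition on each node before launching a job. It is only a minimal sketch, assuming a POSIX system; the helper names mode_is_1777 and dir_is_shared_tmp are made up for illustration and are not part of Open MPI.

```c
#define _XOPEN_SOURCE 700
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: true when the permission bits are exactly 1777,
 * i.e. the sticky bit plus read/write/execute for user, group and other. */
static bool mode_is_1777(mode_t mode)
{
    return (mode & 07777) == (S_ISVTX | S_IRWXU | S_IRWXG | S_IRWXO);
}

/* Hypothetical helper: returns 1 if path is a directory with mode 1777,
 * 0 if it exists but is not such a directory, and -1 if stat() fails.
 * stat() on a mount point reports the root directory of the mounted
 * filesystem, not the directory hidden underneath it. */
static int dir_is_shared_tmp(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISDIR(st.st_mode))
        return 0;
    return mode_is_1777(st.st_mode) ? 1 : 0;
}
```

Calling dir_is_shared_tmp("/tmp") after the local partition is mounted would return 0 in the situation described above, and 1 once mode 1777 has been applied to the mounted /tmp.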

Your patch helped me figure this out. Technically, I should have been
able to figure it out from the messages you'd already sent to the
mailing list, but it wasn't until I saw the line in session_dir.c where
the error was occurring that I realized it had to be some kind of
permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 52
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -1 instead of ORTE_SUCCESS.

The code starting at line 108 of session_dir.c is:

if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode)))
{
        ORTE_ERROR_LOG(ret);
}
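To illustrate why a call like this fails when /tmp is not writable, here is a rough sketch of a recursive directory-creation routine that reports which path component failed and why. This is not Open MPI's implementation of opal_os_dirpath_create; the function name create_dirpath and its behavior are assumptions made purely for illustration.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create every component of path, like `mkdir -p`.
 * Returns 0 on success. On failure it prints the failing component and
 * the reason to stderr, then returns -1 with errno preserved. */
static int create_dirpath(const char *path, mode_t mode)
{
    char buf[PATH_MAX];
    size_t len = strlen(path);

    if (len == 0) {
        errno = ENOENT;
        return -1;
    }
    if (len >= sizeof(buf)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Start at buf + 1 so a leading '/' is never treated as a separator. */
    for (char *p = buf + 1; ; p++) {
        if (*p == '/' || *p == '\0') {
            char saved = *p;

            *p = '\0';  /* temporarily cut the path at this component */
            if (mkdir(buf, mode) != 0 && errno != EEXIST) {
                int err = errno;
                fprintf(stderr, "mkdir(%s) failed: %s\n", buf, strerror(err));
                errno = err;
                return -1;
            }
            *p = saved;
            if (saved == '\0')
                break;
        }
    }
    return 0;
}
```

With /tmp lacking write permission for the calling user, the first mkdir() below /tmp in a routine like this would typically fail with EACCES ("Permission denied"), which points at the cause more directly than a bare -1.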

Three further points:

- Is there some reason ORTE can't bail out gracefully upon this error,
  instead of hanging like it was doing for me?

- I think leaving in the extra debug logging code you sent me in the
  patch for future Open MPI versions would be a good idea to help
  troubleshoot problems like this.

- It would be nice to see "--debug-daemons" added to the Troubleshooting
  section of the FAQ on the web site.

Thank you very, very much for your help, Ralph and everyone else who replied.

       