
Open MPI User's Mailing List Archives


From: Ralph Castain (rhc_at_[hidden])
Date: 2007-03-18 17:28:18


Excellent! Yes, we use pipe() in several places, including in the run-time
during various stages of launch, so that could be a problem.
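For reference, signal 13 is SIGPIPE, which the kernel delivers when a process
writes to a pipe whose read end has already been closed. Here is a minimal
sketch that reproduces the condition - plain Python for illustration only, not
Open MPI's actual launch code:

```python
import errno
import os
import signal

# Create a pipe, then close the read end so no reader remains.
r, w = os.pipe()
os.close(r)

# By default, writing to such a pipe kills the process with signal 13
# (SIGPIPE); ignoring the signal turns that death into a plain EPIPE error.
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

try:
    os.write(w, b"hello")
except OSError as e:
    print("write failed with", errno.errorcode[e.errno])  # prints EPIPE
```

So if anything in the launch path closes our pipe file descriptors
unexpectedly - which an unusual authentication setup could conceivably
trigger - this is exactly the signal you would see.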

Also, be aware that other users have reported problems on LDAP-based systems
when attempting to launch large jobs. The problem is that the OpenMPI launch
system has no rate control in it - the LDAP slapd servers get overwhelmed
when we ssh to a large number of nodes at once.
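Whatever form the eventual fix takes, the idea is throttling: start the ssh
sessions in bounded batches with a pause between batches, instead of all at
once. A rough sketch of that idea (a hypothetical helper, not the actual fix;
"echo" stands in for "ssh" so the snippet is runnable anywhere):

```python
import subprocess
import time

def launch_batched(nodes, batch_size=8, delay=1.0, cmd=("echo",)):
    """Start one remote daemon per node, at most batch_size at a time,
    pausing between batches so an LDAP slapd is not hit all at once.
    In real use cmd would be something like ("ssh",); "echo" here is a
    harmless stand-in that just prints each node name."""
    for i in range(0, len(nodes), batch_size):
        # Launch one batch concurrently...
        batch = [subprocess.Popen(list(cmd) + [n])
                 for n in nodes[i:i + batch_size]]
        # ...then wait for it to finish before starting the next.
        for p in batch:
            p.wait()
        if i + batch_size < len(nodes):
            time.sleep(delay)

launch_batched(["node%02d" % k for k in range(4)], batch_size=2, delay=0.1)
```

Waiting for each batch before starting the next caps the number of
simultaneous authentication lookups at batch_size.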

I promised another user I would concoct a fix for this problem, but I am
taking a break from the project for a few months, so it may be a little while
before a fix is available. When I do get it done, it may or may not make it
into an OpenMPI release for some time - I'm not sure how they will decide to
schedule the change (is it a "bug", or a new "feature"?). So I may do an
interim release as a patch on the OpenRTE site (since that is the run-time
underneath OpenMPI). I'll let people know via this mailing list either way.

Ralph

On 3/18/07 2:06 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:

> I just received an email from a friend who is helping me work on
> resolving this; he was able to trace the problem back to a pipe() call
> in OpenMPI apparently:
>
>> The problem is with the pipe() system call (which is invoked by
>> MPI_Send() as far as I can tell) by an LDAP-authenticated user. I'm still
>> working out where exactly that goes wrong, but the fact is that it isn't
>> actually a permissions problem - the reason it works as root is because
>> root is a local user and uses normal /etc/passwd authentication.
>
> I had forgotten to mention that we use LDAP for authentication on this
> machine; PAM and NSS are set up to use it, but I'm guessing that
> either OpenMPI itself or the pipe() system call won't check with them
> when needed... We have made some local users on the machine to get
> things going, but I'll probably have to find an LDAP mailing list to
> get this issue resolved.
>
> Thanks for all the help so far!
>
> On 3/16/07, Ralph Castain <rhc_at_[hidden]> wrote:
>> I'm afraid I have zero knowledge or experience with gentoo portage, so I
>> can't help you there. I always install our releases from the tarball source
>> as it is pretty trivial to do and avoids any issues.
>>
>> I will have to defer to someone who knows that system to help you from here.
>> It sounds like an installation or configuration issue.
>>
>> Ralph
>>
>>
>>
>> On 3/16/07 3:15 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>
>>> On 3/15/07, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> Hmmm...well, a few thoughts to hopefully help with the debugging. One
>>>> initial comment, though - 1.1.2 is quite old. You might want to upgrade to
>>>> 1.2 (releasing momentarily - you can use the last release candidate in the
>>>> interim as it is identical).
>>>
>>> Version 1.2 doesn't seem to be in gentoo portage yet, so I may end up
>>> having to compile from source... I generally prefer to do everything
>>> from portage if possible, because it makes upgrades and maintenance
>>> much cleaner.
>>>
>>>> Meantime, looking at this output, there appear to be a couple of common
>>>> possibilities. First, I don't see any of the diagnostic output from after
>>>> we do a local fork (we do this prior to actually launching the daemon).
>>>> Is it possible your system doesn't allow you to fork processes (some
>>>> don't, though it's unusual)?
>>>
>>> I don't see any problems with forking on this system... I'm able to
>>> start a dbus daemon as a regular user without any problems.
>>>
>>>> Second, it could be that the "orted" program isn't being found in your
>>>> path. People often forget that the path in shells started up by programs
>>>> isn't necessarily the same as that in their login shell. You might try
>>>> executing a simple shellscript that outputs the results of "which orted"
>>>> to verify this is correct.
>>>
>>> 'which orted' from a shell script gives me '/usr/bin/orted', which
>>> seems to be correct.
>>>
>>>> BTW, I should have asked as well: what are you running this on, and how did
>>>> you configure openmpi?
>>>
>>> I'm running this on two identical machines, each with two dual-core
>>> hyperthreading Xeon (EM64T) processors. I simply installed OpenMPI
>>> using portage, with the USE flags "debug fortran pbs -threads". (I've
>>> also tried it with "-debug fortran pbs threads".)
>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 3/15/07 5:33 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>
>>>>> I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
>>>>> so I think it has the right permissions... I tried doing 'equery f
>>>>> openmpi | xargs ls -dl' and inspecting the permissions of each file,
>>>>> and I don't see much out of the ordinary; it is all owned by
>>>>> root:root, but every file has read permission for user, group, and
>>>>> other (and execute for each as well where appropriate). From the debug
>>>>> output, I can tell that mpirun is creating the session tree in /tmp,
>>>>> and it does seem to be working fine... Here's the output when using
>>>>> --debug-daemons:
>>>>>
>>>>> $ mpirun -aborted 8 -v -d --debug-daemons -np 8
>>>>> /workspace/bronke/mpi/hello
>>>>> [trixie:25228] [0,0,0] setting up session dir with
>>>>> [trixie:25228] universe default-universe
>>>>> [trixie:25228] user bronke
>>>>> [trixie:25228] host trixie
>>>>> [trixie:25228] jobid 0
>>>>> [trixie:25228] procid 0
>>>>> [trixie:25228] procdir:
>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0/0
>>>>> [trixie:25228] jobdir:
>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0
>>>>> [trixie:25228] unidir:
>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe
>>>>> [trixie:25228] top: openmpi-sessions-bronke_at_trixie_0
>>>>> [trixie:25228] tmp: /tmp
>>>>> [trixie:25228] [0,0,0] contact_file
>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/universe-setup.txt
>>>>> [trixie:25228] [0,0,0] wrote setup file
>>>>> [trixie:25228] pls:rsh: local csh: 0, local bash: 1
>>>>> [trixie:25228] pls:rsh: assuming same remote shell as local shell
>>>>> [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
>>>>> [trixie:25228] pls:rsh: final template argv:
>>>>> [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
>>>>> --debug-daemons --bootproxy 1 --name <template> --num_procs 2
>>>>> --vpid_start 0 --nodename <template> --universe
>>>>> bronke_at_trixie:default-universe --nsreplica
>>>>> "0.0.0;tcp://141.238.31.33:43838" --gprreplica
>>>>> "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 2 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 3 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 4 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 5 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 6 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> [trixie:25228] ERROR: A daemon on node localhost failed to start as
>>>>> expected.
>>>>> [trixie:25228] ERROR: There may be more information available from
>>>>> [trixie:25228] ERROR: the remote shell (see above).
>>>>> [trixie:25228] The daemon received a signal 13.
>>>>> 1 additional process aborted (not shown)
>>>>> [trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found job session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found top session dir empty - deleting
>>>>>
>>>>> On 3/15/07, Ralph H Castain <rhc_at_[hidden]> wrote:
>>>>>> It isn't a /dev issue. The problem is likely that the system lacks
>>>>>> sufficient permissions to either:
>>>>>>
>>>>>> 1. create the Open MPI session directory tree. We create a hierarchy of
>>>>>> subdirectories for temporary storage used for things like your shared
>>>>>> memory file - the location of the head of that tree can be specified at
>>>>>> run time, but it has a series of built-in defaults it can search if you
>>>>>> don't specify it (we look at your environment variables - e.g., TMP or
>>>>>> TMPDIR - as well as the typical Linux/Unix places). You might check to
>>>>>> see what your tmp directory is, and that you have write permission into
>>>>>> it. Alternatively, you can specify your own location (where you know you
>>>>>> have permissions!) by setting --tmpdir your-dir on the mpirun command
>>>>>> line.
>>>>>>
>>>>>> 2. execute or access the various binaries and/or libraries. This is
>>>>>> usually caused when someone installs OpenMPI as root, and then tries to
>>>>>> execute as a non-root user. Best thing here is to either run through the
>>>>>> installation directory and add the correct permissions (assuming it is a
>>>>>> system-level install), or reinstall as the non-root user (if the install
>>>>>> is solely for you anyway).
>>>>>>
>>>>>> You can also set --debug-daemons on the mpirun command line to get more
>>>>>> diagnostic output from the daemons and then send that along.
>>>>>>
>>>>>> BTW: if possible, it helps us to advise you if we know which version of
>>>>>> OpenMPI you are using. ;-)
>>>>>>
>>>>>> Hope that helps.
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/15/07 1:51 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>>>
>>>>>>> Ok, now that I've figured out what the signal means, I'm wondering
>>>>>>> exactly what is running into permission problems... the program I'm
>>>>>>> running doesn't use any functions except printf, sprintf, and MPI_*...
>>>>>>> I was thinking that possibly changes to permissions on certain /dev
>>>>>>> entries in newer distros might cause this, but I'm not even sure what
>>>>>>> /dev entries would be used by MPI.
>>>>>>>
>>>>>>> On 3/15/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>>>>>>>> Hi,
>>>>>>>>         If the perror command is available on your system, it will
>>>>>>>> tell you the message associated with the signal value. On my system
>>>>>>>> (RHEL4U3), it is "permission denied".
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>>
>>>>>>>> mac mccalla
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>>>>>>>> Behalf Of David Bronke
>>>>>>>> Sent: Thursday, March 15, 2007 12:25 PM
>>>>>>>> To: users_at_[hidden]
>>>>>>>> Subject: [OMPI users] Signal 13
>>>>>>>>
>>>>>>>> I've been trying to get OpenMPI working on two of the computers at a
>>>>>>>> lab I help administer, and I'm running into a rather large issue.
>>>>>>>> When running anything using mpirun as a normal user, I get the
>>>>>>>> following output:
>>>>>>>>
>>>>>>>>
>>>>>>>> $ mpirun --no-daemonize --host
>>>>>>>>
>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on
>>>>>>>> signal 13.
>>>>>>>> [trixie:18104] ERROR: A daemon on node localhost failed to start as
>>>>>>>> expected.
>>>>>>>> [trixie:18104] ERROR: There may be more information available from
>>>>>>>> [trixie:18104] ERROR: the remote shell (see above).
>>>>>>>> [trixie:18104] The daemon received a signal 13.
>>>>>>>> 8 additional processes aborted (not shown)
>>>>>>>>
>>>>>>>>
>>>>>>>> However, running the same exact command line as root works fine:
>>>>>>>>
>>>>>>>>
>>>>>>>> $ sudo mpirun --no-daemonize --host
>>>>>>>>
>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>> Password:
>>>>>>>> p is 8, my_rank is 0
>>>>>>>> p is 8, my_rank is 1
>>>>>>>> p is 8, my_rank is 2
>>>>>>>> p is 8, my_rank is 3
>>>>>>>> p is 8, my_rank is 6
>>>>>>>> p is 8, my_rank is 7
>>>>>>>> Greetings from process 1!
>>>>>>>>
>>>>>>>> Greetings from process 2!
>>>>>>>>
>>>>>>>> Greetings from process 3!
>>>>>>>>
>>>>>>>> p is 8, my_rank is 5
>>>>>>>> p is 8, my_rank is 4
>>>>>>>> Greetings from process 4!
>>>>>>>>
>>>>>>>> Greetings from process 5!
>>>>>>>>
>>>>>>>> Greetings from process 6!
>>>>>>>>
>>>>>>>> Greetings from process 7!
>>>>>>>>
>>>>>>>>
>>>>>>>> I've looked up signal 13, and have found that it is apparently SIGPIPE;
>>>>>>>> I also found a thread on the LAM-MPI site:
>>>>>>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
>>>>>>>> However, this thread seems to indicate that the problem would be in
>>>>>>>> the application (/workspace/bronke/mpi/hello in this case), but there
>>>>>>>> are no pipes in use in this app, and the fact that it works as
>>>>>>>> expected as root doesn't seem to fit either. I have tried running
>>>>>>>> mpirun with --verbose and it doesn't show any more output than
>>>>>>>> without it, so I've run into a sort of dead end on this issue. Does
>>>>>>>> anyone know of any way I can figure out what's going wrong or how I
>>>>>>>> can fix it?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> --
>>>>>>>> David H. Bronke
>>>>>>>> Lead Programmer
>>>>>>>> G33X Nexus Entertainment
>>>>>>>> http://games.g33xnexus.com/precursors/
>>>>>>>>
>>>>>>>>
>>>>>>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
>>>>>>>> hackerkey.com
>>>>>>>> Support Web Standards! http://www.webstandards.org/
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>