Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-03-20 20:54:28


FWIW, most LDAP installations I have seen have ended up doing the
same thing -- if you have a large enough cluster, you have MPI jobs
starting all the time, and rate control of a single job startup is
not sufficient to avoid overloading your LDAP server.

The solutions that I have seen typically have a job fired once a day
via cron that dumps the relevant information from LDAP into local
/etc/passwd, /etc/shadow, and /etc/group files; the nodes then simply
use those local files for authentication across the cluster.
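
The dump step itself doesn't need to be anything fancy. A minimal,
untested sketch of it in C -- going through NSS with getpwent(), which
picks up the LDAP entries via nsswitch, rather than querying LDAP
directly -- would be something like:

/* dump_passwd.c - sketch of the "dump" step: walk the NSS passwd
 * database (which includes LDAP-backed entries when nsswitch.conf
 * lists ldap) and print the entries in /etc/passwd format.  A cron
 * job can redirect this into a staging file and merge it into the
 * nodes' local files however you normally distribute config. */
#include <stdio.h>
#include <pwd.h>

int main(void)
{
    struct passwd *pw;

    setpwent();
    while ((pw = getpwent()) != NULL)
        printf("%s:%s:%u:%u:%s:%s:%s\n",
               pw->pw_name, pw->pw_passwd,
               (unsigned) pw->pw_uid, (unsigned) pw->pw_gid,
               pw->pw_gecos, pw->pw_dir, pw->pw_shell);
    endpwent();
    return 0;
}

Group and shadow entries can be dumped the same way with getgrent()
and getspent().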

Hope that helps.

On Mar 18, 2007, at 8:34 PM, David Bronke wrote:

> That's great to hear! For now we'll just create local users for those
> who need access to MPI on this system, but I'll keep an eye on the
> list for when you do get a chance to finish that fix. Thanks again!
>
> On 3/18/07, Ralph Castain <rhc_at_[hidden]> wrote:
>> Excellent! Yes, we use pipe in several places, including in the
>> run-time
>> during various stages of launch, so that could be a problem.
>>
>> Also, be aware that other users have reported problems on LDAP-
>> based systems
>> when attempting to launch large jobs. The problem is that the
>> OpenMPI launch
>> system has no rate control in it - and the LDAP slapd servers get
>> overwhelmed by the launch when we ssh to a large number of nodes.
>>
>> I promised another user to concoct a fix for this problem, but am
>> taking a
>> break from the project for a few months so it may be a little
>> while before a
>> fix is available. When I do get it done, it may or may not make it
>> into an
>> OpenMPI release for some time - I'm not sure how they will decide to
>> schedule the change (is it a "bug", or a new "feature"?). So I may
>> do an
>> interim release as a patch on the OpenRTE site (since that is the
>> run-time
>> underneath OpenMPI). I'll let people know via this mailing list
>> either way.
>>
>> Ralph
>>
>>
>>
>> On 3/18/07 2:06 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>
>>> I just received an email from a friend who is helping me work on
>>> resolving this; he was able to trace the problem back to a pipe()
>>> call
>>> in OpenMPI apparently:
>>>
>>>> The problem is with the pipe() system call (which is invoked by
>>>> MPI_Send() as far as I can tell) when called by an LDAP-authenticated
>>>> user. Still working out where exactly that goes wrong, but the fact
>>>> is that it isn't actually a permissions problem - the reason it works
>>>> as root is because root is a local user and uses normal /etc/passwd
>>>> authentication.
>>>
>>> I had forgotten to mention that we use LDAP for authentication on
>>> this
>>> machine; PAM and NSS are set up to use it, but I'm guessing that
>>> either OpenMPI itself or the pipe() system call won't check with
>>> them
>>> when needed... We have made some local users on the machine to get
>>> things going, but I'll probably have to find an LDAP mailing list to
>>> get this issue resolved.
>>>
>>> Thanks for all the help so far!
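
A quick way to separate the two possibilities above -- pipe() itself
failing for an LDAP-authenticated user vs. something later writing to a
pipe whose reader has gone away (which is what signal 13 / SIGPIPE
normally indicates) -- is a tiny standalone test run as that user. A
rough, untested sketch, nothing Open MPI-specific:

/* pipe_test.c - does pipe() work at all for this user, and what does
 * a write with no reader look like?  SIGPIPE is ignored so the write
 * reports EPIPE instead of killing the process with signal 13. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) < 0) {
        fprintf(stderr, "pipe() failed: %s\n", strerror(errno));
        return 1;
    }
    printf("pipe() succeeded (fds %d and %d)\n", fds[0], fds[1]);

    signal(SIGPIPE, SIG_IGN);   /* get EPIPE back instead of dying */
    close(fds[0]);              /* close the read end: no reader left */
    if (write(fds[1], "x", 1) < 0)
        printf("write with no reader: %s\n", strerror(errno));

    close(fds[1]);
    return 0;
}

If pipe() itself really does fail only for LDAP users, that would point
at something outside Open MPI -- e.g., a per-user resource limit applied
via PAM -- rather than a permissions problem.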
>>>
>>> On 3/16/07, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> I'm afraid I have zero knowledge or experience with gentoo
>>>> portage, so I
>>>> can't help you there. I always install our releases from the
>>>> tarball source
>>>> as it is pretty trivial to do and avoids any issues.
>>>>
>>>> I will have to defer to someone who knows that system to help
>>>> you from here.
>>>> It sounds like an installation or configuration issue.
>>>>
>>>> Ralph
>>>>
>>>>
>>>>
>>>> On 3/16/07 3:15 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>
>>>>> On 3/15/07, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>> Hmmm...well, a few thoughts to hopefully help with the
>>>>>> debugging. One
>>>>>> initial comment, though - 1.1.2 is quite old. You might want
>>>>>> to upgrade to
>>>>>> 1.2 (releasing momentarily - you can use the last release
>>>>>> candidate in the
>>>>>> interim as it is identical).
>>>>>
>>>>> Version 1.2 doesn't seem to be in gentoo portage yet, so I may
>>>>> end up
>>>>> having to compile from source... I generally prefer to do
>>>>> everything
>>>>> from portage if possible, because it makes upgrades and
>>>>> maintenance
>>>>> much cleaner.
>>>>>
>>>>>> Meantime, looking at this output, there appear to be a couple
>>>>>> of common
>>>>>> possibilities. First, I don't see any of the diagnostic output
>>>>>> from after
>>>>>> we
>>>>>> do a local fork (we do this prior to actually launching the
>>>>>> daemon). Is it
>>>>>> possible your system doesn't allow you to fork processes (some
>>>>>> don't,
>>>>>> though
>>>>>> it's unusual)?
>>>>>
>>>>> I don't see any problems with forking on this system... I'm
>>>>> able to
>>>>> start a dbus daemon as a regular user without any problems.
>>>>>
>>>>>> Second, it could be that the "orted" program isn't being found
>>>>>> in your
>>>>>> path.
>>>>>> People often forget that the path in shells started up by
>>>>>> programs isn't
>>>>>> necessarily the same as that in their login shell. You might
>>>>>> try executing
>>>>>> a
>>>>>> simple shell script that outputs the results of "which orted"
>>>>>> to verify this
>>>>>> is correct.
>>>>>
>>>>> 'which orted' from a shell script gives me '/usr/bin/orted', which
>>>>> seems to be correct.
>>>>>
>>>>>> BTW, I should have asked as well: what are you running this
>>>>>> on, and how did
>>>>>> you configure openmpi?
>>>>>
>>>>> I'm running this on two identical machines, each with two dual-core
>>>>> hyperthreading Xeon (EM64T) processors. I simply installed OpenMPI
>>>>> using portage, with the USE flags "debug fortran pbs -threads". (I've
>>>>> also tried it with "-debug fortran pbs threads".)
>>>>>
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/15/07 5:33 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>>>
>>>>>>> I'm using OpenMPI version 1.1.2. I installed it using gentoo
>>>>>>> portage,
>>>>>>> so I think it has the right permissions... I tried doing
>>>>>>> 'equery f
>>>>>>> openmpi | xargs ls -dl' and inspecting the permissions of
>>>>>>> each file,
>>>>>>> and I don't see much out of the ordinary; it is all owned by
>>>>>>> root:root, but every file has read permission for user,
>>>>>>> group, and
>>>>>>> other. (and execute for each as well when appropriate) From
>>>>>>> the debug
>>>>>>> output, I can tell that mpirun is creating the session tree
>>>>>>> in /tmp,
>>>>>>> and it does seem to be working fine... Here's the output when
>>>>>>> using
>>>>>>> --debug-daemons:
>>>>>>>
>>>>>>> $ mpirun -aborted 8 -v -d --debug-daemons -np 8
>>>>>>> /workspace/bronke/mpi/hello
>>>>>>> [trixie:25228] [0,0,0] setting up session dir with
>>>>>>> [trixie:25228] universe default-universe
>>>>>>> [trixie:25228] user bronke
>>>>>>> [trixie:25228] host trixie
>>>>>>> [trixie:25228] jobid 0
>>>>>>> [trixie:25228] procid 0
>>>>>>> [trixie:25228] procdir:
>>>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0/0
>>>>>>> [trixie:25228] jobdir:
>>>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0
>>>>>>> [trixie:25228] unidir:
>>>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe
>>>>>>> [trixie:25228] top: openmpi-sessions-bronke_at_trixie_0
>>>>>>> [trixie:25228] tmp: /tmp
>>>>>>> [trixie:25228] [0,0,0] contact_file
>>>>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/
>>>>>>> universe-setup.txt
>>>>>>> [trixie:25228] [0,0,0] wrote setup file
>>>>>>> [trixie:25228] pls:rsh: local csh: 0, local bash: 1
>>>>>>> [trixie:25228] pls:rsh: assuming same remote shell as local
>>>>>>> shell
>>>>>>> [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
>>>>>>> [trixie:25228] pls:rsh: final template argv:
>>>>>>> [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
>>>>>>> --debug-daemons --bootproxy 1 --name <template> --num_procs 2
>>>>>>> --vpid_start 0 --nodename <template> --universe
>>>>>>> bronke_at_trixie:default-universe --nsreplica
>>>>>>> "0.0.0;tcp://141.238.31.33:43838" --gprreplica
>>>>>>> "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state
>>>>>>> = 0x100)
>>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty
>>>>>>> - leaving
>>>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state
>>>>>>> = 0x80)
>>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 1 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 2 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 3 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 4 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 5 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> mpirun noticed that job rank 6 with PID 0 on node "localhost"
>>>>>>> exited
>>>>>>> on signal 13.
>>>>>>> [trixie:25228] ERROR: A daemon on node localhost failed to
>>>>>>> start as
>>>>>>> expected.
>>>>>>> [trixie:25228] ERROR: There may be more information available
>>>>>>> from
>>>>>>> [trixie:25228] ERROR: the remote shell (see above).
>>>>>>> [trixie:25228] The daemon received a signal 13.
>>>>>>> 1 additional process aborted (not shown)
>>>>>>> [trixie:25228] sess_dir_finalize: found proc session dir
>>>>>>> empty - deleting
>>>>>>> [trixie:25228] sess_dir_finalize: found job session dir empty
>>>>>>> - deleting
>>>>>>> [trixie:25228] sess_dir_finalize: found univ session dir
>>>>>>> empty - deleting
>>>>>>> [trixie:25228] sess_dir_finalize: found top session dir empty
>>>>>>> - deleting
>>>>>>>
>>>>>>> On 3/15/07, Ralph H Castain <rhc_at_[hidden]> wrote:
>>>>>>>> It isn't a /dev issue. The problem is likely that the system
>>>>>>>> lacks
>>>>>>>> sufficient permissions to either:
>>>>>>>>
>>>>>>>> 1. create the Open MPI session directory tree. We create a
>>>>>>>> hierarchy of
>>>>>>>> subdirectories for temporary storage used for things like
>>>>>>>> your shared
>>>>>>>> memory
>>>>>>>> file - the location of the head of that tree can be
>>>>>>>> specified at run
>>>>>>>> time,
>>>>>>>> but has a series of built-in defaults it can search if you
>>>>>>>> don't specify
>>>>>>>> it
>>>>>>>> (we look at your environment variables - e.g., TMP or
>>>>>>>> TMPDIR - as well
>>>>>>>> as
>>>>>>>> the typical Linux/Unix places). You might check to see what
>>>>>>>> your tmp
>>>>>>>> directory is, and that you have write permission into it.
>>>>>>>> Alternatively,
>>>>>>>> you
>>>>>>>> can specify your own location (where you know you have
>>>>>>>> permissions!) by
>>>>>>>> setting --tmpdir your-dir on the mpirun command line.
>>>>>>>>
>>>>>>>> 2. execute or access the various binaries and/or libraries.
>>>>>>>> This is
>>>>>>>> usually
>>>>>>>> caused when someone installs OpenMPI as root, and then tries
>>>>>>>> to execute
>>>>>>>> as
>>>>>>>> a
>>>>>>>> non-root user. Best thing here is to either run through the
>>>>>>>> installation
>>>>>>>> directory and add the correct permissions (assuming it is a
>>>>>>>> system-level
>>>>>>>> install), or reinstall as the non-root user (if the install
>>>>>>>> is solely for
>>>>>>>> you anyway).
>>>>>>>>
>>>>>>>> You can also set --debug-daemons on the mpirun command line
>>>>>>>> to get more
>>>>>>>> diagnostic output from the daemons and then send that along.
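
For example (the temp directory below is just a placeholder for
somewhere you know you can write to):

  mpirun --tmpdir /some/dir/you/own --debug-daemons -np 8 /workspace/bronke/mpi/hello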
>>>>>>>>
>>>>>>>> BTW: if possible, it helps us to advise you if we know which
>>>>>>>> version of
>>>>>>>> OpenMPI you are using. ;-)
>>>>>>>>
>>>>>>>> Hope that helps.
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/15/07 1:51 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Ok, now that I've figured out what the signal means, I'm
>>>>>>>>> wondering
>>>>>>>>> exactly what is running into permission problems... the
>>>>>>>>> program I'm
>>>>>>>>> running doesn't use any functions except printf, sprintf,
>>>>>>>>> and MPI_*...
>>>>>>>>> I was thinking that possibly changes to permissions on
>>>>>>>>> certain /dev
>>>>>>>>> entries in newer distros might cause this, but I'm not even
>>>>>>>>> sure what
>>>>>>>>> /dev entries would be used by MPI.
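
The kind of program being described here -- just printf/sprintf and
MPI_* calls -- is essentially the classic MPI hello world. A sketch of
what such a program typically looks like (not the actual
/workspace/bronke/mpi/hello, which isn't shown in this thread):

/* hello.c - sketch of a minimal MPI hello world: rank 0 collects a
 * greeting from every other rank.  Only printf/sprintf and MPI_*
 * calls, no pipes or other I/O in the application itself. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int my_rank, p, src;
    char greeting[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    printf("p is %d, my_rank is %d\n", p, my_rank);

    if (my_rank != 0) {
        /* every rank except 0 sends a greeting to rank 0 */
        sprintf(greeting, "Greetings from process %d!", my_rank);
        MPI_Send(greeting, (int) strlen(greeting) + 1, MPI_CHAR,
                 0, 0, MPI_COMM_WORLD);
    } else {
        /* rank 0 receives and prints the greetings in rank order */
        for (src = 1; src < p; src++) {
            MPI_Recv(greeting, (int) sizeof(greeting), MPI_CHAR,
                     src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}

Any pipes involved in the failure come from the MPI library and its
run-time during startup (as noted elsewhere in the thread), not from
the application code itself.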
>>>>>>>>>
>>>>>>>>> On 3/15/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>      If the perror command is available on your system, it will
>>>>>>>>>> tell you the message associated with the signal value. On my
>>>>>>>>>> system (RHEL4U3), it is "permission denied".
>>>>>>>>>>
>>>>>>>>>> HTH,
>>>>>>>>>>
>>>>>>>>>> mac mccalla
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: users-bounces_at_[hidden] [mailto:users-
>>>>>>>>>> bounces_at_[hidden]] On
>>>>>>>>>> Behalf Of David Bronke
>>>>>>>>>> Sent: Thursday, March 15, 2007 12:25 PM
>>>>>>>>>> To: users_at_[hidden]
>>>>>>>>>> Subject: [OMPI users] Signal 13
>>>>>>>>>>
>>>>>>>>>> I've been trying to get OpenMPI working on two of the
>>>>>>>>>> computers at a
>>>>>>>>>> lab
>>>>>>>>>> I help administer, and I'm running into a rather large
>>>>>>>>>> issue. When
>>>>>>>>>> running anything using mpirun as a normal user, I get the
>>>>>>>>>> following
>>>>>>>>>> output:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> $ mpirun --no-daemonize --host
>>>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node
>>>>>>>>>> "localhost" exited on
>>>>>>>>>> signal 13.
>>>>>>>>>> [trixie:18104] ERROR: A daemon on node localhost failed to
>>>>>>>>>> start as
>>>>>>>>>> expected.
>>>>>>>>>> [trixie:18104] ERROR: There may be more information
>>>>>>>>>> available from
>>>>>>>>>> [trixie:18104] ERROR: the remote shell (see above).
>>>>>>>>>> [trixie:18104] The daemon received a signal 13.
>>>>>>>>>> 8 additional processes aborted (not shown)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However, running the same exact command line as root works
>>>>>>>>>> fine:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> $ sudo mpirun --no-daemonize --host
>>>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>>>> Password:
>>>>>>>>>> p is 8, my_rank is 0
>>>>>>>>>> p is 8, my_rank is 1
>>>>>>>>>> p is 8, my_rank is 2
>>>>>>>>>> p is 8, my_rank is 3
>>>>>>>>>> p is 8, my_rank is 6
>>>>>>>>>> p is 8, my_rank is 7
>>>>>>>>>> Greetings from process 1!
>>>>>>>>>>
>>>>>>>>>> Greetings from process 2!
>>>>>>>>>>
>>>>>>>>>> Greetings from process 3!
>>>>>>>>>>
>>>>>>>>>> p is 8, my_rank is 5
>>>>>>>>>> p is 8, my_rank is 4
>>>>>>>>>> Greetings from process 4!
>>>>>>>>>>
>>>>>>>>>> Greetings from process 5!
>>>>>>>>>>
>>>>>>>>>> Greetings from process 6!
>>>>>>>>>>
>>>>>>>>>> Greetings from process 7!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've looked up signal 13, and have found that it is
>>>>>>>>>> apparently SIGPIPE;
>>>>>>>>>> I also found a thread on the LAM-MPI site:
>>>>>>>>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
>>>>>>>>>> However, this thread seems to indicate that the problem
>>>>>>>>>> would be in the
>>>>>>>>>> application (/workspace/bronke/mpi/hello in this case),
>>>>>>>>>> but there are
>>>>>>>>>> no
>>>>>>>>>> pipes in use in this app, and the fact that it works as
>>>>>>>>>> expected as
>>>>>>>>>> root
>>>>>>>>>> doesn't seem to fit either. I have tried running mpirun
>>>>>>>>>> with --verbose
>>>>>>>>>> and it doesn't show any more output than without it, so
>>>>>>>>>> I've run into a
>>>>>>>>>> sort of dead-end on this issue. Does anyone know of any
>>>>>>>>>> way I can
>>>>>>>>>> figure
>>>>>>>>>> out what's going wrong or how I can fix it?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> --
>>>>>>>>>> David H. Bronke
>>>>>>>>>> Lead Programmer
>>>>>>>>>> G33X Nexus Entertainment
>>>>>>>>>> http://games.g33xnexus.com/precursors/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
>>>>>>>>>> hackerkey.com
>>>>>>>>>> Support Web Standards! http://www.webstandards.org/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>
> --
> David H. Bronke
> Lead Programmer
> G33X Nexus Entertainment
> http://games.g33xnexus.com/precursors/
>
> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8
> +9ORPa22s6MSr7p6
> hackerkey.com
> Support Web Standards! http://www.webstandards.org/

-- 
Jeff Squyres
Cisco Systems