Open MPI User's Mailing List Archives

From: David Bronke (whitelynx_at_[hidden])
Date: 2007-03-18 20:34:16


That's great to hear! For now we'll just create local users for those
who need access to MPI on this system, but I'll keep an eye on the
list for when you do get a chance to finish that fix. Thanks again!

On 3/18/07, Ralph Castain <rhc_at_[hidden]> wrote:
> Excellent! Yes, we use pipe in several places, including in the run-time
> during various stages of launch, so that could be a problem.
>
> Also, be aware that other users have reported problems on LDAP-based systems
> when attempting to launch large jobs. The problem is that the OpenMPI launch
> system has no rate control in it - and the LDAP's slapd servers get
> overwhelmed by the launch when we ssh on a large number of nodes.
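One way to picture the missing "rate control": cap how many remote starts are in flight at once, so the directory servers only ever see a bounded number of simultaneous logins. The following is a minimal Python sketch of that idea, not Open MPI or OpenRTE code. The names launch_throttled, start, and max_in_flight are hypothetical, and start stands in for whatever actually runs ssh on a node.

```python
from concurrent.futures import ThreadPoolExecutor


def launch_throttled(nodes, start, max_in_flight=4):
    """Call start(node) for every node, with at most max_in_flight calls
    running at the same time. Results come back in the order of `nodes`.

    `start` is a hypothetical callable (for example, one that runs ssh);
    it is an assumption of this sketch, not part of any Open MPI API.
    """
    # The pool size is the throttle: no more than max_in_flight worker
    # threads exist, so no more than that many starts overlap.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(start, nodes))


if __name__ == "__main__":
    # Stand-in for a real remote start; it only echoes the node name.
    print(launch_throttled(["node1", "node2", "node3"], lambda n: "started " + n))
```

A fixed concurrency cap is the simplest form of throttling; a real launcher might instead pace starts over time or back off when logins begin to fail.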
>
> I promised another user to concoct a fix for this problem, but am taking a
> break from the project for a few months so it may be a little while before a
> fix is available. When I do get it done, it may or may not make it into an
> OpenMPI release for some time - I'm not sure how they will decide to
> schedule the change (is it a "bug", or a new "feature"?). So I may do an
> interim release as a patch on the OpenRTE site (since that is the run-time
> underneath OpenMPI). I'll let people know via this mailing list either way.
>
> Ralph
>
>
>
> On 3/18/07 2:06 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>
> > I just received an email from a friend who is helping me work on
> > resolving this; he was able to trace the problem back to a pipe() call
> > in OpenMPI apparently:
> >
> >> The problem is with the pipe() system call (which is invoked by
> >> MPI_Send() as far as I can tell) by an LDAP-authenticated user. Still
> >> working out where exactly that goes wrong, but the fact is that it isn't
> >> actually a permissions problem - the reason it works as root is because
> >> root is a local user and does normal /etc/passwd authentication.
> >
> > I had forgotten to mention that we use LDAP for authentication on this
> > machine; PAM and NSS are set up to use it, but I'm guessing that
> > either OpenMPI itself or the pipe() system call won't check with them
> > when needed... We have made some local users on the machine to get
> > things going, but I'll probably have to find an LDAP mailing list to
> > get this issue resolved.
> >
> > Thanks for all the help so far!
> >
> > On 3/16/07, Ralph Castain <rhc_at_[hidden]> wrote:
> >> I'm afraid I have zero knowledge or experience with gentoo portage, so I
> >> can't help you there. I always install our releases from the tarball source
> >> as it is pretty trivial to do and avoids any issues.
> >>
> >> I will have to defer to someone who knows that system to help you from here.
> >> It sounds like an installation or configuration issue.
> >>
> >> Ralph
> >>
> >>
> >>
> >> On 3/16/07 3:15 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
> >>
> >>> On 3/15/07, Ralph Castain <rhc_at_[hidden]> wrote:
> >>>> Hmmm...well, a few thoughts to hopefully help with the debugging. One
> >>>> initial comment, though - 1.1.2 is quite old. You might want to upgrade to
> >>>> 1.2 (releasing momentarily - you can use the last release candidate in the
> >>>> interim as it is identical).
> >>>
> >>> Version 1.2 doesn't seem to be in gentoo portage yet, so I may end up
> >>> having to compile from source... I generally prefer to do everything
> >>> from portage if possible, because it makes upgrades and maintenance
> >>> much cleaner.
> >>>
> >>>> Meantime, looking at this output, there appear to be a couple of common
> >>>> possibilities. First, I don't see any of the diagnostic output from after
> >>>> we do a local fork (we do this prior to actually launching the daemon).
> >>>> Is it possible your system doesn't allow you to fork processes (some
> >>>> don't, though it's unusual)?
> >>>
> >>> I don't see any problems with forking on this system... I'm able to
> >>> start a dbus daemon as a regular user without any problems.
> >>>
> >>>> Second, it could be that the "orted" program isn't being found in your
> >>>> path. People often forget that the path in shells started up by programs
> >>>> isn't necessarily the same as that in their login shell. You might try
> >>>> executing a simple shell script that outputs the results of "which orted"
> >>>> to verify this is correct.
> >>>
> >>> 'which orted' from a shell script gives me '/usr/bin/orted', which
> >>> seems to be correct.
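The PATH check suggested above can also be scripted. This is a small Python sketch, under the assumption that the remote side has a POSIX shell that understands "command -v". The function name locate and its remote parameter are made up for illustration. Without remote it asks a fresh local /bin/sh, which inherits this process's PATH, so the interesting comparison is usually the ssh one.

```python
import shlex
import shutil
import subprocess


def locate(program, remote=None, shell="/bin/sh"):
    """Return (path found by this process, path found by another shell).

    If `remote` is given (a host name, hypothetical here), the lookup is
    run through ssh, which is closer to what a launcher's shell would see.
    """
    here = shutil.which(program)
    # `command -v` is the POSIX way to ask a shell where it finds a program.
    lookup = "command -v " + shlex.quote(program)
    if remote is None:
        argv = [shell, "-c", lookup]
    else:
        # ssh hands the whole string to the remote user's shell.
        argv = ["ssh", remote, lookup]
    result = subprocess.run(argv, capture_output=True, text=True)
    there = result.stdout.strip() or None
    return here, there


if __name__ == "__main__":
    here, there = locate("orted")
    print("this process:", here)
    print("fresh shell :", there)
```

If the two paths differ, or the second one is missing, the shell that starts the daemon is not seeing the same PATH as the login shell.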
> >>>
> >>>> BTW, I should have asked as well: what are you running this on, and how did
> >>>> you configure openmpi?
> >>>
> >>> I'm running this on two identical machines with 2 dual-core
> >>> hyperthreading Xeon (EM64T) processors. I simply installed OpenMPI
> >>> using portage, with the USE flags "debug fortran pbs -threads". (I've
> >>> also tried it with "-debug fortran pbs threads".)
> >>>
> >>>> Ralph
> >>>>
> >>>>
> >>>>
> >>>> On 3/15/07 5:33 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
> >>>>
> >>>>> I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
> >>>>> so I think it has the right permissions... I tried doing 'equery f
> >>>>> openmpi | xargs ls -dl' and inspecting the permissions of each file,
> >>>>> and I don't see much out of the ordinary; it is all owned by
> >>>>> root:root, but every file has read permission for user, group, and
> >>>>> other. (and execute for each as well when appropriate) From the debug
> >>>>> output, I can tell that mpirun is creating the session tree in /tmp,
> >>>>> and it does seem to be working fine... Here's the output when using
> >>>>> --debug-daemons:
> >>>>>
> >>>>> $ mpirun -aborted 8 -v -d --debug-daemons -np 8
> >>>>> /workspace/bronke/mpi/hello
> >>>>> [trixie:25228] [0,0,0] setting up session dir with
> >>>>> [trixie:25228] universe default-universe
> >>>>> [trixie:25228] user bronke
> >>>>> [trixie:25228] host trixie
> >>>>> [trixie:25228] jobid 0
> >>>>> [trixie:25228] procid 0
> >>>>> [trixie:25228] procdir: /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0/0
> >>>>> [trixie:25228] jobdir: /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0
> >>>>> [trixie:25228] unidir: /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe
> >>>>> [trixie:25228] top: openmpi-sessions-bronke_at_trixie_0
> >>>>> [trixie:25228] tmp: /tmp
> >>>>> [trixie:25228] [0,0,0] contact_file
> >>>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/universe-setup.txt
> >>>>> [trixie:25228] [0,0,0] wrote setup file
> >>>>> [trixie:25228] pls:rsh: local csh: 0, local bash: 1
> >>>>> [trixie:25228] pls:rsh: assuming same remote shell as local shell
> >>>>> [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
> >>>>> [trixie:25228] pls:rsh: final template argv:
> >>>>> [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
> >>>>> --debug-daemons --bootproxy 1 --name <template> --num_procs 2
> >>>>> --vpid_start 0 --nodename <template> --universe
> >>>>> bronke_at_trixie:default-universe --nsreplica
> >>>>> "0.0.0;tcp://141.238.31.33:43838" --gprreplica
> >>>>> "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
> >>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> >>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
> >>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 2 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 3 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 4 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 5 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> mpirun noticed that job rank 6 with PID 0 on node "localhost" exited
> >>>>> on signal 13.
> >>>>> [trixie:25228] ERROR: A daemon on node localhost failed to start as
> >>>>> expected.
> >>>>> [trixie:25228] ERROR: There may be more information available from
> >>>>> [trixie:25228] ERROR: the remote shell (see above).
> >>>>> [trixie:25228] The daemon received a signal 13.
> >>>>> 1 additional process aborted (not shown)
> >>>>> [trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
> >>>>> [trixie:25228] sess_dir_finalize: found job session dir empty - deleting
> >>>>> [trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
> >>>>> [trixie:25228] sess_dir_finalize: found top session dir empty - deleting
> >>>>>
> >>>>> On 3/15/07, Ralph H Castain <rhc_at_[hidden]> wrote:
> >>>>>> It isn't a /dev issue. The problem is likely that the system lacks
> >>>>>> sufficient permissions to either:
> >>>>>>
> >>>>>> 1. create the Open MPI session directory tree. We create a hierarchy of
> >>>>>> subdirectories for temporary storage used for things like your shared
> >>>>>> memory file - the location of the head of that tree can be specified at
> >>>>>> run time, but has a series of built-in defaults it can search if you
> >>>>>> don't specify it (we look at your environment variables - e.g., TMP or
> >>>>>> TMPDIR - as well as the typical Linux/Unix places). You might check to
> >>>>>> see what your tmp directory is, and that you have write permission into
> >>>>>> it. Alternatively, you can specify your own location (where you know you
> >>>>>> have permissions!) by setting --tmpdir your-dir on the mpirun command
> >>>>>> line.
> >>>>>>
> >>>>>> 2. execute or access the various binaries and/or libraries. This is
> >>>>>> usually caused when someone installs OpenMPI as root, and then tries to
> >>>>>> execute as a non-root user. Best thing here is to either run through the
> >>>>>> installation directory and add the correct permissions (assuming it is a
> >>>>>> system-level install), or reinstall as the non-root user (if the install
> >>>>>> is solely for you anyway).
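The first check in the list above (which temporary directory gets used, and whether it is writable) is easy to script. This Python sketch assumes a lookup order of TMPDIR, then TMP, then /tmp. That order and the function names are illustrative only and are not taken from the Open MPI source, which may search differently.

```python
import os


def pick_session_root(env=None, fallback="/tmp"):
    """Pick a temporary directory the way a launcher might.

    Assumed order for this sketch: TMPDIR, then TMP, then `fallback`.
    """
    env = os.environ if env is None else env
    for name in ("TMPDIR", "TMP"):
        value = env.get(name)
        if value:
            return value
    return fallback


def check_session_root(path):
    """Report whether `path` exists and whether the current user could
    create entries in it (needs both write and search permission)."""
    return {
        "path": path,
        "exists": os.path.isdir(path),
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    print(check_session_root(pick_session_root()))
```

Run it as the user who gets the failure, not as root, since os.access answers for whoever is running the script.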
> >>>>>>
> >>>>>> You can also set --debug-daemons on the mpirun command line to get more
> >>>>>> diagnostic output from the daemons and then send that along.
> >>>>>>
> >>>>>> BTW: if possible, it helps us to advise you if we know which version of
> >>>>>> OpenMPI you are using. ;-)
> >>>>>>
> >>>>>> Hope that helps.
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 3/15/07 1:51 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
> >>>>>>
> >>>>>>> Ok, now that I've figured out what the signal means, I'm wondering
> >>>>>>> exactly what is running into permission problems... the program I'm
> >>>>>>> running doesn't use any functions except printf, sprintf, and MPI_*...
> >>>>>>> I was thinking that possibly changes to permissions on certain /dev
> >>>>>>> entries in newer distros might cause this, but I'm not even sure what
> >>>>>>> /dev entries would be used by MPI.
> >>>>>>>
> >>>>>>> On 3/15/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> >>>>>>>> Hi,
> >>>>>>>> If the perror command is available on your system it will tell
> >>>>>>>> you what the message is associated with the signal value. On my system
> >>>>>>>> RHEL4U3, it is permission denied.
> >>>>>>>>
> >>>>>>>> HTH,
> >>>>>>>>
> >>>>>>>> mac mccalla
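A caution for later readers: the number 13 lives in two different tables. As a signal number on Linux it is SIGPIPE; as an errno value it is EACCES ("permission denied"). A tool that translates errno values will therefore report a permissions error for 13 even though mpirun was reporting a signal. The Python sketch below looks the number up both ways; the function name describe is made up for this example.

```python
import errno
import os
import signal


def describe(number):
    """Look up `number` both as a signal number and as an errno value."""
    return {
        "signal": signal.Signals(number).name,
        "errno": errno.errorcode[number],
        "errno_text": os.strerror(number),
    }


if __name__ == "__main__":
    info = describe(13)
    print("as a signal :", info["signal"])
    print("as an errno :", info["errno"], "-", info["errno_text"])
```

When mpirun says a process "exited on signal 13", the signal table is the one that applies.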
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> >>>>>>>> Behalf Of David Bronke
> >>>>>>>> Sent: Thursday, March 15, 2007 12:25 PM
> >>>>>>>> To: users_at_[hidden]
> >>>>>>>> Subject: [OMPI users] Signal 13
> >>>>>>>>
> >>>>>>>> I've been trying to get OpenMPI working on two of the computers at a
> >>>>>>>> lab I help administer, and I'm running into a rather large issue. When
> >>>>>>>> running anything using mpirun as a normal user, I get the following
> >>>>>>>> output:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> $ mpirun --no-daemonize --host
> >>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
> >>>>>>>> /workspace/bronke/mpi/hello
> >>>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on
> >>>>>>>> signal 13.
> >>>>>>>> [trixie:18104] ERROR: A daemon on node localhost failed to start as
> >>>>>>>> expected.
> >>>>>>>> [trixie:18104] ERROR: There may be more information available from
> >>>>>>>> [trixie:18104] ERROR: the remote shell (see above).
> >>>>>>>> [trixie:18104] The daemon received a signal 13.
> >>>>>>>> 8 additional processes aborted (not shown)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> However, running the same exact command line as root works fine:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> $ sudo mpirun --no-daemonize --host
> >>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
> >>>>>>>> /workspace/bronke/mpi/hello
> >>>>>>>> Password:
> >>>>>>>> p is 8, my_rank is 0
> >>>>>>>> p is 8, my_rank is 1
> >>>>>>>> p is 8, my_rank is 2
> >>>>>>>> p is 8, my_rank is 3
> >>>>>>>> p is 8, my_rank is 6
> >>>>>>>> p is 8, my_rank is 7
> >>>>>>>> Greetings from process 1!
> >>>>>>>>
> >>>>>>>> Greetings from process 2!
> >>>>>>>>
> >>>>>>>> Greetings from process 3!
> >>>>>>>>
> >>>>>>>> p is 8, my_rank is 5
> >>>>>>>> p is 8, my_rank is 4
> >>>>>>>> Greetings from process 4!
> >>>>>>>>
> >>>>>>>> Greetings from process 5!
> >>>>>>>>
> >>>>>>>> Greetings from process 6!
> >>>>>>>>
> >>>>>>>> Greetings from process 7!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I've looked up signal 13, and have found that it is apparently SIGPIPE;
> >>>>>>>> I also found a thread on the LAM-MPI site:
> >>>>>>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
> >>>>>>>> However, this thread seems to indicate that the problem would be in the
> >>>>>>>> application (/workspace/bronke/mpi/hello in this case), but there are
> >>>>>>>> no pipes in use in this app, and the fact that it works as expected as
> >>>>>>>> root doesn't seem to fit either. I have tried running mpirun with
> >>>>>>>> --verbose and it doesn't show any more output than without it, so I've
> >>>>>>>> run into a sort of dead-end on this issue. Does anyone know of any way
> >>>>>>>> I can figure out what's going wrong or how I can fix it?
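For reference, "exited on signal 13" can be reproduced without MPI at all: a process that writes to a pipe whose read end has been closed receives SIGPIPE, and with the default signal handling it is killed. The Python sketch below does this in a child process and reports the exit status. It only illustrates the mechanism and says nothing about where Open MPI's own pipes were failing; the function name is made up for this example.

```python
import signal
import subprocess
import sys

# Child program: restore the default SIGPIPE behaviour (Python normally
# ignores SIGPIPE), then write to a pipe that has no reader.
CHILD = """
import os, signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
read_end, write_end = os.pipe()
os.close(read_end)
os.write(write_end, b"x")
"""


def exit_status_after_broken_pipe():
    """Run the child and return its exit status as seen by the parent.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    proc = subprocess.run([sys.executable, "-c", CHILD])
    return proc.returncode


if __name__ == "__main__":
    status = exit_status_after_broken_pipe()
    print("child return code:", status)
    if status < 0:
        print("killed by", signal.Signals(-status).name)
```

So an application does not need to call pipe() itself to die this way; any library or launcher code writing to a pipe or socket whose other end has gone away can trigger it.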
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> --
> >>>>>>>> David H. Bronke
> >>>>>>>> Lead Programmer
> >>>>>>>> G33X Nexus Entertainment
> >>>>>>>> http://games.g33xnexus.com/precursors/
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
> >>>>>>>> hackerkey.com
> >>>>>>>> Support Web Standards! http://www.webstandards.org/
> >>>>>>>> _______________________________________________
> >>>>>>>> users mailing list
> >>>>>>>> users_at_[hidden]
> >>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >
>
>
>

-- 
David H. Bronke
Lead Programmer
G33X Nexus Entertainment
http://games.g33xnexus.com/precursors/
v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
hackerkey.com
Support Web Standards! http://www.webstandards.org/