Open MPI User's Mailing List Archives

From: David Bronke (whitelynx_at_[hidden])
Date: 2007-03-16 17:15:21


On 3/15/07, Ralph Castain <rhc_at_[hidden]> wrote:
> Hmmm...well, a few thoughts to hopefully help with the debugging. One
> initial comment, though - 1.1.2 is quite old. You might want to upgrade to
> 1.2 (releasing momentarily - you can use the last release candidate in the
> interim as it is identical).

Version 1.2 doesn't seem to be in Gentoo Portage yet, so I may end up
having to compile it from source... I generally prefer to do everything
through Portage when possible, because it makes upgrades and maintenance
much cleaner.
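
If I do end up building from source, I'd expect it to go roughly like this
(the tarball name and install prefix below are just placeholders, not the
actual 1.2 release candidate filename):

# Build into my home directory so it doesn't fight with the Portage install:
$ tar xzf openmpi-1.2rcX.tar.gz
$ cd openmpi-1.2rcX
$ ./configure --prefix=$HOME/openmpi-1.2
$ make all install
# ...and make sure the new install is picked up ahead of the system one:
$ export PATH=$HOME/openmpi-1.2/bin:$PATH
$ export LD_LIBRARY_PATH=$HOME/openmpi-1.2/lib:$LD_LIBRARY_PATH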

> Meantime, looking at this output, there appear to be a couple of common
> possibilities. First, I don't see any of the diagnostic output from after we
> do a local fork (we do this prior to actually launching the daemon). Is it
> possible your system doesn't allow you to fork processes (some don't, though
> it's unusual)?

I don't see any problems with forking on this system... I'm able to
start a dbus daemon as a regular user without any trouble.
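
Just to rule that out completely, a quick sanity check along these lines,
run as a normal user, would confirm that plain fork() works (each command
below starts a child process):

$ echo "parent pid: $$"
$ ( echo "subshell (fork) works" )
$ sh -c 'echo "child shell (fork + exec) works"'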

> Second, it could be that the "orted" program isn't being found in your path.
> People often forget that the path in shells started up by programs isn't
> necessarily the same as that in their login shell. You might try executing a
> simple shellscript that outputs the results of "which orted" to verify this
> is correct.

Running 'which orted' from a shell script gives me '/usr/bin/orted',
which seems correct.
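
Since the launcher goes through /usr/bin/ssh (see the pls:rsh lines in the
debug output quoted below), it might also be worth checking what a
non-interactive shell sees, along these lines:

# A shell started by ssh is non-interactive and non-login, so its PATH and
# LD_LIBRARY_PATH can differ from those of an interactive shell:
$ ssh localhost 'which orted; echo $PATH; echo $LD_LIBRARY_PATH'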

> BTW, I should have asked as well: what are you running this on, and how did
> you configure openmpi?

I'm running this on two identical machines, each with two dual-core,
Hyper-Threading Xeon (EM64T) processors. I installed Open MPI through
Portage with the USE flags "debug fortran pbs -threads" (I've also tried
"-debug fortran pbs threads").
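
I'm not sure exactly which configure flags the ebuild ends up passing; as
far as I know ompi_info reports the version and build configuration, so
something like this should show it:

# Show the installed version plus the "Configured by/on" build details:
$ ompi_info | grep -i -e "open mpi:" -e configur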

> Ralph
>
>
>
> On 3/15/07 5:33 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>
> > I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
> > so I think it has the right permissions... I tried doing 'equery f
> > openmpi | xargs ls -dl' and inspecting the permissions of each file,
> > and I don't see much out of the ordinary; it is all owned by
> > root:root, but every file has read permission for user, group, and
> > other. (and execute for each as well when appropriate) From the debug
> > output, I can tell that mpirun is creating the session tree in /tmp,
> > and it does seem to be working fine... Here's the output when using
> > --debug-daemons:
> >
> > $ mpirun -aborted 8 -v -d --debug-daemons -np 8 /workspace/bronke/mpi/hello
> > [trixie:25228] [0,0,0] setting up session dir with
> > [trixie:25228] universe default-universe
> > [trixie:25228] user bronke
> > [trixie:25228] host trixie
> > [trixie:25228] jobid 0
> > [trixie:25228] procid 0
> > [trixie:25228] procdir:
> > /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0/0
> > [trixie:25228] jobdir:
> > /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0
> > [trixie:25228] unidir: /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe
> > [trixie:25228] top: openmpi-sessions-bronke_at_trixie_0
> > [trixie:25228] tmp: /tmp
> > [trixie:25228] [0,0,0] contact_file
> > /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/universe-setup.txt
> > [trixie:25228] [0,0,0] wrote setup file
> > [trixie:25228] pls:rsh: local csh: 0, local bash: 1
> > [trixie:25228] pls:rsh: assuming same remote shell as local shell
> > [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
> > [trixie:25228] pls:rsh: final template argv:
> > [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
> > --debug-daemons --bootproxy 1 --name <template> --num_procs 2
> > --vpid_start 0 --nodename <template> --universe
> > bronke_at_trixie:default-universe --nsreplica
> > "0.0.0;tcp://141.238.31.33:43838" --gprreplica
> > "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
> > mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
> > on signal 13.
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
> > [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
> > mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 2 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 3 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 4 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 5 with PID 0 on node "localhost" exited
> > on signal 13.
> > mpirun noticed that job rank 6 with PID 0 on node "localhost" exited
> > on signal 13.
> > [trixie:25228] ERROR: A daemon on node localhost failed to start as expected.
> > [trixie:25228] ERROR: There may be more information available from
> > [trixie:25228] ERROR: the remote shell (see above).
> > [trixie:25228] The daemon received a signal 13.
> > 1 additional process aborted (not shown)
> > [trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
> > [trixie:25228] sess_dir_finalize: found job session dir empty - deleting
> > [trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
> > [trixie:25228] sess_dir_finalize: found top session dir empty - deleting
> >
> > On 3/15/07, Ralph H Castain <rhc_at_[hidden]> wrote:
> >> It isn't a /dev issue. The problem is likely that the system lacks
> >> sufficient permissions to either:
> >>
> >> 1. create the Open MPI session directory tree. We create a hierarchy of
> >> subdirectories for temporary storage used for things like your shared memory
> >> file - the location of the head of that tree can be specified at run time,
> >> but has a series of built-in defaults it can search if you don't specify it
> >> (we look at your environmental variables - e.g., TMP or TMPDIR - as well as
> >> the typical Linux/Unix places). You might check to see what your tmp
> >> directory is, and that you have write permission into it. Alternatively, you
> >> can specify your own location (where you know you have permissions!) by
> >> setting --tmpdir your-dir on the mpirun command line.
> >>
> >> 2. execute or access the various binaries and/or libraries. This is usually
> >> caused when someone installs OpenMPI as root, and then tries to execute as a
> >> non-root user. Best thing here is to either run through the installation
> >> directory and add the correct permissions (assuming it is a system-level
> >> install), or reinstall as the non-root user (if the install is solely for
> >> you anyway).
> >>
> >> You can also set --debug-daemons on the mpirun command line to get more
> >> diagnostic output from the daemons and then send that along.
> >>
> >> BTW: if possible, it helps us to advise you if we know which version of
> >> OpenMPI you are using. ;-)
> >>
> >> Hope that helps.
> >> Ralph
> >>
> >>
> >>
> >>
> >> On 3/15/07 1:51 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
> >>
> >>> Ok, now that I've figured out what the signal means, I'm wondering
> >>> exactly what is running into permission problems... the program I'm
> >>> running doesn't use any functions except printf, sprintf, and MPI_*...
> >>> I was thinking that possibly changes to permissions on certain /dev
> >>> entries in newer distros might cause this, but I'm not even sure what
> >>> /dev entries would be used by MPI.
> >>>
> >>> On 3/15/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
> >>>> Hi,
> >>>>         If the perror command is available on your system, it will tell
> >>>> you what message is associated with the signal value. On my system
> >>>> (RHEL4U3), it is "permission denied".
> >>>>
> >>>> HTH,
> >>>>
> >>>> mac mccalla
> >>>>
> >>>> -----Original Message-----
> >>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> >>>> Behalf Of David Bronke
> >>>> Sent: Thursday, March 15, 2007 12:25 PM
> >>>> To: users_at_[hidden]
> >>>> Subject: [OMPI users] Signal 13
> >>>>
> >>>> I've been trying to get OpenMPI working on two of the computers at a lab
> >>>> I help administer, and I'm running into a rather large issue. When
> >>>> running anything using mpirun as a normal user, I get the following
> >>>> output:
> >>>>
> >>>>
> >>>> $ mpirun --no-daemonize --host
> >>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
> >>>> /workspace/bronke/mpi/hello
> >>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on
> >>>> signal 13.
> >>>> [trixie:18104] ERROR: A daemon on node localhost failed to start as
> >>>> expected.
> >>>> [trixie:18104] ERROR: There may be more information available from
> >>>> [trixie:18104] ERROR: the remote shell (see above).
> >>>> [trixie:18104] The daemon received a signal 13.
> >>>> 8 additional processes aborted (not shown)
> >>>>
> >>>>
> >>>> However, running the same exact command line as root works fine:
> >>>>
> >>>>
> >>>> $ sudo mpirun --no-daemonize --host
> >>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
> >>>> /workspace/bronke/mpi/hello
> >>>> Password:
> >>>> p is 8, my_rank is 0
> >>>> p is 8, my_rank is 1
> >>>> p is 8, my_rank is 2
> >>>> p is 8, my_rank is 3
> >>>> p is 8, my_rank is 6
> >>>> p is 8, my_rank is 7
> >>>> Greetings from process 1!
> >>>>
> >>>> Greetings from process 2!
> >>>>
> >>>> Greetings from process 3!
> >>>>
> >>>> p is 8, my_rank is 5
> >>>> p is 8, my_rank is 4
> >>>> Greetings from process 4!
> >>>>
> >>>> Greetings from process 5!
> >>>>
> >>>> Greetings from process 6!
> >>>>
> >>>> Greetings from process 7!
> >>>>
> >>>>
> >>>> I've looked up signal 13, and have found that it is apparently SIGPIPE;
> >>>> I also found a thread on the LAM-MPI site:
> >>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
> >>>> However, this thread seems to indicate that the problem would be in the
> >>>> application (/workspace/bronke/mpi/hello in this case), but there are no
> >>>> pipes in use in this app, and the fact that it works as expected as root
> >>>> doesn't seem to fit either. I have tried running mpirun with --verbose
> >>>> and it doesn't show any more output than without it, so I've run into a
> >>>> sort of dead-end on this issue. Does anyone know of any way I can figure
> >>>> out what's going wrong or how I can fix it?
> >>>>
> >>>> Thanks!
> >>>> --
> >>>> David H. Bronke
> >>>> Lead Programmer
> >>>> G33X Nexus Entertainment
> >>>> http://games.g33xnexus.com/precursors/
> >>>>
> >>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7
> >>>> p6
> >>>> hackerkey.com
> >>>> Support Web Standards! http://www.webstandards.org/

-- 
David H. Bronke
Lead Programmer
G33X Nexus Entertainment
http://games.g33xnexus.com/precursors/
v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
hackerkey.com
Support Web Standards! http://www.webstandards.org/