Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-03-16 21:35:06


I'm afraid I have zero knowledge or experience with gentoo portage, so I
can't help you there. I always install our releases from the tarball source
as it is pretty trivial to do and avoids any issues.

I will have to defer to someone who knows that system to help you from here.
It sounds like an installation or configuration issue.

Ralph

On 3/16/07 3:15 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:

> On 3/15/07, Ralph Castain <rhc_at_[hidden]> wrote:
>> Hmmm...well, a few thoughts to hopefully help with the debugging. One
>> initial comment, though - 1.1.2 is quite old. You might want to upgrade to
>> 1.2 (releasing momentarily - you can use the last release candidate in the
>> interim as it is identical).
>
> Version 1.2 doesn't seem to be in gentoo portage yet, so I may end up
> having to compile from source... I generally prefer to do everything
> from portage if possible, because it makes upgrades and maintenance
> much cleaner.
>
>> Meantime, looking at this output, there appear to be a couple of common
>> possibilities. First, I don't see any of the diagnostic output from after we
>> do a local fork (we do this prior to actually launching the daemon). Is it
>> possible your system doesn't allow you to fork processes (some don't, though
>> it's unusual)?
>
> I don't see any problems with forking on this system... I'm able to
> start a dbus daemon as a regular user without any problems.
>
>> Second, it could be that the "orted" program isn't being found in your path.
>> People often forget that the path in shells started up by programs isn't
>> necessarily the same as that in their login shell. You might try executing a
>> simple shellscript that outputs the results of "which orted" to verify this
>> is correct.
>
> 'which orted' from a shell script gives me '/usr/bin/orted', which
> seems to be correct.
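
A minimal sketch of the PATH check discussed above, assuming a POSIX shell. The helper name check_cmd is illustrative and not part of Open MPI; orted is the daemon name taken from the debug output in this thread.

```shell
#!/bin/sh
# Report where a command resolves in the PATH of the shell running this script.
# The default command name "orted" comes from the thread; any name can be passed.
check_cmd() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        printf '%s found at %s\n' "$1" "$resolved"
    else
        printf '%s not found in PATH=%s\n' "$1" "$PATH"
        return 1
    fi
}

# Check the first script argument, falling back to orted.
check_cmd "${1:-orted}" || true
```

Saved as a script (check_orted.sh is a hypothetical name) and run through the same launcher mpirun uses, for example "ssh localhost sh /path/to/check_orted.sh", it shows the PATH that a non-login shell actually sees, which may differ from the PATH in a login shell.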
>
>> BTW, I should have asked as well: what are you running this on, and how did
>> you configure openmpi?
>
> I'm running this on two identical machines with 2 dual-core
> hyperthreading Xeon processors. (EM64T) I simply installed OpenMPI
> using portage, with the USE flags "debug fortran pbs -threads". (I've
> also tried it with "-debug fortran pbs threads")
>
>> Ralph
>>
>>
>>
>> On 3/15/07 5:33 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>
>>> I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
>>> so I think it has the right permissions... I tried doing 'equery f
>>> openmpi | xargs ls -dl' and inspecting the permissions of each file,
>>> and I don't see much out of the ordinary; it is all owned by
>>> root:root, but every file has read permission for user, group, and
>>> other. (and execute for each as well when appropriate) From the debug
>>> output, I can tell that mpirun is creating the session tree in /tmp,
>>> and it does seem to be working fine... Here's the output when using
>>> --debug-daemons:
>>>
>>> $ mpirun -aborted 8 -v -d --debug-daemons -np 8 /workspace/bronke/mpi/hello
>>> [trixie:25228] [0,0,0] setting up session dir with
>>> [trixie:25228] universe default-universe
>>> [trixie:25228] user bronke
>>> [trixie:25228] host trixie
>>> [trixie:25228] jobid 0
>>> [trixie:25228] procid 0
>>> [trixie:25228] procdir:
>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0/0
>>> [trixie:25228] jobdir:
>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/0
>>> [trixie:25228] unidir:
>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe
>>> [trixie:25228] top: openmpi-sessions-bronke_at_trixie_0
>>> [trixie:25228] tmp: /tmp
>>> [trixie:25228] [0,0,0] contact_file
>>> /tmp/openmpi-sessions-bronke_at_trixie_0/default-universe/universe-setup.txt
>>> [trixie:25228] [0,0,0] wrote setup file
>>> [trixie:25228] pls:rsh: local csh: 0, local bash: 1
>>> [trixie:25228] pls:rsh: assuming same remote shell as local shell
>>> [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
>>> [trixie:25228] pls:rsh: final template argv:
>>> [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
>>> --debug-daemons --bootproxy 1 --name <template> --num_procs 2
>>> --vpid_start 0 --nodename <template> --universe
>>> bronke_at_trixie:default-universe --nsreplica
>>> "0.0.0;tcp://141.238.31.33:43838" --gprreplica
>>> "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 2 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 3 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 4 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 5 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> mpirun noticed that job rank 6 with PID 0 on node "localhost" exited
>>> on signal 13.
>>> [trixie:25228] ERROR: A daemon on node localhost failed to start as
>>> expected.
>>> [trixie:25228] ERROR: There may be more information available from
>>> [trixie:25228] ERROR: the remote shell (see above).
>>> [trixie:25228] The daemon received a signal 13.
>>> 1 additional process aborted (not shown)
>>> [trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
>>> [trixie:25228] sess_dir_finalize: found job session dir empty - deleting
>>> [trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
>>> [trixie:25228] sess_dir_finalize: found top session dir empty - deleting
>>>
>>> On 3/15/07, Ralph H Castain <rhc_at_[hidden]> wrote:
>>>> It isn't a /dev issue. The problem is likely that the system lacks
>>>> sufficient permissions to either:
>>>>
>>>> 1. create the Open MPI session directory tree. We create a hierarchy of
>>>> subdirectories for temporary storage used for things like your shared
>>>> memory file - the location of the head of that tree can be specified at
>>>> run time, but has a series of built-in defaults it can search if you don't
>>>> specify it (we look at your environmental variables - e.g., TMP or TMPDIR -
>>>> as well as the typical Linux/Unix places). You might check to see what your
>>>> tmp directory is, and that you have write permission into it.
>>>> Alternatively, you can specify your own location (where you know you have
>>>> permissions!) by setting --tmpdir your-dir on the mpirun command line.
>>>>
>>>> 2. execute or access the various binaries and/or libraries. This is usually
>>>> caused when someone installs OpenMPI as root, and then tries to execute as
>>>> a non-root user. Best thing here is to either run through the installation
>>>> directory and add the correct permissions (assuming it is a system-level
>>>> install), or reinstall as the non-root user (if the install is solely for
>>>> you anyway).
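
The first check above (write permission on the temporary directory) can be sketched as follows, assuming a POSIX shell with mktemp available. The helper name can_write_dir and the probe file name are illustrative only; the directory tested is TMPDIR if set, otherwise /tmp.

```shell
#!/bin/sh
# Check whether the current user can create files under a candidate temp directory.
can_write_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        printf '%s: not a directory\n' "$dir"
        return 1
    fi
    # Try to create (and then remove) a uniquely named probe file.
    if probe=$(mktemp "$dir/ompi-probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        printf '%s: writable\n' "$dir"
    else
        printf '%s: cannot create files\n' "$dir"
        return 1
    fi
}

# Test the directory a program would normally pick up from the environment.
can_write_dir "${TMPDIR:-/tmp}" || true
```

If the default location turns out not to be writable, a directory that passes this check is a reasonable candidate for the --tmpdir option mentioned in the message.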
>>>>
>>>> You can also set --debug-daemons on the mpirun command line to get more
>>>> diagnostic output from the daemons and then send that along.
>>>>
>>>> BTW: if possible, it helps us to advise you if we know which version of
>>>> OpenMPI you are using. ;-)
>>>>
>>>> Hope that helps.
>>>> Ralph
>>>>
>>>>
>>>>
>>>>
>>>> On 3/15/07 1:51 PM, "David Bronke" <whitelynx_at_[hidden]> wrote:
>>>>
>>>>> Ok, now that I've figured out what the signal means, I'm wondering
>>>>> exactly what is running into permission problems... the program I'm
>>>>> running doesn't use any functions except printf, sprintf, and MPI_*...
>>>>> I was thinking that possibly changes to permissions on certain /dev
>>>>> entries in newer distros might cause this, but I'm not even sure what
>>>>> /dev entries would be used by MPI.
>>>>>
>>>>> On 3/15/07, McCalla, Mac <macmccalla_at_[hidden]> wrote:
>>>>>> Hi,
>>>>>> If the perror command is available on your system it will tell
>>>>>> you what the message is associated with the signal value. On my system
>>>>>> RHEL4U3, it is permission denied.
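
Signal numbers and errno values are separate namespaces: on Linux, errno 13 is EACCES ("Permission denied"), while signal 13 is SIGPIPE. A small sketch, assuming a POSIX shell, for looking up a signal's name directly (signal_name is an illustrative helper):

```shell
#!/bin/sh
# Print the symbolic name for a signal number using the shell's kill utility.
signal_name() {
    kill -l "$1"
}

signal_name 13   # prints PIPE on Linux (the name is shown without the SIG prefix)
```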
>>>>>>
>>>>>> HTH,
>>>>>>
>>>>>> mac mccalla
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>>>>>> Behalf Of David Bronke
>>>>>> Sent: Thursday, March 15, 2007 12:25 PM
>>>>>> To: users_at_[hidden]
>>>>>> Subject: [OMPI users] Signal 13
>>>>>>
>>>>>> I've been trying to get OpenMPI working on two of the computers at a lab
>>>>>> I help administer, and I'm running into a rather large issue. When
>>>>>> running anything using mpirun as a normal user, I get the following
>>>>>> output:
>>>>>>
>>>>>>
>>>>>> $ mpirun --no-daemonize --host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost /workspace/bronke/mpi/hello
>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on
>>>>>> signal 13.
>>>>>> [trixie:18104] ERROR: A daemon on node localhost failed to start as
>>>>>> expected.
>>>>>> [trixie:18104] ERROR: There may be more information available from
>>>>>> [trixie:18104] ERROR: the remote shell (see above).
>>>>>> [trixie:18104] The daemon received a signal 13.
>>>>>> 8 additional processes aborted (not shown)
>>>>>>
>>>>>>
>>>>>> However, running the same exact command line as root works fine:
>>>>>>
>>>>>>
>>>>>> $ sudo mpirun --no-daemonize --host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost /workspace/bronke/mpi/hello
>>>>>> Password:
>>>>>> p is 8, my_rank is 0
>>>>>> p is 8, my_rank is 1
>>>>>> p is 8, my_rank is 2
>>>>>> p is 8, my_rank is 3
>>>>>> p is 8, my_rank is 6
>>>>>> p is 8, my_rank is 7
>>>>>> Greetings from process 1!
>>>>>>
>>>>>> Greetings from process 2!
>>>>>>
>>>>>> Greetings from process 3!
>>>>>>
>>>>>> p is 8, my_rank is 5
>>>>>> p is 8, my_rank is 4
>>>>>> Greetings from process 4!
>>>>>>
>>>>>> Greetings from process 5!
>>>>>>
>>>>>> Greetings from process 6!
>>>>>>
>>>>>> Greetings from process 7!
>>>>>>
>>>>>>
>>>>>> I've looked up signal 13, and have found that it is apparently SIGPIPE;
>>>>>> I also found a thread on the LAM-MPI site:
>>>>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
>>>>>> However, this thread seems to indicate that the problem would be in the
>>>>>> application, (/workspace/bronke/mpi/hello in this case) but there are no
>>>>>> pipes in use in this app, and the fact that it works as expected as root
>>>>>> doesn't seem to fit either. I have tried running mpirun with --verbose
>>>>>> and it doesn't show any more output than without it, so I've run into a
>>>>>> sort of dead-end on this issue. Does anyone know of any way I can figure
>>>>>> out what's going wrong or how I can fix it?
>>>>>>
>>>>>> Thanks!
>>>>>> --
>>>>>> David H. Bronke
>>>>>> Lead Programmer
>>>>>> G33X Nexus Entertainment
>>>>>> http://games.g33xnexus.com/precursors/
>>>>>>
>>>>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7
>>>>>> p6
>>>>>> hackerkey.com
>>>>>> Support Web Standards! http://www.webstandards.org/
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>