Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-10-31 01:22:14


On Oct 30, 2007, at 9:42 AM, Jorge Parra wrote:

> Thank you for your reply. Linux does not freeze. The one that freezes
> is Open MPI. Sorry for my inaccurate choice of words that led to
> confusion. Consequently, dmesg does not show anything abnormal (I
> attached a full dmesg log to this email, captured when Open MPI
> freezes).
>
> When Open MPI freezes I can, from another terminal, see that the node
> on which Open MPI was originally run (the local one) has two
> processes: orted and mpirun. The remote node has one: orted. This
> seems to be normal. However, there is no Open MPI activity on either
> node. There is only an initial "Calling init" printout on the local
> node (I included it in the greetings.c program for testing purposes).
>
> Unfortunately, I have not been able to compile Open MPI 1.2.4 or any
> of the other 1.2 series versions. The 1.0 and 1.1 series compiled
> well on my system. I already opened a case for this, but I received
> a message that the person it was assigned to is on paternity leave.
> So I think I need to wait a bit for help on that :). So I am stuck
> with version 1.1.5.

Are you referring to this thread:

     http://www.open-mpi.org/community/lists/users/2007/10/4218.php

There's currently only one person on paternity leave, and although he
is the PowerPC guy :-), he's not really the build system guy (I'm
kinda *guessing* that either OMPI or libltdl is choosing to build or
link the wrong object -- but that's a SWAG without seeing any
additional information).

I sent you a reply on 24 Oct asking for a bit more information:

     http://www.open-mpi.org/community/lists/users/2007/10/4310.php
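
In the meantime, a sanity check: I don't have your greetings.c in
front of me as I write this, but the kind of minimal test I'd use is
something like the sketch below (the printf before MPI_Init matches
the "Calling init" output you described; the fflush just makes sure
the message actually appears before a potential hang):

     #include <stdio.h>
     #include <mpi.h>

     int main(int argc, char *argv[])
     {
         int rank, size, len;
         char name[MPI_MAX_PROCESSOR_NAME];

         printf("Calling init\n");
         fflush(stdout);  /* ensure output is visible even if we hang */

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         MPI_Get_processor_name(name, &len);
         printf("Greetings from rank %d of %d on %s\n", rank, size, name);
         MPI_Finalize();
         return 0;
     }

If a test that small still hangs in MPI_Init across two nodes but runs
fine on one, the problem is almost certainly in the startup/wireup
between the nodes, not in the application code.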

> I am running Open MPI as root because my system has some special
> conditions. This is an attempt to make an embedded massively parallel
> processor (MPP), so the nodes are running embedded versions of Linux,
> where normally there is just one user (root). Since this is an
> isolated system, I did not think this could be a problem (security is
> not a concern either).
>
> Again, thank you for all your help,
>
> Jorge
>
>
>
> On Tue, 30 Oct 2007, Rainer Keller wrote:
>
>> Hello Jorge,
>> On Monday 29 October 2007 18:27, Jorge Parra wrote:
>>> When running Open MPI, my system freezes when initializing MPI
>>> (in MPI_Init). This happens only when I try to run the process on
>>> multiple nodes in my cluster. Running multiple instances of the
>>> testing code locally (i.e., ./mpirun -np 2 greetings) is
>>> successful.
>> Would it be possible to repeat the tests with the latest Open MPI
>> 1.2.4 version?
>>
>> That said, nothing in Open MPI should be able to make your system
>> freeze. Could you check the logs on the nodes and possibly capture
>> a dmesg just before the MPI_Init...
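
(To expand on Rainer's suggestion: snapshotting dmesg on each node
right before the run and again after the hang, then diffing the two,
will show whether the kernel logged anything while Open MPI was stuck.
Just a sketch; the log paths are arbitrary:

     dmesg > /tmp/dmesg-before.log
     ./mpirun --hostfile /root/hostfile -np 2 greetings
     # ... after the hang, kill mpirun (e.g., ctrl-C), then:
     dmesg > /tmp/dmesg-after.log
     diff /tmp/dmesg-before.log /tmp/dmesg-after.log

An empty diff would confirm that the kernel isn't involved.)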
>>
>>> - rsh runs well and is configured for full access (i.e., "rsh
>>> 192.168.1.103 date" is successful, as are "rsh AFRLMPPBM2 date" and
>>> "rsh AFRLMPPBM2.MPPdomain.com date"). Security is not an issue in
>>> this system.
>>>
>>> - uname -n and hostname return a valid hostname
>>>
>>> - The testing code (attached to this email) is run (and fails) as:
>>> "./mpirun --hostfile /root/hostfile -np 2 greetings". The hostfile
>>> has the names of the local node (first entry: AFRLMPPBM1) and the
>>> remote node (second entry: AFRLMPPBM2). This file is also attached
>>> to this email.
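
(Side note: Open MPI's hostfile format is just one host per line,
optionally with a slot count, so I'd expect the attached file to look
something like:

     AFRLMPPBM1
     AFRLMPPBM2

A line such as "AFRLMPPBM1 slots=2" would allow two processes on that
node.)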
>>>
>>> - The environment variables seem to be properly set (see the
>>> attached env.log file). Local MPI programs (i.e., "./mpirun -np 2
>>> greetings") run well.
>>>
>>> - .profile has the path information for both the executables and
>>> the libraries.
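
(For reference, that usually amounts to something along these lines in
.profile on every node -- /usr/local is just an assumed installation
prefix here; substitute whatever --prefix you configured Open MPI
with:

     export PATH=/usr/local/bin:$PATH
     export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Since rsh starts a non-interactive shell on the remote side, it's
worth double-checking that the remote shell really picks these up --
though the orted launch line below does source ./.profile explicitly,
which is good.)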
>>>
>>> - orted runs on the remote node; however, it does not print
>>> anything to the console. The only output on the remote node is:
>>>
>>> pam_rhosts_auth[235]: user root has a `+' user entry
>>> pam_rhosts_auth[235]: allowed to root_at_[hidden] as root
>>> PAM_unix[235]: (rsh) session opened for user root by (uid=0)
>>> in.rshd[236]: root_at_[hidden] as root: cmd='( ! [ -e ./.profile ]
>>> || . ./.profile; orted --bootproxy 1 --name 0.0.1 --num_procs 3
>> You're running as root? Why is that?
>>
>>> Then the remote process returns to the command prompt; however,
>>> orted remains in the background. The local process is frozen and
>>> just prints "Calling init", which comes just before MPI_Init (see
>>> greetings.c).
>>>
>>> I believe MPI_COMM_WORLD cannot be correctly initialized. However,
>>> I can't see which part of my configuration is wrong.
>>>
>>> Any help is greatly appreciated.
>>
>> With best regards,
>> Rainer
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems