Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: Ethan Deneault (edeneault_at_[hidden])
Date: 2010-09-20 23:06:50


David,

I did try that after I sent the original mail, but the -np 4 flag doesn't fix the problem; the program still hangs. I've also double-checked the iptables rules for the node image and for the master node, and all ports are set to ACCEPT.
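
For reference, the kind of check I mean is along these lines (a sketch only;
the subnet below is a placeholder for whatever the cluster actually uses):

(as root, on the master and in the node image)
iptables -L -n -v                                # every policy/rule should be ACCEPT
iptables -A INPUT -s 192.168.1.0/24 -j ACCEPT    # blanket-accept the cluster subnet

Open MPI 1.4 uses dynamic TCP ports for its daemons and MPI traffic, so
per-port rules are fragile; a subnet-wide ACCEPT is the simplest setting to
test with.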

Cheers,
Ethan

--
Dr. Ethan Deneault
Assistant Professor of Physics
SC 234
University of Tampa
Tampa, FL 33606
-----Original Message-----
From: users-bounces_at_[hidden] on behalf of David Zhang
Sent: Mon 9/20/2010 9:58 PM
To: Open MPI Users
Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
 
I don't know if this will help, but try
mpirun --machinefile testfile -np 4 ./test.out
for running 4 processes.
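
If the master has more than one network interface, it might also be worth
pinning Open MPI's TCP traffic to the cluster-facing one, e.g.

mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
    --machinefile testfile -np 4 ./test.out

(eth0 standing in for whatever interface the nodes actually share).
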
On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault <edeneault_at_[hidden]> wrote:
> All,
>
> I am running Scientific Linux 5.5, with Open MPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi,
> but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set
> correctly, as the test program does compile and run.
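>
> For completeness, that setup amounts to something like the following (the
> bin/ and lib/ subdirectories are my assumption about the exact layout under
> that prefix):
>
> export PATH=/usr/lib/openmpi/1.4-gcc/bin:$PATH
> export LD_LIBRARY_PATH=/usr/lib/openmpi/1.4-gcc/lib:$LD_LIBRARY_PATH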
>
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an
> AMD x86_64 machine which serves the diskless node images and /home as an NFS
> mount. I compile all of my programs as 32-bit.
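>
> For concreteness, the build step is something like this (assuming the
> gcc-backed wrapper compilers from this Open MPI build):
>
> $ mpif77 -m32 -o test.out test.f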
>
> My code is a simple hello world:
> $ more test.f
>      program test
>
>      include 'mpif.h'
>      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>
>      call MPI_INIT(ierror)
>      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>      print*, 'node', rank, ': Hello world'
>      call MPI_FINALIZE(ierror)
>      end
>
> If I run this program with:
>
> $ mpirun --machinefile testfile ./test.out
>  node           0 : Hello world
>  node           2 : Hello world
>  node           1 : Hello world
>
> This is the expected output. Here, testfile contains the master node,
> 'pleiades', and two slave nodes, 'taygeta' and 'm43'.
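>
> i.e. testfile is a plain hostfile with one name per line (Open MPI also
> accepts an optional slots=N per entry, but these are single-processor
> nodes):
>
> pleiades
> taygeta
> m43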
>
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include more
> than 3 hosts, the program will not hang.
>
> I have run it with the --debug-daemons flag as well, and I don't see
> specifically what is wrong.
>
> Working output: pleiades (master) and 2 nodes:
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon [[46344,0],2] checking in as pid 2140 on host m43
> Daemon [[46344,0],2] not using static ports
> [m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
> [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
> [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
> Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
> Daemon [[46344,0],1] not using static ports
> [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
> [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local
> proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc
> [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local
> proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
>  node           0 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
>  node           2 : Hello world
>  node           1 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc
> [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync from local proc
> [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync from local proc
> [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received iof_complete cmd
> [m43:02140] [[46344,0],2] orted_cmd: received waitpid_fired cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received waitpid_fired cmd
> [m43:02140] [[46344,0],2] orted_cmd: received iof_complete cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received iof_complete cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted: finalizing
> [m43:02140] [[46344,0],2] orted_cmd: received exit
> [m43:02140] [[46344,0],2] orted: finalizing
>
> Non-working output: pleiades (master) and 3 nodes:
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon was launched on asterope - beginning to initialize
> Daemon [[46357,0],2] checking in as pid 2181 on host m43
> Daemon [[46357,0],2] not using static ports
> [m43:02181] [[46357,0],2] orted: up and running - waiting for commands!
> Daemon [[46357,0],1] checking in as pid 2358 on host taygeta
> Daemon [[46357,0],1] not using static ports
> [taygeta:02358] [[46357,0],1] orted: up and running - waiting for commands!
> [pleiades:19191] [[46357,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[3].name asterope daemon 3 arch ffca0200
> [pleiades:19191] [[46357,0],0] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[3].name asterope daemon 3 arch ffca0200
> [m43:02181] [[46357,0],2] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[3].name asterope daemon 3 arch ffca0200
> [taygeta:02358] [[46357,0],1] orted_cmd: received add_local_procs
> Daemon [[46357,0],3] checking in as pid 1965 on host asterope
> Daemon [[46357,0],3] not using static ports
> [asterope:01965] [[46357,0],3] orted: up and running - waiting for
> commands!
> [pleiades:19191] [[46357,0],0] orted_recv: received sync+nidmap from local
> proc [[46357,1],0]
> [m43:02181] [[46357,0],2] orted_recv: received sync+nidmap from local proc
> [[46357,1],2]
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
> [m43:02181] [[46357,0],2] orted_cmd: received collective data cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
>
> ------------------
> The output hangs here.
>
> After I kill the process, I get the following output:
> ------------------
>
> Killed by signal 2.
> Killed by signal 2.
> --------------------------------------------------------------------------
> A daemon (pid 19194) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> Killed by signal 2.
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> [pleiades:19191] [[46357,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received iof_complete cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received exit
> mpirun: clean termination accomplished
>
> I know that LD_LIBRARY_PATH is -not- to blame. /home/<user> is exported to
> each machine from the master, and each machine uses the same image (and thus
> the same paths). If there were a problem with the path, it would not run at
> all.
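>
> (The one case that reasoning doesn't cover: non-interactive shells can
> source different startup files than login shells, so a check like
>
> $ ssh asterope env | grep LD_LIBRARY_PATH
>
> would confirm what the remote daemons actually see.)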
>
> Any insight would be appreciated.
>
> Thank you,
> Ethan
>
>
>
> --
> Dr. Ethan Deneault
> Assistant Professor of Physics
> SC-234
> University of Tampa
> Tampa, FL 33615
> Office: (813) 257-3555
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
-- 
David Zhang
University of California, San Diego