Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
From: David Zhang (solarbikedz_at_[hidden])
Date: 2010-09-20 21:58:23


I don't know if this will help, but try giving the process count explicitly:
mpirun --machinefile testfile -np 4 ./test.out
to run 4 processes.
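
As a rough sketch of what I mean, assuming testfile lists one host per line (the hostnames below are the ones from your message; the NP variable is just my own illustration): count the hosts and pass that number to -np. The snippet only prints the resulting command rather than running it, so it is safe to try anywhere.

```shell
# Write a four-host machinefile, one hostname per line.
# Hostnames are taken from the quoted message; adjust them to your cluster.
cat > testfile <<'EOF'
pleiades
taygeta
m43
asterope
EOF

# Count the non-empty lines, i.e. the number of hosts listed.
NP=$(grep -c . testfile)

# Print (do not run) the mpirun command with an explicit process count.
echo "mpirun --machinefile testfile -np $NP ./test.out"
```

If the printed command looks right, run it directly and see whether one process per host still hangs.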

On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault <edeneault_at_[hidden]> wrote:

> All,
>
> I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
> /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi,
> but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set
> correctly, because the test program does compile and run.
>
> The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an
> AMD x86_64 machine which serves the diskless node images and /home as an NFS
> mount. I compile all of my programs as 32-bit.
>
> My code is a simple hello world:
> $ more test.f
> program test
>
> include 'mpif.h'
> integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>
> call MPI_INIT(ierror)
> call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
> print*, 'node', rank, ': Hello world'
> call MPI_FINALIZE(ierror)
> end
>
> If I run this program with:
>
> $ mpirun --machinefile testfile ./test.out
> node 0 : Hello world
> node 2 : Hello world
> node 1 : Hello world
>
> This is the expected output. Here, testfile contains the master node:
> 'pleiades', and two slave nodes: 'taygeta' and 'm43'
>
> If I add another machine to testfile, say 'asterope', it hangs until I
> ctrl-c it. I have tried every machine, and as long as I do not include more
> than 3 hosts, the program will not hang.
>
> I have also run it with the --debug-daemons flag, and I don't see what
> specifically is wrong.
>
> Working output: pleiades (master) and 2 nodes.
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon [[46344,0],2] checking in as pid 2140 on host m43
> Daemon [[46344,0],2] not using static ports
> [m43:02140] [[46344,0],2] orted: up and running - waiting for commands!
> [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs
> [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs
> Daemon [[46344,0],1] checking in as pid 2317 on host taygeta
> Daemon [[46344,0],1] not using static ports
> [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands!
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200
> [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> node 0 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> node 2 : Hello world
> node 1 : Hello world
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs
> [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs
> [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd
> [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs
> [pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc [[46344,1],0]
> [m43:02140] [[46344,0],2] orted_recv: received sync from local proc [[46344,1],2]
> [taygeta:02317] [[46344,0],1] orted_recv: received sync from local proc [[46344,1],1]
> [pleiades:19178] [[46344,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received iof_complete cmd
> [m43:02140] [[46344,0],2] orted_cmd: received waitpid_fired cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received waitpid_fired cmd
> [m43:02140] [[46344,0],2] orted_cmd: received iof_complete cmd
> [taygeta:02317] [[46344,0],1] orted_cmd: received iof_complete cmd
> [pleiades:19178] [[46344,0],0] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted_cmd: received exit
> [taygeta:02317] [[46344,0],1] orted: finalizing
> [m43:02140] [[46344,0],2] orted_cmd: received exit
> [m43:02140] [[46344,0],2] orted: finalizing
>
> Non-working output: pleiades (master) and 3 nodes:
>
> $ mpirun --debug-daemons --machinefile testfile ./test.out
> Daemon was launched on m43 - beginning to initialize
> Daemon was launched on taygeta - beginning to initialize
> Daemon was launched on asterope - beginning to initialize
> Daemon [[46357,0],2] checking in as pid 2181 on host m43
> Daemon [[46357,0],2] not using static ports
> [m43:02181] [[46357,0],2] orted: up and running - waiting for commands!
> Daemon [[46357,0],1] checking in as pid 2358 on host taygeta
> Daemon [[46357,0],1] not using static ports
> [taygeta:02358] [[46357,0],1] orted: up and running - waiting for commands!
> [pleiades:19191] [[46357,0],0] node[0].name pleiades daemon 0 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[1].name taygeta daemon 1 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[2].name m43 daemon 2 arch ffca0200
> [pleiades:19191] [[46357,0],0] node[3].name asterope daemon 3 arch ffca0200
> [pleiades:19191] [[46357,0],0] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[0].name pleiades daemon 0 arch ffca0200
> [taygeta:02358] [[46357,0],1] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[1].name taygeta daemon 1 arch ffca0200
> [m43:02181] [[46357,0],2] node[2].name m43 daemon 2 arch ffca0200
> [m43:02181] [[46357,0],2] node[3].name asterope daemon 3 arch ffca0200
> [m43:02181] [[46357,0],2] orted_cmd: received add_local_procs
> [taygeta:02358] [[46357,0],1] node[3].name asterope daemon 3 arch ffca0200
> [taygeta:02358] [[46357,0],1] orted_cmd: received add_local_procs
> Daemon [[46357,0],3] checking in as pid 1965 on host asterope
> Daemon [[46357,0],3] not using static ports
> [asterope:01965] [[46357,0],3] orted: up and running - waiting for commands!
> [pleiades:19191] [[46357,0],0] orted_recv: received sync+nidmap from local proc [[46357,1],0]
> [m43:02181] [[46357,0],2] orted_recv: received sync+nidmap from local proc [[46357,1],2]
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
> [m43:02181] [[46357,0],2] orted_cmd: received collective data cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received collective data cmd
>
> ------------------
> The output hangs here.
>
> After I kill the process, I get the following output:
> ------------------
>
> Killed by signal 2.
> Killed by signal 2.
> --------------------------------------------------------------------------
> A daemon (pid 19194) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> Killed by signal 2.
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> [pleiades:19191] [[46357,0],0] orted_cmd: received waitpid_fired cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received iof_complete cmd
> [pleiades:19191] [[46357,0],0] orted_cmd: received exit
> mpirun: clean termination accomplished
>
> I know that LD_LIBRARY_PATH is -not- to blame. /home/<user> is exported to
> each machine from the master, and each machine uses the same image (and thus
> the same paths). If there were a problem with the path, it would not run.
>
> Any insight would be appreciated.
>
> Thank you,
> Ethan
>
>
>
> --
> Dr. Ethan Deneault
> Assistant Professor of Physics
> SC-234
> University of Tampa
> Tampa, FL 33615
> Office: (813) 257-3555

-- 
David Zhang
University of California, San Diego