
Subject: [OMPI users] Cannot launch slots on more than 2 remote machines
From: Igor (nightonearth_at_[hidden])
Date: 2011-03-28 15:24:19


Hello,

First off, I'm a complete MPI newbie. I have installed
openmpi-1.4.3-1.fc13.i686 on an IBM blade cluster running Fedora. I
can launch as many slots as I want on remote machines, as long as I
connect to no more than two of them (it doesn't matter which two).

For example, I run my MPI task from "cluster" with the following hostfile:

cluster slots=1 max-slots=1
cluster3 slots=1
cluster5 slots=1
cluster1 slots=1

If I now run:
[username_at_cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 3 --hostfile
/home/username/.mpi_hostfile hostname

The output is
cluster.mydomain.ca
cluster3.mydomain.ca
cluster5.mydomain.ca

If I run:
[username_at_cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile
/home/username/.mpi_hostfile hostname
I expect to see:
cluster.mydomain.ca
cluster3.mydomain.ca
cluster5.mydomain.ca
cluster1.mydomain.ca

Instead, I see the same output as when running 3 processes (-np 3),
and the task hangs.

Below is the output when I run mpirun with the --debug-daemons flag. The
same behaviour is seen: the run with "-np 3" completes, while the run with "-np 4" hangs:

################################
[username_at_cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np
3 --hostfile /home/username/.mpi_hostfile hostname
Daemon was launched on cluster3.mydomain.ca - beginning to initialize
Daemon was launched on cluster5.mydomain.ca - beginning to initialize
Daemon [[12927,0],1] checking in as pid 3096 on host cluster3.mydomain.ca
Daemon [[12927,0],1] not using static ports
[cluster3.mydomain.ca:03096] [[12927,0],1] orted: up and running -
waiting for commands!
Daemon [[12927,0],2] checking in as pid 11301 on host cluster5.mydomain.ca
Daemon [[12927,0],2] not using static ports
[cluster.mydomain.ca:12279] [[12927,0],0] node[0].name cluster daemon
0 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] orted: up and running -
waiting for commands!
[cluster.mydomain.ca:12279] [[12927,0],0] node[1].name cluster3 daemon
1 arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] node[2].name cluster5 daemon
2 arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] node[3].name cluster1 daemon
INVALID arch ffca0200
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received add_local_procs
[cluster3.mydomain.ca:03096] [[12927,0],1] node[0].name cluster daemon
0 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[1].name cluster3
daemon 1 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[2].name cluster5
daemon 2 arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] node[3].name cluster1
daemon INVALID arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[0].name cluster daemon
0 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[1].name cluster3
daemon 1 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[2].name cluster5
daemon 2 arch ffca0200
[cluster5.mydomain.ca:11301] [[12927,0],2] node[3].name cluster1
daemon INVALID arch ffca0200
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received add_local_procs
cluster.mydomain.ca
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received waitpid_fired cmd
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received iof_complete cmd
cluster3.mydomain.ca
cluster5.mydomain.ca
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received iof_complete cmd
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received iof_complete cmd
[cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received exit
[cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received exit
[cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received exit
[cluster3.mydomain.ca:03096] [[12927,0],1] orted: finalizing
[cluster5.mydomain.ca:11301] [[12927,0],2] orted: finalizing

################################
[username_at_cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np
4 --hostfile /home/username/.mpi_hostfile hostname
Daemon was launched on cluster5.mydomain.ca - beginning to initialize
Daemon was launched on cluster3.mydomain.ca - beginning to initialize
Daemon [[12919,0],2] checking in as pid 11325 on host cluster5.mydomain.ca
Daemon [[12919,0],2] not using static ports
[cluster5.mydomain.ca:11325] [[12919,0],2] orted: up and running -
waiting for commands!
Daemon was launched on cluster1.mydomain.ca - beginning to initialize
Daemon [[12919,0],1] checking in as pid 3120 on host cluster3.mydomain.ca
Daemon [[12919,0],1] not using static ports
[cluster3.mydomain.ca:03120] [[12919,0],1] orted: up and running -
waiting for commands!
Daemon [[12919,0],3] checking in as pid 5623 on host cluster1.mydomain.ca
Daemon [[12919,0],3] not using static ports
[cluster1.mydomain.ca:05623] [[12919,0],3] orted: up and running -
waiting for commands!
[cluster.mydomain.ca:12287] [[12919,0],0] node[0].name cluster daemon
0 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[1].name cluster3 daemon
1 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[2].name cluster5 daemon
2 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] node[3].name cluster1 daemon
3 arch ffca0200
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11325] [[12919,0],2] node[0].name cluster daemon
0 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[1].name cluster3
daemon 1 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[2].name cluster5
daemon 2 arch ffca0200
[cluster5.mydomain.ca:11325] [[12919,0],2] node[3].name cluster1
daemon 3 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[0].name cluster daemon
0 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[1].name cluster3
daemon 1 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[2].name cluster5
daemon 2 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] node[3].name cluster1
daemon 3 arch ffca0200
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received add_local_procs
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received add_local_procs
cluster.mydomain.ca
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received waitpid_fired cmd
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received iof_complete cmd
cluster3.mydomain.ca
cluster5.mydomain.ca
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received waitpid_fired cmd
[cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received iof_complete cmd
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received waitpid_fired cmd
[cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received iof_complete cmd
<<<<<<<<<<<<THE PROCESS HANGS HERE>>>>>>>>>>>>
^CKilled by signal 2.
Killed by signal 2.
Killed by signal 2.
--------------------------------------------------------------------------
A daemon (pid 12288) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received exit
mpirun: clean termination accomplished

################################

Notes:
1. Passwordless ssh login between all cluster# machines works fine.
2. It doesn't matter which machines I specify in .mpi_hostfile. I can
always connect to one or two of them, but I get the freeze when I try
three or more.
3. I installed Open MPI with Fedora's yum package manager. By default,
it installed into /usr/lib/openmpi/ instead of the
/opt/openmpi-... prefix mentioned throughout the Open MPI FAQ. I
can't imagine that being a problem...
4. Supplying PATH and LD_LIBRARY_PATH: the Open MPI FAQ says that
"specifying the absolute pathname to mpirun is equivalent to using the
--prefix argument", so that is what I chose, after reading all the
scaremongering about modifying LD_LIBRARY_PATH :) Adding
"/usr/lib/openmpi/lib" to the otherwise empty LD_LIBRARY_PATH produces
the same results (a sketch of both approaches is below).

Can someone suggest a possible solution or at least a direction in
which I should continue my troubleshooting?

-- 
Thank you all for your time,
Igor