Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Cannot launch slots on more than 2 remote machines
From: Igor (nightonearth_at_[hidden])
Date: 2011-03-28 17:17:43


Thank you for your help! The issue is definitely the firewall. Since I
don't plan on having any communication between the "slave" nodes of my
cluster (SPMD with no cross-talk), and the cluster is fairly small,
I'll stick with option 2 for now.
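
For anyone landing on this thread later, here is a minimal sketch of option 2 applied to the command from the original message. It only assembles the argument list. The "-mca routed direct" spelling is taken from Ralph's reply quoted below; the helper name build_mpirun_cmd and its parameters are made up for illustration and are not part of Open MPI.

```python
def build_mpirun_cmd(np, hostfile, program, direct_routing=True,
                     mpirun="/usr/lib/openmpi/bin/mpirun"):
    """Assemble an mpirun argument list (hypothetical helper)."""
    cmd = [mpirun]
    if direct_routing:
        # Option 2 from the reply: mpirun talks to each daemon directly
        # instead of relaying the launch message through other daemons.
        cmd += ["-mca", "routed", "direct"]
    cmd += ["-np", str(np), "--hostfile", hostfile, program]
    return cmd


cmd = build_mpirun_cmd(4, "/home/username/.mpi_hostfile", "hostname")
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) on the node where mpirun is installed.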

On Mon, Mar 28, 2011 at 3:43 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> It is hanging because your last nodes are not receiving the launch command.
>
> The daemons receive a message from mpirun telling them what to launch. That message is sent via a tree-like routing algorithm. So mpirun sends to the first two daemons, each of which relays it on to some number of daemons, each of which relays it to another number, etc.
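
A toy model may make the tree relay described above easier to picture. The sketch below is not Open MPI's routing code: the binary-tree layout in children(), the function names, and the firewall rule head_node_only are all assumptions chosen for illustration. Rank 0 stands for mpirun and ranks 1..n-1 for the remote daemons.

```python
from collections import deque


def children(rank, n_daemons, fanout=2):
    """Ranks that `rank` relays to in a simple k-ary tree (toy model)."""
    first = rank * fanout + 1
    return [c for c in range(first, first + fanout) if c < n_daemons]


def reached(n_daemons, link_ok, fanout=2):
    """Return the set of ranks that receive a message sent from rank 0.

    link_ok(src, dst) says whether src can open a connection to dst.
    """
    seen = {0}
    queue = deque([0])
    while queue:
        src = queue.popleft()
        for dst in children(src, n_daemons, fanout):
            if link_ok(src, dst):
                seen.add(dst)
                queue.append(dst)
    return seen


def head_node_only(src, dst):
    """Assumed firewall: only connections starting at rank 0 get through."""
    return src == 0


print(sorted(reached(3, head_node_only)))  # two remote daemons   -> [0, 1, 2]
print(sorted(reached(4, head_node_only)))  # three remote daemons -> [0, 1, 2]
```

With three remote daemons, rank 3 sits below rank 1 in this toy tree, so it is reached only if rank 1 can connect to it. Passing a fanout at least as large as the number of remote daemons, e.g. reached(4, head_node_only, fanout=3), makes rank 0 the parent of every daemon, so all of them are reached.
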
>
> What is happening here is that the first pair of daemons are not relaying the message on to the next layer. You can try a couple of things:
>
> 1. ensure that it is possible for a daemon on one node to open a TCP socket to any other node - i.e., that a daemon on cluster1 (for example) can open a socket to cluster5 and send a message across. You might have a firewall in the way, or some other prohibition blocking this connection.
>
> 2. given the small size of the cluster, add "-mca routed direct" to your command line. This will tell mpirun to talk directly to each daemon. However, note that your job may still fail as the procs won't be able to open sockets to their peers to send MPI messages, if you use TCP for the MPI transport.
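
Option 1 can be checked by hand before involving mpirun. The sketch below is one possible way to do it, assuming Python is available on the nodes; can_connect is a hypothetical helper, and the host name and port in the usage comment are placeholders. Since the daemons in the log below report "not using static ports", a single successful probe presumably shows only that one port is open, not that every port a daemon might pick is reachable.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    A refused connection (nothing listening) and a dropped one (for
    example a firewall silently discarding packets until the timeout)
    both return False; this sketch does not tell them apart.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example with placeholder values: run on cluster1 while some listener
# is active on port 5000 of cluster5:
#   print(can_connect("cluster5", 5000))
```

Running the probe from one worker node against another tests the daemon-to-daemon path that the relay depends on. A probe from the node running mpirun tests only the path that already works.
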
>
> Ralph
>
> On Mar 28, 2011, at 1:24 PM, Igor wrote:
>
>> Hello,
>>
>> First off, complete MPI newbie here. I have installed
>> openmpi-1.4.3-1.fc13.i686 on an IBM blade cluster running Fedora. I
>> can open as many slots as I want on remote machines, as long as I only
>> connect to two machines (doesn't matter which two).
>>
>> For example, I run my mpi task from "cluster" and if my hostfile is:
>>
>> cluster slots=1 max-slots=1
>> cluster3 slots=1
>> cluster5 slots=1
>> cluster1 slots=1
>>
>> If I now run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 3 --hostfile /home/username/.mpi_hostfile hostname
>>
>> The output is
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>>
>> If I run:
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun -np 4 --hostfile /home/username/.mpi_hostfile hostname
>> I expect to see:
>> cluster.mydomain.ca
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> cluster1.mydomain.ca
>>
>> Instead, I see the same output as when running 3 processes (-np 3),
>> and the task hangs.
>>
>> Below is the output when I run mpirun with the --debug-daemons option.
>> The same behaviour is seen, and the process hangs when "-np 4" is requested:
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 3 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon [[12927,0],1] checking in as pid 3096 on host cluster3.mydomain.ca
>> Daemon [[12927,0],1] not using static ports
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: up and running -
>> waiting for commands!
>> Daemon [[12927,0],2] checking in as pid 11301 on host cluster5.mydomain.ca
>> Daemon [[12927,0],2] not using static ports
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: up and running -
>> waiting for commands!
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[1].name cluster3 daemon
>> 1 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[2].name cluster5 daemon
>> 2 arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] node[3].name cluster1 daemon
>> INVALID arch ffca0200
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received add_local_procs
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] node[3].name cluster1
>> daemon INVALID arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11301] [[12927,0],2] node[3].name cluster1
>> daemon INVALID arch ffca0200
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received waitpid_fired cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received iof_complete cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received iof_complete cmd
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received iof_complete cmd
>> [cluster.mydomain.ca:12279] [[12927,0],0] orted_cmd: received exit
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted_cmd: received exit
>> [cluster3.mydomain.ca:03096] [[12927,0],1] orted: finalizing
>> [cluster5.mydomain.ca:11301] [[12927,0],2] orted: finalizing
>>
>> ################################
>> [username@cluster ~]$ /usr/lib/openmpi/bin/mpirun --debug-daemons -np 4 --hostfile /home/username/.mpi_hostfile hostname
>> Daemon was launched on cluster5.mydomain.ca - beginning to initialize
>> Daemon was launched on cluster3.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],2] checking in as pid 11325 on host cluster5.mydomain.ca
>> Daemon [[12919,0],2] not using static ports
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted: up and running -
>> waiting for commands!
>> Daemon was launched on cluster1.mydomain.ca - beginning to initialize
>> Daemon [[12919,0],1] checking in as pid 3120 on host cluster3.mydomain.ca
>> Daemon [[12919,0],1] not using static ports
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted: up and running -
>> waiting for commands!
>> Daemon [[12919,0],3] checking in as pid 5623 on host cluster1.mydomain.ca
>> Daemon [[12919,0],3] not using static ports
>> [cluster1.mydomain.ca:05623] [[12919,0],3] orted: up and running -
>> waiting for commands!
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[1].name cluster3 daemon
>> 1 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[2].name cluster5 daemon
>> 2 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] node[3].name cluster1 daemon
>> 3 arch ffca0200
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster5.mydomain.ca:11325] [[12919,0],2] node[3].name cluster1
>> daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[0].name cluster daemon
>> 0 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[1].name cluster3
>> daemon 1 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[2].name cluster5
>> daemon 2 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] node[3].name cluster1
>> daemon 3 arch ffca0200
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received add_local_procs
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received add_local_procs
>> cluster.mydomain.ca
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received waitpid_fired cmd
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received iof_complete cmd
>> cluster3.mydomain.ca
>> cluster5.mydomain.ca
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received waitpid_fired cmd
>> [cluster3.mydomain.ca:03120] [[12919,0],1] orted_cmd: received iof_complete cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received waitpid_fired cmd
>> [cluster5.mydomain.ca:11325] [[12919,0],2] orted_cmd: received iof_complete cmd
>> <<<<<<<<<<<<THE PROCESS HANGS HERE>>>>>>>>>>>>
>> ^CKilled by signal 2.
>> Killed by signal 2.
>> Killed by signal 2.
>> --------------------------------------------------------------------------
>> A daemon (pid 12288) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> [cluster.mydomain.ca:12287] [[12919,0],0] orted_cmd: received exit
>> mpirun: clean termination accomplished
>>
>> ################################
>>
>> Notes:
>> 1. Passwordless ssh login between all cluster# machines works fine.
>> 2. It doesn't matter which two machines I specify in .mpi_hostfile. I
>> can always connect to 1 or 2 of them, and get the freeze when I try 3
>> or more.
>> 3. I installed Open MPI using the yum installer of Fedora. By default,
>> it chose /usr/lib/openmpi/ as the install directory, instead of the
>> /opt/openmpi-... that is mentioned throughout the Open MPI FAQ. I
>> can't imagine that to be a problem...
>> 4. Supplying PATH and LD_LIBRARY_PATH: The Open MPI FAQ says
>> "specifying the absolute pathname to mpirun is equivalent to using the
>> --prefix argument", so that's what I chose, after reading all the
>> scaremongering about modifying LD_LIBRARY_PATH :) Adding
>> "/usr/lib/openmpi/lib" to the otherwise empty LD_LIBRARY_PATH produces
>> the same results.
>>
>> Can someone suggest a possible solution or at least a direction in
>> which I should continue my troubleshooting?
>>
>> --
>>
>> Thank you all for your time,
>> Igor
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Regards,
Igor