Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Test works with 3 computers, but not 4?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-29 16:57:58


Ah, so there is a firewall involved? That is always a problem. I
gather that node 126 has clear access to all other nodes, but nodes
122, 123, and 125 do not all have access to each other?

See if your admin is willing to open at least one port on each node
that is reachable from all the other nodes. It is easiest if it is the
same port on every node, but that is not required. Then you can try
setting the MCA params oob_tcp_port_minv4 and oob_tcp_port_rangev4.
This should allow the daemons to communicate.
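For example, if your admin opens 25 consecutive TCP ports starting at
10000 on every node (those numbers are just an illustration - substitute
whatever range actually gets opened), the invocation would look
something like:

  mpirun -mca oob_tcp_port_minv4 10000 -mca oob_tcp_port_rangev4 25 -H <your hosts> hello-mpi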

Check 'ompi_info --param oob tcp' for info on those (and other) params.

Ralph

On Jul 29, 2009, at 2:46 PM, David Doria wrote:

>
> On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <rhc_at_[hidden]>
> wrote:
> Using direct can cause scaling issues as every process will open a
> socket to every other process in the job. You would at least have to
> ensure you have enough file descriptors available on every node.
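> (On Linux you can check the current per-process limit with 'ulimit -n'
> and raise it via /etc/security/limits.conf if it is too low.)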
>
> The most likely cause is either (a) a different OMPI version getting
> picked up on one of the nodes, or (b) something blocking
> communication with at least one of your other nodes. I would
> suspect the latter - perhaps a firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems
> strange. Try adding --debug-daemons to the cmd line. That should
> definitely generate output from every daemon (at the least, they
> report they are alive).
>
> Ralph
>
> Nifty - I used MPI_Get_processor_name and, as you said, this is much
> more helpful output. I also checked all the versions and they seem to
> be fine - 'mpirun -V' says 1.3.3 on all 4 machines.
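> For reference, a hello-mpi along these lines is just the usual pattern
> (a minimal sketch, assuming the standard MPI C API - not my exact source):
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int rank, size, len;
>     char name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
>     MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
>     MPI_Get_processor_name(name, &len);     /* host this rank runs on */
>
>     printf("Process %d on %s out of %d\n", rank, name, size);
>
>     MPI_Finalize();
>     return 0;
> }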
>
> The output with '-mca routed direct' is now (correctly):
> [doriad_at_daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
> Process 0 on daviddoria out of 4
> Process 1 on cloud3 out of 4
> Process 2 on cloud4 out of 4
> Process 3 on cloud6 out of 4
>
> Here is the output with --debug-daemons.
>
> Is there a particular port / set of ports I can have my system admin
> unblock on the firewall to see if that fixes it?
>
> [doriad_at_daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi
>
> Daemon was launched on cloud3 - beginning to initialize
> Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
> Daemon [[9461,0],1] not using static ports
> [cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
> Daemon was launched on cloud4 - beginning to initialize
> Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
> Daemon [[9461,0],2] not using static ports
> [cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
> Daemon was launched on cloud6 - beginning to initialize
> Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
> Daemon [[9461,0],3] not using static ports
> [daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
> [daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
> [daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
> [daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
> [daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
> [cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
> [cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
> [cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
> [cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
> [cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
> [cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
> [cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
> [cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
> [cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
> [cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
> [cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
> [daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
> [daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
> [cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
> [daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
> [cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd
>
> Any more thoughts?
>
> Thanks,
>
> David
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users