Ah, so there is a firewall involved? That is always a problem. I gather that node 126 has clear access to all other nodes, but nodes 122, 123, and 125 do not all have access to each other?

See if your admin is willing to open at least one port on each node that all the other nodes can reach. It is easiest if it is the same port on every node, but that is not required. Then you can try setting the MCA params oob_tcp_port_minv4 and oob_tcp_port_rangev4 so that the daemons use the opened ports. This should allow the daemons to communicate.

Check ompi_info --param oob tcp for info on those (and other) params.
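
As a rough illustration of the suggestion above, here is a minimal sketch that only assembles an mpirun command line carrying the two port params; it does not run anything. The param names are copied as written in this message, and the spelling on a given installation may differ (for example in underscore placement), so confirm it with ompi_info first. The port 10000 and range 100 are hypothetical values, and the helper name mpirun_with_ports is made up for this example.

```python
import shlex


def mpirun_with_ports(hosts, program, port_min, port_range,
                      min_param="oob_tcp_port_minv4",
                      range_param="oob_tcp_port_rangev4"):
    """Return an mpirun argument list that passes the two OOB TCP port params.

    The default param names are taken verbatim from the email; pass the
    spelling reported by `ompi_info --param oob tcp` if it differs.
    """
    return [
        "mpirun",
        "-H", ",".join(hosts),
        "-mca", min_param, str(port_min),
        "-mca", range_param, str(port_range),
        "-np", str(len(hosts)),
        program,
    ]


if __name__ == "__main__":
    # Host list from the thread; port values are placeholders to be
    # replaced with whatever the admin actually opens.
    hosts = ["10.1.2.126", "10.1.2.122", "10.1.2.123", "10.1.2.125"]
    cmd = mpirun_with_ports(hosts, "hello-mpi", port_min=10000, port_range=100)
    print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to review the exact flags before trying them on the cluster.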

Ralph

On Jul 29, 2009, at 2:46 PM, David Doria wrote:


On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <rhc@open-mpi.org> wrote:
Using direct can cause scaling issues, since every process will open a socket to every other process in the job. You would at least have to ensure you have enough file descriptors available on every node.
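
To make the file descriptor point concrete, here is a small sketch that could be run on each node. It assumes the relevant limit is the per-process soft limit on open files (what `ulimit -n` reports) and that each process needs one socket per peer; the `reserved` allowance for stdio, libraries, and other files is a hypothetical figure, and the function names are made up for this example.

```python
import resource


def sockets_per_process(num_procs):
    """With direct routing, each process connects to every other process."""
    return max(num_procs - 1, 0)


def fd_headroom(num_procs, reserved=16):
    """Return soft fd limit minus the estimated need, or None if unlimited.

    `reserved` is a guessed allowance for descriptors unrelated to peers.
    A negative result suggests the limit is too low for this job size.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - (sockets_per_process(num_procs) + reserved)


if __name__ == "__main__":
    for n in (4, 1024, 65536):
        print(n, "processes -> headroom:", fd_headroom(n))
```

For a 4-process job like the one in this thread the estimate is only 3 peer sockets per process, so the limit matters mainly for much larger jobs.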

The most likely cause is either (a) a different OMPI version getting picked up on one of the nodes, or (b) something blocking communication between at least one pair of your other nodes. I would suspect the latter - perhaps a firewall or something?
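
One way to test possibility (b) independently of Open MPI is a plain TCP probe run from each node toward the others. The sketch below is an assumption-laden illustration, not an Open MPI tool: the function names and the result labels are made up, and a result of "open" requires that something is actually listening on the chosen port on the target host.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt.

    "open":    the connection succeeded (a listener answered).
    "refused": the host answered but nothing listens on that port, so
               packets are getting through (or a firewall actively rejects).
    "blocked_or_unreachable": timeout, no route, or another socket error,
               which is what a silently dropping firewall tends to look like.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "blocked_or_unreachable"


def report(hosts, port, timeout=3.0):
    """Probe every host on one port and return {host: result}."""
    return {host: probe(host, port, timeout) for host in hosts}


# Example call (not executed here); the port number is a placeholder:
# report(["10.1.2.122", "10.1.2.123", "10.1.2.125"], 10000)
```

Running report() on each of the four nodes in turn would show which pairs cannot talk to each other, which is the information needed to decide what the firewall should allow.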

I'm disturbed by your not seeing any error output - that seems strange. Try adding --debug-daemons to the cmd line. That should definitely generate output from every daemon (at the least, they report they are alive).

Ralph

Nifty, I used MPI_Get_processor_name - as you said, this is much more helpful output. I also checked all the versions and they seem to be fine - 'mpirun -V' says 1.3.3 on all 4 machines.

The output with '-mca routed direct' is now (correctly):
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons.

Is there a particular port / set of ports I can have my system admin unblock on the firewall to see if that fixes it?

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi
Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd

Any more thoughts?

Thanks,

David
 
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users