
Subject: Re: [OMPI users] Test works with 3 computers, but not 4?
From: David Doria (daviddoria+openmpi_at_[hidden])
Date: 2009-07-29 16:46:18


On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Using direct can cause scaling issues as every process will open a socket
> to every other process in the job. You would at least have to ensure you
> have enough file descriptors available on every node.
> The most likely cause is either (a) a different OMPI version getting picked
> up on one of the nodes, or (b) something blocking communication between at
> least one of your other nodes. I would suspect the latter - perhaps a
> firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems strange.
> Try adding --debug-daemons to the cmd line. That should definitely generate
> output from every daemon (at the least, they report they are alive).
>
> Ralph
>

Nifty, I used MPI_Get_processor_name - as you said, this gives much more
helpful output. I also checked the versions and they seem to be fine -
'mpirun -V' says 1.3.3 on all 4 machines.
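
For reference, hello-mpi is just the usual MPI hello world with
MPI_Get_processor_name added - a minimal sketch along these lines (my actual
source may differ slightly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Initialize MPI and find out rank, job size, and the local hostname */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();
    return 0;
}

Each rank reports the host it is actually running on, which is what makes
the output below useful for spotting which node is the problem.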

The output with '-mca routed direct' is now (correctly):
[doriad_at_daviddoria MPITest]$ mpirun -H
10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons.

Is there a particular port / set of ports I can have my system admin unblock
on the firewall to see if that fixes it?
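
(For example, if the TCP port-range MCA parameters - oob_tcp_port_min_v4 /
oob_tcp_port_range_v4 for the daemons and btl_tcp_port_min_v4 /
btl_tcp_port_range_v4 for MPI traffic - are supported in 1.3.3, I could pin
everything to a known range and ask for just that range to be opened,
something like:

mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 \
    -mca oob_tcp_port_min_v4 10000 -mca oob_tcp_port_range_v4 100 \
    -mca btl_tcp_port_min_v4 10100 -mca btl_tcp_port_range_v4 100 \
    -np 4 hello-mpi

I'm not certain those parameter names apply to this version, though -
'ompi_info --param oob tcp' and 'ompi_info --param btl tcp' should list
whatever is actually available.)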

[doriad_at_daviddoria MPITest]$ mpirun -H
10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached
--debug-daemons -np 4 hello-mpi

Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch
ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local
proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc
[[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd

Any more thoughts?

Thanks,

David