Wow. I assumed at first that all combinations would be equivalent, but in fact that is not the case...
I've kept the firewalls down during all the tests.
> - on node1, "mpirun --host node1,node2 ring_c"
Works.
> - on node1, "mpirun --host node1,node3 ring_c"
> - on node1, "mpirun --host node2,node3 ring_c"
Blocks after "Process 0 sent to 1".
> - on node1, "mpirun --host node1,node2,node3 ring_c"
"Process 0 sending 10 to 1, tag 201 (3 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9" then blocks
> Repeat all 4 from node2.
On node2:
- "mpirun --host node2,node1 ring_c": OK
- "mpirun --host node2,node3 ring_c": blocks at the same point as above.
- "mpirun --host node1,node3 ring_c": blocks at the same point as above.
- "mpirun --host node1,node2,node3 ring_c": blocks at the same point as in the 3-host case above.
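In case it helps to repeat this matrix from each node without watching every run by hand, here is a minimal Python sketch. It is not something from this thread: the helper names (host_combinations, build_command, run_case) and the 30-second timeout are assumptions made for illustration, and it assumes mpirun and ring_c are reachable from the current directory or PATH. A run that exceeds the timeout is reported as "blocked".

```python
import itertools
import subprocess


def host_combinations(hosts):
    """Return every group of 2 or more hosts, smallest groups first."""
    combos = []
    for size in range(2, len(hosts) + 1):
        combos.extend(itertools.combinations(hosts, size))
    return combos


def build_command(combo, program="ring_c"):
    """Build the mpirun argument list for one host combination."""
    return ["mpirun", "--host", ",".join(combo), program]


def run_case(combo, timeout_s=30):
    """Run one combination; timeout_s is an arbitrary, assumed limit."""
    try:
        result = subprocess.run(
            build_command(combo),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The run did not finish in time, which matches the hangs seen above.
        return "blocked"
    except FileNotFoundError:
        return "mpirun not found"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    for combo in host_combinations(["node1", "node2", "node3"]):
        print(",".join(combo), "->", run_case(combo))
```

Launched on node1 and then on node2, this covers the same four combinations as the manual runs. Processes on the remote hosts may outlive a timed-out mpirun, so it is worth checking for leftovers between runs.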
I recompiled this test program with MPICH2 and get exactly the same issues at the same points.
There is really something wrong with that network...
--
Benjamin Bouvier