Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Steve Kargl (sgk_at_[hidden])
Date: 2011-07-08 13:31:28


On Thu, Jul 07, 2011 at 08:38:56PM -0400, Jeff Squyres wrote:
> On Jul 5, 2011, at 4:24 PM, Steve Kargl wrote:
> > On Tue, Jul 05, 2011 at 01:14:06PM -0700, Steve Kargl wrote:
> >> I have an application that appears to function as I expect
> >> when compiled with openmpi-1.4.2 on FreeBSD 9.0. But, it
> >> appears to hang during communication between nodes. What
> >> follows is the long version.
> >
> > Argh I messed up. It should read "But, it appears to hang
> > during communication between nodes when using 1.4.3 or 1.4.4."
> >
> Are you able to run simple MPI applications with 1.4.3 or 1.4.4
> on your OS? E.g., the "ring_c" program in the example/ directory?
> This might be a good test to see if OMPI's TCP is working at all.
>
> Assuming that works... Have you tried attaching debuggers to see
> where your process is hanging? There might be a logic issue in
> your app that isn't-quite-legal-MPI but works with some amount
> of buffering, but fails if the amount of buffering is reduced.

It seems that code compiled with openmpi-1.4.4 is trying to use
the wrong NIC. My /etc/hosts file has

10.208.78.111 hpc.apl.washington.edu hpc
192.168.0.10 node10.cimu.org node10 n10 master
192.168.0.11 node11.cimu.org node11 n11
192.168.0.12 node12.cimu.org node12 n12
... down to ...
192.168.0.21 node21.cimu.org node21 n21

Note that node10 and hpc are the same system (two different NICs).

hpc:kargl[252] /usr/local/openmpi-1.4.4/bin/mpif90 -o z -g -O ring_f90.f90
hpc:kargl[253] cat > mf1
node10 slots=1
node11 slots=1
node12 slots=1
hpc:kargl[254] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf1 ./z
 Process 0 sending 10 to 1 tag 201 ( 3 processes in ring)

In another xterm, if I attach gdb to the process on node10,
I see:

(gdb) bt
#0 0x00000003c10f9b9c in kevent () from /lib/libc.so.7
#1 0x000000000052ca18 in kq_dispatch ()
#2 0x000000000052ba93 in opal_event_base_loop ()
#3 0x000000000052549b in opal_progress ()
#4 0x000000000048fcfc in mca_pml_ob1_send ()
#5 0x0000000000428873 in PMPI_Send ()
#6 0x000000000041a890 in pmpi_send__ ()
#7 0x000000000041a3f0 in ring () at ring_f90.f90:34
#8 0x000000000041a640 in main (argc=<value optimized out>,
    argv=<value optimized out>) at ring_f90.f90:10
#9 0x000000000041a1cc in _start ()
(gdb) quit

Now, eliminating node10 from the machine file, I see:

hpc:kargl[255] cat > mf2
node11 slots=1
node12 slots=1
node13 slots=1
hpc:kargl[256] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf2 ./z
 Process 0 sending 10 to 1 tag 201 ( 3 processes in ring)
 Process 0 sent to 1
 Process 0 decremented value: 9
 Process 0 decremented value: 8
 Process 0 decremented value: 7
 Process 0 decremented value: 6
 Process 0 decremented value: 5
 Process 0 decremented value: 4
 Process 0 decremented value: 3
 Process 0 decremented value: 2
 Process 0 decremented value: 1
 Process 0 decremented value: 0
 Process 0 exiting
 Process 1 exiting
 Process 2 exiting

I also have a simple MPI test program, netmpi.c, from Argonne.
It shows:

hpc:kargl[263] /usr/local/openmpi-1.4.4/bin/mpicc -o z -g -O GetOpt.c netmpi.c
hpc:kargl[264] cat mf_ompi_3
node11.cimu.org slots=1
node16.cimu.org slots=1
hpc:kargl[265] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_3 ./z
1: node16.cimu.org
0: node11.cimu.org
Latency: 0.000073617
Sync Time: 0.000147234
Now starting main loop
  0: 0 bytes 16384 times --> 0.00 Mbps in 0.000073612 sec
  1: 1 bytes 16384 times --> 0.10 Mbps in 0.000073612 sec
  2: 2 bytes 3396 times --> 0.21 Mbps in 0.000073611 sec
  3: 3 bytes 1698 times --> 0.31 Mbps in 0.000073609 sec
  4: 5 bytes 2264 times --> 0.52 Mbps in 0.000073610 sec
  5: 7 bytes 1358 times --> 0.73 Mbps in 0.000073608 sec

hpc:kargl[268] cat mf_ompi_1
node10.cimu.org slots=1
node16.cimu.org slots=1
hpc:kargl[267] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_1 ./z
0: hpc.apl.washington.edu
1: node16.cimu.org

(gdb) bt
#0 0x00000003c0bedb9c in kevent () from /lib/libc.so.7
#1 0x000000000052d648 in kq_dispatch ()
#2 0x000000000052c6c3 in opal_event_base_loop ()
#3 0x00000000005260cb in opal_progress ()
#4 0x0000000000491d1c in mca_pml_ob1_send ()
#5 0x000000000043c753 in PMPI_Send ()
#6 0x000000000041a112 in Sync (p=0x7fffffffd4d0) at netmpi.c:573
#7 0x000000000041a3cf in DetermineLatencyReps (p=0x3) at netmpi.c:593
#8 0x000000000041a4fe in TestLatency (p=0x3) at netmpi.c:630
#9 0x000000000041a958 in main (argc=1, argv=0x7fffffffd6a0) at netmpi.c:213
(gdb) quit

Why is hpc.apl.washington.edu appearing instead of node10.cimu.org?
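
One possible angle, offered as a guess and not a diagnosis: if netmpi.c
prints the name returned by MPI_Get_processor_name, then as far as I
know that is the host's own hostname, not the name listed in the
machinefile. The sketch below uses only POSIX calls (no MPI) to print
what a host calls itself and the first IPv4 address that name resolves
to. first_ipv4() is a hypothetical helper written for this
illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve "name" and write its first IPv4 address
 * into buf as a dotted quad. Returns 0 on success, -1 on failure.
 */
static int first_ipv4(const char *name, char *buf, size_t len)
{
    struct addrinfo hints, *res = NULL;
    const struct sockaddr_in *sin;
    int ok;

    assert(len >= INET_ADDRSTRLEN);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;       /* IPv4 only, to match the hosts file */
    hints.ai_socktype = SOCK_STREAM; /* avoid one result per socket type */

    if (getaddrinfo(name, NULL, &hints, &res) != 0 || res == NULL)
        return -1;

    sin = (const struct sockaddr_in *)res->ai_addr;
    ok = inet_ntop(AF_INET, &sin->sin_addr, buf, (socklen_t)len) != NULL;
    freeaddrinfo(res);
    return ok ? 0 : -1;
}

int main(void)
{
    char host[256];
    char ip[INET_ADDRSTRLEN];

    if (gethostname(host, sizeof host) != 0) {
        perror("gethostname");
        return 1;
    }
    host[sizeof host - 1] = '\0'; /* gethostname may not NUL-terminate */

    printf("hostname:    %s\n", host);
    if (first_ipv4(host, ip, sizeof ip) == 0)
        printf("resolves to: %s\n", ip);
    else
        printf("resolves to: (no IPv4 address found)\n");
    return 0;
}
```

Running this on the master and comparing the address with the
192.168.0.10 entry would show whether the name the system reports for
itself points at the cluster-side NIC or the other one.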

-- 
Steve