Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Steve Kargl (sgk_at_[hidden])
Date: 2011-07-08 14:48:42


On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
> On Jul 8, 2011, at 1:31 PM, Steve Kargl wrote:
>
> > It seems that openmpi-1.4.4 compiled code is trying to use the
> > wrong nic. My /etc/hosts file has
> >
> > 10.208.78.111 hpc.apl.washington.edu hpc
> > 192.168.0.10 node10.cimu.org node10 n10 master
> > 192.168.0.11 node11.cimu.org node11 n11
> > 192.168.0.12 node12.cimu.org node12 n12
> > ... down to ...
> > 192.168.0.21 node21.cimu.org node21 n21
> >
> > Note, node10 and hpc are the same system (2 different NICs).
>
> Don't confuse the machinefile with the NICs that OMPI will try
> to use. The machinefile is only hosts on which OMPI will launch.
> Specifically: the machinefile does not influence which NICs OMPI
> will use for MPI communications.

Ah, okay. I did not realize that a machinefile does not
limit OMPI to a set of IP addresses.

> > hpc:kargl[268] cat mf_ompi_1
> > node10.cimu.org slots=1
> > node16.cimu.org slots=1
> > hpc:kargl[267] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_1 ./z
> > 0: hpc.apl.washington.edu
> > 1: node16.cimu.org
>
> What function is netmpi.c using to get the hostname that
> is printed? It might be using MPI_Get_processor_name()
> or gethostname() -- both of which may return whatever hostname(1) returns.

After reading the code, it appears that the printed hostname
misled me. The code uses MPI_Get_processor_name().
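
For reference, here is a minimal sketch of why the printed name says
little about which NIC carries the traffic. It assumes, as Jeff suggests
may be the case, that the name reported by MPI_Get_processor_name()
matches what gethostname() returns. The helper name local_hostname() is
hypothetical and is not part of netmpi.c or Open MPI.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the system hostname into buf.
 * Returns 0 on success and -1 on failure.
 * The result is the name configured for the host as a whole
 * (what hostname(1) prints); it is not tied to any one interface.
 */
int local_hostname(char *buf, size_t len)
{
    if (len == 0 || gethostname(buf, len) != 0)
        return -1;
    /* gethostname() is not guaranteed to NUL-terminate on truncation. */
    buf[len - 1] = '\0';
    return 0;
}
```

On a dual-homed machine such as hpc/node10, a call like this would give
the same answer no matter which of the two networks a message travels
over, so "0: hpc.apl.washington.edu" by itself does not show that the
10.x interface was used.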

> > (gdb) bt
> > #0 0x00000003c0bedb9c in kevent () from /lib/libc.so.7
> > #1 0x000000000052d648 in kq_dispatch ()
> > #2 0x000000000052c6c3 in opal_event_base_loop ()
> > #3 0x00000000005260cb in opal_progress ()
> > #4 0x0000000000491d1c in mca_pml_ob1_send ()
> > #5 0x000000000043c753 in PMPI_Send ()
> > #6 0x000000000041a112 in Sync (p=0x7fffffffd4d0) at netmpi.c:573
> > #7 0x000000000041a3cf in DetermineLatencyReps (p=0x3) at netmpi.c:593
> > #8 0x000000000041a4fe in TestLatency (p=0x3) at netmpi.c:630
> > #9 0x000000000041a958 in main (argc=1, argv=0x7fffffffd6a0) at netmpi.c:213
> > (gdb) quit
>
> The easiest way to fix this is likely to use the btl_tcp_if_include
> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly which
> interfaces to use:
>
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Thanks for the pointer. I'll try this solution later.

> Hypothetically, however, OMPI should be able to determine that
> 192.168.0.x is not reachable from the 10.x network (assuming
> your netmasks are set right), and automatically not use the
> 10.x network to reach any of the non-node10 machines.

The assumption is correct. 192.x is independent of 10.x.
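
To make the netmask point concrete, here is a rough sketch of a
same-subnet test. It is only an illustration of the idea and is not
Open MPI's actual reachability code. The function name same_subnet() is
hypothetical, and the 255.255.255.0 netmask in the usage note is an
assumption; the netmasks were not stated in this thread.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: compare two dotted-quad IPv4 addresses under a
 * netmask. Returns 1 if both fall in the same subnet, 0 if they do not,
 * and -1 if any of the three strings fails to parse.
 */
int same_subnet(const char *a, const char *b, const char *netmask)
{
    struct in_addr ia, ib, im;

    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1 ||
        inet_pton(AF_INET, netmask, &im) != 1)
        return -1;

    /* All three values are in network byte order, so masking and
     * comparing them directly is safe. */
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}
```

With the assumed netmask, same_subnet("192.168.0.10", "192.168.0.16",
"255.255.255.0") reports a match, while same_subnet("10.208.78.111",
"192.168.0.16", "255.255.255.0") does not, which is the distinction one
would expect to keep the 10.x interface out of traffic to node16.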

> It's curious that this is not happening; I wonder if this
> is some kind of quirk of OMPI's reachability algorithms
> (http://www.open-mpi.org/faq/?category=tcp#tcp-routability)
> on FreeBSD...?

I just rebuilt 1.4.4rc2 with '-O -g' to get debugging symbols
into openmpi's libraries and executables. Are there any
particular functions that I should inspect?

-- 
Steve