Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bug in oob_tcp_[in|ex]clude?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-12-18 14:26:40


On Dec 18, 2007, at 11:12 AM, Marco Sbrighi wrote:

>> Presumably these statements are in a config file that is being
>> read by Open MPI, such as $HOME/.openmpi/mca-params.conf?
>
> I've tried many combinations: only in $HOME/.openmpi/mca-params.conf,
> only on the command line, and both; but none seems to work correctly.
> Nevertheless, what I'd expect is that if something is specified in
> $HOME/.openmpi/mca-params.conf and then specified differently on the
> command line, the latter should be used, I think.

The only difference in putting values in these locations should be the
order of precedence in which they are read. As you stated, values on
the command line override everything else. See
<http://www.open-mpi.org/faq/?category=tuning#setting-mca-params>.
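
[Editorial sketch] The precedence rule above can be illustrated in a few
lines of C. This is only a sketch: resolve_param() is a hypothetical
helper, not an Open MPI function, and the ordering beyond "the command
line wins" (environment next, then the params file, then the built-in
default) is taken from the FAQ linked above as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resolver, NOT part of Open MPI: return the value from the
 * highest-precedence source that is set. NULL means "not set here".
 * Assumed order: command line > environment > params file > default. */
static const char *resolve_param(const char *cmdline, const char *env,
                                 const char *file, const char *dflt)
{
    assert(dflt != NULL);          /* a built-in default always exists */
    if (cmdline != NULL) {
        return cmdline;            /* --mca on the mpirun command line */
    }
    if (env != NULL) {
        return env;                /* e.g. an OMPI_MCA_<name> variable */
    }
    if (file != NULL) {
        return file;               /* e.g. $HOME/.openmpi/mca-params.conf */
    }
    return dflt;
}
```

Under these assumptions, a value given both in the file and on the
command line resolves to the command-line one.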

>> Yes, it does. Specifying the same MCA param twice on the command
>> line results in undefined behavior -- it will only take one of them,
>> and I assume it'll take the first (but I'd have to check the code to
>> be sure).
>
> OK, I can obtain the same behaviour using only one statement:
> --mca oob_tcp_include eth1,lo,eth0,ib0,ib1

FWIW, I traced the history of this code -- it looks like it dates all
the way back to LAM/MPI, where if you specify "--mca foo bar --mca foo
yow", then foo will get the value "bar,yow". So it *is* intended
(albeit undocumented!) behavior. Who knew! :-)
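
[Editorial sketch] The concatenation described above ("bar" then "yow"
giving "bar,yow") can be pictured with a small C helper. This is a
minimal sketch of the behavior only: append_param_value() is a
hypothetical name, and this is not the actual LAM/MPI or Open MPI code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the real Open MPI code: append `value` to the
 * comma-separated list already held in `dst`, the way a repeated
 * "--mca foo ..." is described as being merged.
 * Returns 0 on success, -1 if `dst` is too small. */
static int append_param_value(char *dst, size_t dst_size, const char *value)
{
    assert(dst != NULL && value != NULL);
    size_t used = strlen(dst);
    /* Room needed: existing text, optional comma, new value, terminator. */
    size_t need = used + (used > 0 ? 1 : 0) + strlen(value) + 1;
    if (need > dst_size) {
        return -1;
    }
    if (used > 0) {
        dst[used++] = ',';         /* separate from the previous value */
    }
    strcpy(dst + used, value);     /* fits: checked against dst_size above */
    return 0;
}
```

Starting from an empty buffer, appending "bar" and then "yow" leaves
"bar,yow" in the buffer.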

> Note that using --mca mpi_show_mca_params, what I'm seeing in the
> report is the same for both statements (twice and single):
>
> .....
> [node255:30188] oob_tcp_debug=0
> [node255:30188] oob_tcp_include=eth1,lo,eth0,ib0,ib1
> [node255:30188] oob_tcp_exclude=
> .......

So far, this is all consistent and expected.

>> Could you try with 1.2.3 or 1.2.4 (1.2.4 is the most recent; 1.2.5
>> is due out "soon" -- it *may* get out before the holiday break, but
>> no promises...)?
>
> We have 1.2.3 on another cluster and it shows the same behaviour as
> 1.2.2 .... (BTW the other cluster has the same eth ifaces)

Crud.

>> If you can't upgrade, let me know and I can provide a debugging patch
>> that will give us a little more insight into what is happening on
>> your machines. Thanks.
>
> It is quite difficult for us to upgrade Open MPI now. We have the
> official Cisco packages installed, and as far as I know 1.2.2-1 is
> the only official Cisco Open MPI distribution today ....

Here's a patch to the OMPI 1.2.2 source that adds some printf-style
opal_output() calls in the OOB TCP interface selection logic; they
should show exactly what each process decides. You should be able to
run this with as few as 2 processes to see what the decision-making
process is for each of them.

[11:24] svbu-mpi:/home/jsquyres/openmpi-1.2.2 % diff -u orte/mca/oob/tcp/oob_tcp.c.orig orte/mca/oob/tcp/oob_tcp.c
--- orte/mca/oob/tcp/oob_tcp.c.orig 2007-12-18 11:21:08.000000000 -0800
+++ orte/mca/oob/tcp/oob_tcp.c 2007-12-18 11:22:29.000000000 -0800
@@ -1344,11 +1344,15 @@
          char name[32];
          opal_ifindextoname(i, name, sizeof(name));
          if (mca_oob_tcp_component.tcp_include != NULL &&
-             strstr(mca_oob_tcp_component.tcp_include,name) == NULL)
+             strstr(mca_oob_tcp_component.tcp_include,name) == NULL) {
+             opal_output(0, "TCP OOB skipping %s because it's not in include (%s)\n", name, mca_oob_tcp_component.tcp_include);
              continue;
+         }
          if (mca_oob_tcp_component.tcp_exclude != NULL &&
-             strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL)
+             strstr(mca_oob_tcp_component.tcp_exclude,name) != NULL) {
+             opal_output(0, "TCP OOB skipping %s because it's in exclude (%s)\n", name, mca_oob_tcp_component.tcp_exclude);
              continue;
+         }
          opal_ifindextoaddr(i, (struct sockaddr*)&addr, sizeof(addr));
          if(opal_ifcount() > 1 &&
             opal_ifislocalhost((struct sockaddr*) &addr))
@@ -1356,6 +1360,7 @@
          if(ptr != contact_info) {
              ptr += sprintf(ptr, ";");
          }
+         opal_output(0, "TCP OOB adding interface: %s\n", name);
          ptr += sprintf(ptr, "tcp://%s:%d", inet_ntoa(addr.sin_addr),
                      ntohs(mca_oob_tcp_component.tcp_listen_port));
      }

I attached the patch as well in case my mail client / the mailing list
munges it.
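
[Editorial sketch] For readers who want to experiment with the two
checks the patch instruments, here they are restated as a standalone C
function. iface_selected() is a hypothetical name; its include/exclude
arguments stand in for tcp_include/tcp_exclude, and the later localhost
check from the original loop is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone restatement of the include/exclude tests shown
 * in the patch. NULL for `include` or `exclude` means "param not set".
 * Returns 1 if the interface would be kept, 0 if it would be skipped. */
static int iface_selected(const char *name, const char *include,
                          const char *exclude)
{
    assert(name != NULL);
    /* Skip if an include list is set and the name is not found in it. */
    if (include != NULL && strstr(include, name) == NULL) {
        return 0;
    }
    /* Skip if an exclude list is set and the name is found in it. */
    if (exclude != NULL && strstr(exclude, name) != NULL) {
        return 0;
    }
    return 1;
}
```

Because strstr() does plain substring matching rather than comparing
whole comma-separated tokens, a name such as "eth1" is also found inside
a list that only contains "eth10". Whether that matters for the behavior
reported in this thread is not established here; the debug output from
the patch is what should show it.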

-- 
Jeff Squyres
Cisco Systems