Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r17307
From: Tim Prins (tprins_at_[hidden])
Date: 2008-02-01 11:40:20


Adrian,

For the most part this seems to work for me, but there are a few issues.
I'm not sure which ones are introduced by this patch, or whether some may
be expected behavior, but for completeness I will point them all out.
First, let me explain that I am working on a machine with 3 TCP interfaces:
lo, eth0, and ib0. Both eth0 and ib0 connect all the compute nodes.

1. There are some warnings when compiling:
btl_tcp_proc.c:171: warning: no previous prototype for 'evaluate_assignment'
btl_tcp_proc.c:206: warning: no previous prototype for 'visit'
btl_tcp_proc.c:224: warning: no previous prototype for 'mca_btl_tcp_initialise_interface'
btl_tcp_proc.c: In function `mca_btl_tcp_proc_insert':
btl_tcp_proc.c:304: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness
btl_tcp_proc.c:313: warning: pointer targets in passing arg 2 of `opal_ifindextomask' differ in signedness
btl_tcp_proc.c:389: warning: comparison between signed and unsigned
btl_tcp_proc.c:400: warning: comparison between signed and unsigned
btl_tcp_proc.c:401: warning: comparison between signed and unsigned
btl_tcp_proc.c:459: warning: ISO C90 forbids variable-size array `a'
btl_tcp_proc.c:459: warning: ISO C90 forbids mixed declarations and code
btl_tcp_proc.c:465: warning: ISO C90 forbids mixed declarations and code
btl_tcp_proc.c:466: warning: comparison between signed and unsigned
btl_tcp_proc.c:480: warning: comparison between signed and unsigned
btl_tcp_proc.c:485: warning: comparison between signed and unsigned
btl_tcp_proc.c:495: warning: comparison between signed and unsigned
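
Most of these warnings have mechanical fixes. The sketch below uses hypothetical helpers (`sum_entries`, `identity_sum`), not the real functions in btl_tcp_proc.c, to illustrate three of them: a file-local function declared `static`, which GCC's -Wmissing-prototypes does not flag (a function other files call would instead get a prototype in a header); loop indices of type `size_t` so they compare against an unsigned bound without a signed/unsigned mismatch; and a heap buffer with all declarations at the top of the block, in place of a C99 variable-size array like `a` at line 459, which C90 rejects.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper.  Declaring it static gives it internal linkage, so
 * -Wmissing-prototypes no longer asks for a previous declaration. */
static int sum_entries(const int *entries, size_t count)
{
    size_t i;          /* unsigned index, same signedness as count */
    int total = 0;

    for (i = 0; i < count; i++) {
        total += entries[i];
    }
    return total;
}

/* Hypothetical C90-clean replacement for a runtime-sized local array.
 * All declarations come first; the buffer holds exactly perm_size ints
 * (malloc takes a byte count, hence the sizeof).  Returns -1 if the
 * allocation fails. */
static int identity_sum(size_t perm_size)
{
    int *a;
    size_t i;
    int result;

    if (perm_size == 0) {
        return 0;      /* nothing to do, and no zero-byte malloc */
    }
    a = malloc(perm_size * sizeof(*a));
    if (a == NULL) {
        return -1;
    }
    for (i = 0; i < perm_size; i++) {
        a[i] = (int) i;
    }
    result = sum_entries(a, perm_size);
    free(a);
    return result;
}
```

For example, `identity_sum(4)` fills the buffer with 0, 1, 2, 3 and returns 6. Note that an array declaration counts elements, not bytes: if only `perm_size` ints are needed, `int a[perm_size * sizeof(int)]` reserves `sizeof(int)` times as many as necessary.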

2. If I exclude all my tcp interfaces, the connection fails properly,
but I do get a malloc request for 0 bytes:
[tprins_at_odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude eth0,ib0,lo -np 2 ./ring_c
malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
<snip>
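
The 0-byte request is reported at btl_tcp_component.c line 844, presumably an allocation sized by the number of interfaces left after filtering. One way to avoid it, sketched here with hypothetical names (`struct addr_entry`, `is_excluded`, `copy_surviving`) rather than the component's real code, is to count the surviving interfaces first and skip the allocation entirely when none remain.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one local TCP interface. */
struct addr_entry {
    char if_name[16];
};

/* Returns 1 if name appears in the exclude list, 0 otherwise. */
static int is_excluded(const char *name, const char *const *exclude,
                       size_t n_exclude)
{
    size_t j;

    for (j = 0; j < n_exclude; j++) {
        if (strcmp(name, exclude[j]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Copies the interfaces that survive the exclude list into a new array.
 * Returns the number kept, or -1 on allocation failure.  If every interface
 * is excluded, *out is NULL and malloc is never called with a size of 0. */
static int copy_surviving(const struct addr_entry *in, size_t n_in,
                          const char *const *exclude, size_t n_exclude,
                          struct addr_entry **out)
{
    size_t i, kept = 0;
    struct addr_entry *buf;

    *out = NULL;
    /* First pass: count survivors so the allocation is sized exactly. */
    for (i = 0; i < n_in; i++) {
        if (!is_excluded(in[i].if_name, exclude, n_exclude)) {
            kept++;
        }
    }
    if (kept == 0) {
        return 0;      /* nothing usable: report it instead of malloc(0) */
    }
    buf = malloc(kept * sizeof(*buf));
    if (buf == NULL) {
        return -1;
    }
    /* Second pass: copy the survivors in their original order. */
    kept = 0;
    for (i = 0; i < n_in; i++) {
        if (!is_excluded(in[i].if_name, exclude, n_exclude)) {
            buf[kept++] = in[i];
        }
    }
    *out = buf;
    return (int) kept;
}
```

With interfaces lo, eth0, ib0 and an exclude list of eth0,ib0,lo, `copy_surviving` returns 0 and leaves `*out` as NULL, so the caller can fail the connection without ever requesting 0 bytes.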

3. If the exclude list does not contain 'lo', or the include list
contains 'lo', the job hangs when using multiple nodes:
[tprins_at_odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude ib0 -np 2 -bynode ./ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect]
connect() failed: Connection refused (111)
<hang>
[tprins_at_odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0,lo -np 2 -bynode ./ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[odin011][1,0][btl_tcp_endpoint.c:619:mca_btl_tcp_endpoint_complete_connect]
connect() failed: Connection refused (111)
<hang>
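
The report doesn't say why including 'lo' hangs multi-node runs. One plausible explanation, offered here only as an assumption, is that a loopback address ends up among the addresses a process offers to peers on other nodes: a connect() to 127.0.0.1 from one node stays on that node, where nothing is listening on the remote process's port, which would fit the "Connection refused (111)" above. A minimal sketch of a guard, with hypothetical helper names, recognizes 127.0.0.0/8 addresses so they can be left out of anything advertised off-node.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the dotted-quad string is an IPv4 loopback address
 * (anything in 127.0.0.0/8), 0 for any other valid IPv4 address,
 * and -1 if the string does not parse. */
static int is_ipv4_loopback(const char *dotted)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, dotted, &addr) != 1) {
        return -1;
    }
    /* s_addr is in network byte order; convert before testing the top octet. */
    return ((uint32_t) ntohl(addr.s_addr) >> 24) == 127 ? 1 : 0;
}

/* Hypothetical filter: counts the addresses worth advertising to peers on
 * other nodes, skipping loopback and unparsable entries. */
static size_t count_advertisable(const char *const *addrs, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (is_ipv4_loopback(addrs[i]) == 0) {
            count++;
        }
    }
    return count;
}
```

Loopback could still be useful between two processes on the same node, so a real fix would more likely restrict it to local peers than drop it outright.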

However, the great news about this patch is that it appears to fix
https://svn.open-mpi.org/trac/ompi/ticket/1027 for me.

Hope this helps,

Tim

Adrian Knoth wrote:
> On Wed, Jan 30, 2008 at 06:48:54PM +0100, Adrian Knoth wrote:
>
>>> What is the real issue behind this whole discussion?
>> Hanging connections.
>> I'll have a look at it tomorrow.
>
> To everybody who's interested in BTL-TCP, especially George and (to a
> minor degree) rhc:
>
> I've integrated what I call "magic address selection code".
> See the comments in r17348.
>
> Can you check
>
> https://svn.open-mpi.org/svn/ompi/tmp-public/btl-tcp
>
> whether it works for you? Read: multi-rail TCP, FNN, whatever is
> important to you.
>
>
> The code is a proof of concept and could use a little tuning (if it
> works at all; over here, it passes all tests).
>
> I vaguely remember that at least Ralph doesn't like
>
> int a[perm_size * sizeof(int)];
>
> where perm_size is dynamically evaluated (read: the array size is
> runtime-dependent).
>
> There are also some large arrays; search for MAX_KERNEL_INTERFACE_INDEX.
> Perhaps it's better to replace them with an appropriate OMPI data
> structure. I don't know what fits best; you guys know the details...
>
>
> So please give the code a try, and if it works, feel free to clean up
> whatever is necessary to match OMPI style, or give me some pointers on
> what to change.
>
>
> I'd like to point you to Thomas' diploma thesis. The PDF explains the
> theory behind the code; it's essentially a rationale. Unfortunately, the
> PDF has some typos, but I think you'll get the idea. It's a graph
> matching algorithm; Chapter 3 covers everything in detail:
>
> http://cluster.inf-ra.uni-jena.de/~adi/peiselt-thesis.pdf
>
>
> HTH
>
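
One way to picture a graph matching over interfaces is as an assignment problem: each (local interface, remote interface) pair gets a weight, and the goal is the one-to-one pairing with the largest total weight. The sketch below solves that by brute force, enumerating every permutation, which is only practical for a handful of interfaces. It is not the code in r17348; the cap `MAX_IFS`, the helper names, and the weight scale are all hypothetical, and Chapter 3 of the thesis describes the actual algorithm and scoring.

```c
#include <stddef.h>
#include <string.h>

#define MAX_IFS 8   /* hypothetical cap: the search below visits n! orders */

/* w[i][j] scores pairing local interface i with remote interface j
 * (hypothetical scale: 0 = unusable, higher = better, never negative). */
typedef struct {
    size_t n;
    int w[MAX_IFS][MAX_IFS];
} weight_matrix;

/* Total weight when local i is paired with remote perm[i]. */
static int score(const weight_matrix *m, const size_t *perm)
{
    size_t i;
    int total = 0;

    for (i = 0; i < m->n; i++) {
        total += m->w[i][perm[i]];
    }
    return total;
}

/* Enumerate all orderings of perm[k..n-1] by swapping, remembering the
 * best complete assignment seen so far. */
static void search(const weight_matrix *m, size_t *perm, size_t k,
                   size_t *best_perm, int *best_score)
{
    size_t i, tmp;

    if (k == m->n) {
        int s = score(m, perm);
        if (s > *best_score) {
            *best_score = s;
            memcpy(best_perm, perm, m->n * sizeof(*perm));
        }
        return;
    }
    for (i = k; i < m->n; i++) {
        tmp = perm[k]; perm[k] = perm[i]; perm[i] = tmp;   /* choose */
        search(m, perm, k + 1, best_perm, best_score);
        tmp = perm[k]; perm[k] = perm[i]; perm[i] = tmp;   /* undo */
    }
}

/* Returns the best total weight (or -1 if n exceeds MAX_IFS) and fills
 * best_perm so that local interface i uses remote best_perm[i]. */
static int best_assignment(const weight_matrix *m, size_t *best_perm)
{
    size_t perm[MAX_IFS];
    size_t i;
    int best = -1;

    if (m->n > MAX_IFS) {
        return -1;
    }
    for (i = 0; i < m->n; i++) {
        perm[i] = i;
        best_perm[i] = i;
    }
    search(m, perm, 0, best_perm, &best);
    return best;
}
```

For two interfaces per node where local 0 can only reach remote 1 and local 1 can only reach remote 0 (weight 2 each), `best_assignment` returns 4 with `best_perm = {1, 0}`. Because the search grows as n!, a polynomial method such as the Hungarian algorithm would be the usual replacement if interface counts grow.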