Open MPI User's Mailing List Archives

Subject: [OMPI users] problems with MPI_Waitsome/MPI_Startall and OpenMPI on gigabit and IB networks
From: Joe Landman (landman_at_[hidden])
Date: 2008-07-20 10:45:48


Hi folks:

   This is a deeper dive into the code that was giving me fits over the
last two weeks.

   It uses MPI_Startall and MPI_Waitsome (persistent requests) to
launch/monitor progress. More on that in a moment.

   The testing I have done to date on this platform suggests that
OpenMPI is working fine, though I don't normally exercise these two API
functions. Other MPI codes run without problem. The gigabit and IB
networks are operational, with no issues that I can spot.

   The symptoms:

1) Smaller test cases *sometimes* work and sometimes hang. The hang
appears (in strace) to be a tight poll() loop. Changing from the default
btl to tcp,self seems to help the code a little, and the small test jobs
then run repeatedly to conclusion. The same binary on a larger case with
more CPUs (64 vs. 4) does not work, regardless of btl settings.
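For concreteness, the btl variations I've been trying look like the
following (the hostfile and binary names are placeholders, not our real
paths):

```shell
# Force the TCP and self (loopback) BTLs only -- this is what seems to
# help the small cases:
mpirun --mca btl tcp,self -np 4 -hostfile ./hosts ./mytest

# Or exclude specific BTLs (sm, openib) while keeping the rest:
mpirun --mca btl ^sm,openib -np 64 -hostfile ./hosts ./mytest
```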

2) This happens with OpenMPI 1.2.2, 1.2.6, and 1.2.7. I will check other
stacks as well, but my hope is to use OpenMPI due to its nice (sane)
interface to SGE.

3) Using the btl parameter to turn off sm and openib generates lots of these messages:

[c1-8][0,1,4][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-8][0,1,5][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-8][0,1,6][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-5][0,1,24][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-5][0,1,28][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-11][0,1,41][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[c1-11][0,1,45][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113

The FAQ reports that this is a TCP error, and that errno 113 corresponds
to "No route to host" (EHOSTUNREACH).

That diagnosis seems wrong: all the nodes are visible from all the other
nodes on the private subnet. For example:

scalable:~ # pdsh "ping -c 1 c1-8 | grep '64 bytes'"
c1-1: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.126 ms
c1-12: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.067 ms
c1-13: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.127 ms
c1-11: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.084 ms
c1-4: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.090 ms
c1-16: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.116 ms
c1-14: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.076 ms
c1-2: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.113 ms
c1-3: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.065 ms
c1-5: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.127 ms
c1-17: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.046 ms
c1-6: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.073 ms
c1-8: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.020 ms
c1-15: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.109 ms
c1-7: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.075 ms
c1-9: 64 bytes from c1-8.susecluster (192.168.32.8): icmp_seq=1 ttl=64
time=0.098 ms

Basically the problem appears to be that MPI_Waitsome loops forever
because it never sees the posted completions over IB. Over TCP it
appears to have other issues that are problematic but that don't show up
in other tests.

I am not sure if this is a bug in the implementation of MPI_Waitsome,
though the odd behavioral differences between the transports, together
with the scaling observation, suggest some sort of buffer size issue.
Are there any specific things we can do to tweak internal OpenMPI buffer
sizes to experiment with this? Should I rebuild OpenMPI with -O0? Should
I use the Intel compiler for OpenMPI (using gcc 4.1.2 right now)? The
main code is in Fortran and we are using Intel 10.1.015. Are there any
TCP stack issues I should be thinking about to deal with the errno 113
error? (The user would be OK with TCP while we get IB ironed out.)
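On the buffer-size question, the knobs I've found so far look like the
following. The parameter names come from ompi_info on my install; the
sizes are just guesses to experiment with, not recommendations:

```shell
# Enumerate all TCP BTL tunables (names and defaults vary by version):
ompi_info --param btl tcp

# Example experiment: enlarge the TCP socket buffers to 1 MB
# (hostfile/binary are placeholders):
mpirun --mca btl tcp,self \
       --mca btl_tcp_sndbuf 1048576 --mca btl_tcp_rcvbuf 1048576 \
       -np 64 -hostfile ./hosts ./mytest
```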

Please advise.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman_at_[hidden]
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615