On Apr 16, 2006, at 1:29 PM, Lee D. Peterson wrote:
> Thanks for your help. The hanging problem came back again a day ago.
> However, I can now run only if I use either "-mca btl_tcp_if_include
> en0" or "-mca btl_tcp_if_include en1". Using btl_tcp_if_exclude on
> either en0 or en1 doesn't work.
That's very strange. What happens if you run with "-mca
btl_tcp_if_include en0,en1", which will use both devices? The fact
that the exclude option doesn't work makes me wonder if there isn't
another device that appears active somewhere in the cluster. The
most likely suspect on an OS X cluster is a FireWire device that
somehow has sprouted an address and gotten marked as active. You
might want to run "ifconfig -a" on all your nodes and make sure the
output is mostly the same.
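As a rough sketch of that check, assuming you have already collected the "ifconfig -a" text from each node (over ssh, say), something like the Python below would flag any node that carries an IPv4 address on an interface the others lack. The function names, host names and sample output are made up for illustration, and real ifconfig output may differ in detail.

```python
import re

def interfaces_with_ipv4(ifconfig_text):
    """Return the names of interfaces that have an IPv4 ("inet") address."""
    found = set()
    current = None
    for line in ifconfig_text.splitlines():
        # Interface header lines start in column 0, e.g. "en0: flags=..."
        header = re.match(r"^(\S+?):\s+flags=", line)
        if header:
            current = header.group(1)
        # Indented "inet " lines belong to the current interface;
        # "inet6" is skipped because "inet" must be followed by whitespace.
        elif current and re.match(r"^\s+inet\s", line):
            found.add(current)
    return found

def extra_interfaces(outputs):
    """Map host -> interfaces with IPv4 that are not present on every host.

    Expects a non-empty dict of host name -> "ifconfig -a" text.
    """
    per_host = {host: interfaces_with_ipv4(text) for host, text in outputs.items()}
    common = set.intersection(*per_host.values())
    return {host: names - common for host, names in per_host.items() if names - common}

# Hypothetical, hand-written samples standing in for real "ifconfig -a" output.
NODE1 = """\
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    inet 127.0.0.1 netmask 0xff000000
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 2030
"""
# Same as NODE1, except fw0 has picked up an address.
NODE2 = NODE1 + "    inet 169.254.10.20 netmask 0xffff0000 broadcast 169.254.255.255\n"

print(extra_interfaces({"node1": NODE1, "node2": NODE2}))  # {'node2': {'fw0'}}
```

Any interface reported this way is a candidate for btl_tcp_if_exclude, or a reason to stick with an explicit btl_tcp_if_include list.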
> Regarding the TCP performance, I ran the HPL benchmark again and see
> typically 85% to 90% of the LAM-MPI speed, provided the problem size
> isn't too small.
That would make sense - LAM/MPI can exhibit much better latency in
very specific situations than Open MPI (on TCP - on other
interconnects, Open MPI is much faster). We're working on optimizing
our TCP stack, but up until now, the high-speed interconnects have
been the major concern.
Open MPI developer