
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2006-06-29 13:26:51


Sorry for the delay in replying -- sometimes we just get overwhelmed
with all the incoming mail. :-(

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Tony Ladd
> Sent: Saturday, June 17, 2006 9:47 AM
> To: users_at_[hidden]
> Subject: [OMPI users] mca_btl_tcp_frag_send: writev failed
> with errno=110
>
> I am getting the following error with openmpi-1.1b1
>
> mca_btl_tcp_frag_send: writev failed with errno=110

Can you try this with the final released version of 1.1, just to see if
the problem still exists?

110 = ETIMEDOUT, which seems like a strange error to get here, because
the TCP connection should have already been made.
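
To double-check what a numeric errno means on a given machine, something like the short Python sketch below can help. It is only an illustration: `describe_errno` is a made-up helper name, and errno numbering is platform-specific (110 is ETIMEDOUT on Linux, but other systems may use a different number).

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for a numeric errno on the current platform."""
    # errno.errorcode maps numbers to symbolic names for this platform only.
    name = errno.errorcode.get(code, "UNKNOWN")
    return "{}: {}".format(name, os.strerror(code))


if __name__ == "__main__":
    # Look the value up by symbol so the example does not depend on
    # the Linux-specific number 110.
    print(describe_errno(errno.ETIMEDOUT))
```

Running it on the node that reported the failure should show whether 110 really maps to ETIMEDOUT there.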
 
> 1) This does not ever happen with other MPIs I have tried, like MPICH
> and LAM.
> 2) It only seems to happen with large numbers of cpus, 32 and
> occasionally 16, and with larger message sizes. In this case it was
> 128K.
> 3) It only seems to happen with dual cpus on each node.
> 4) My configuration is default with (in openmpi-mca-params.conf):
> pls_rsh_agent = rsh
> btl = tcp,self
> btl_tcp_if_include = eth1
> I also set --mca btl_tcp_eager_limit 131072 when running the program,
> though leaving this out does not eliminate the problem.
>
> My program is a communication test; it sends bidirectional point to
> point messages among N cpus. In one test it exchanges messages between
> pairs of cpus, in another it reads from the node on its left and sends
> to the node on its right (a kind of ring), and in a third it uses
> MPI_ALL_REDUCE.

Can you share your code and give a recipe for replicating the problem?
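
For reference, here is how we read the first two patterns, written as a small Python sketch that only computes which rank talks to which. This is a guess at the description above, not the actual test program: the function names are hypothetical, and the pairing rule (0 with 1, 2 with 3, and so on) is an assumption, since the original does not say how pairs are chosen.

```python
def pair_partner(rank, size):
    """Partner for a pairwise exchange: 0<->1, 2<->3, ...

    Returns None for the last rank when size is odd (it has no partner).
    The pairing rule itself is an assumption made for this sketch.
    """
    partner = rank ^ 1  # flip the lowest bit: even ranks pair upward, odd downward
    return partner if partner < size else None


def ring_neighbors(rank, size):
    """(left, right) neighbors on a ring: receive from left, send to right."""
    return (rank - 1) % size, (rank + 1) % size


if __name__ == "__main__":
    size = 4
    for rank in range(size):
        print(rank, pair_partner(rank, size), ring_neighbors(rank, size))
```

If the real test pairs ranks differently (for example, rank i with rank i + N/2), that would change which node pairs are exercised, so the exact scheme matters for reproducing the failure.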
 
> Finally: the tcp driver in openmpi seems not nearly as good as the one
> in LAM. I got higher throughput with far fewer dropouts with LAM.

This is unfortunately a known issue. The reason for it is that all the
current Open MPI members concentrate mainly on high-speed networks such
as InfiniBand, shared memory, and Myrinet. TCP *works*, and so far that
has been "good enough," but we're all aware that it still needs to be
optimized.

The issue is actually not the protocols that we're using over TCP.
We're pretty sure that it has to do with how Open MPI's file descriptor
progression engine works (disclaimer: we haven't spent a lot of time
trying to characterize this since we've been focusing on the high-speed
networks, but we're pretty sure that this is the big issue).

Internally, we use the software package "libevent" as an engine for fd
and signal progress, but there are some cases that seem to be somewhat
inefficient. We use this progression engine (as opposed to, say, a
dedicated socket state machine in the TCP BTL itself) because we need to
make progress on both the MPI TCP communications and the underlying
run-time environment (ORTE) TCP communications. Hence, we needed a
central "engine" that can handle both.
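
To make the "central engine" idea concrete, here is a minimal sketch of one event loop servicing descriptors that belong to two different subsystems. It uses Python's standard `selectors` module purely as a stand-in; it is not libevent and not Open MPI code, and the "mpi" and "rte" labels, the handler names, and the socket pairs are all invented for the illustration.

```python
import selectors
import socket


def run_shared_engine():
    """Drive two unrelated connections from a single event loop."""
    sel = selectors.DefaultSelector()
    log = []

    # Two local socket pairs stand in for an MPI data connection and a
    # run-time environment control connection (both hypothetical here).
    mpi_tx, mpi_rx = socket.socketpair()
    rte_tx, rte_rx = socket.socketpair()

    def make_handler(label):
        def handler(sock):
            data = sock.recv(4096)
            log.append("{}:{}".format(label, data.decode()))
        return handler

    # Each subsystem registers its own callback with the shared engine.
    sel.register(mpi_rx, selectors.EVENT_READ, make_handler("mpi"))
    sel.register(rte_rx, selectors.EVENT_READ, make_handler("rte"))

    mpi_tx.sendall(b"payload")
    rte_tx.sendall(b"heartbeat")

    # One loop makes progress on both; bounded so it cannot spin forever.
    for _ in range(10):
        if len(log) >= 2:
            break
        for key, _mask in sel.select(timeout=1.0):
            key.data(key.fileobj)  # dispatch to the registered callback

    sel.close()
    for s in (mpi_tx, mpi_rx, rte_tx, rte_rx):
        s.close()
    return sorted(log)


if __name__ == "__main__":
    print(run_shared_engine())
```

The trade-off this shows is the one described above: every readable descriptor goes through the same generic dispatch path, which is convenient for sharing but adds overhead compared with a loop written specifically for the TCP data path.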

This is an area where we would love to get some outside help -- it's not
so much a network issue as, more likely, a systems issue. None of us
currently has engineering resources to spend on this; is there anyone
out there in the open source community who could help? If so, we can
provide more details on where we think the bottlenecks are, etc.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems