Good suggestion: increasing the timeout to around 12 allowed the
job to finish. Initial experimentation showed that I could get a
3-4x improvement in performance using even larger timeouts,
matching the times for 64 processes and 1/4 the data set. The
cluster is presently having scheduler issues, so I'll post again
if I find anything else interesting.
> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)" <pasha_at_[hidden]>
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers <devel_at_[hidden]>
> Message-ID: <469C6C64.4040709_at_[hidden]>
> Try increasing the IB timeout parameter: --mca btl_mvapi_ib_timeout 14
> If 14 does not work, try increasing it a little more (16).
> Neil Ludban wrote:
> > Hi,
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation. It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node. If anyone wants
> > any additional information or has suggestions to try, please let me
> > know. Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> > -Neil
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> > from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> > mca_btl_mvapi_component_progress] from c069 error polling HP
> > CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> > 0x2ab6590200 to: c078 error polling HP CQ with status
> > VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --------------------------------------------------------------------------
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38):
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> > Note that two mca parameters are involved here:
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry (defaulted to 7, the maximum value).
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). The
> > actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
> > See InfiniBand Spec 1.2 (12.7.34) for more details.
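
For anyone comparing settings, the timeout formula from the help text
quoted above can be tabulated with a short script. This is only a sketch
in Python: the function name ack_timeout_us is my own and not part of
Open MPI, and it evaluates nothing more than
4.096 microseconds * 2^btl_mvapi_ib_timeout for the default (10) and the
values mentioned in this thread (12, 14, 16).

```python
# Sketch: evaluate the local ACK timeout formula from the Open MPI help text,
#   timeout = 4.096 microseconds * 2^btl_mvapi_ib_timeout
# ack_timeout_us is a hypothetical helper name, not an Open MPI API.

BASE_US = 4.096  # base unit in microseconds, taken from the quoted formula


def ack_timeout_us(exponent: int) -> float:
    """Return the timeout in microseconds for a given parameter value."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return BASE_US * (2 ** exponent)


if __name__ == "__main__":
    # 10 is the default; 12, 14 and 16 are the values discussed in the thread.
    for exponent in (10, 12, 14, 16):
        microseconds = ack_timeout_us(exponent)
        print(f"btl_mvapi_ib_timeout={exponent:2d}: "
              f"{microseconds / 1000.0:8.3f} ms")
```

By this formula each step of 1 in the parameter doubles the timeout, so
going from the default of 10 to 14 raises it from roughly 4.2 ms to
roughly 67 ms.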