Open MPI Development Mailing List Archives


From: Neil Ludban (nludban_at_[hidden])
Date: 2007-07-18 14:44:48


Good suggestion: increasing the timeout to somewhere around 12
allowed the job to finish. Initial experimentation showed that
I could get a 3-4x improvement in performance using even larger
timeouts, matching the times for 64 processes and 1/4 the data
set. The cluster is presently having scheduler issues; I'll post
again if I find anything else interesting.
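
For anyone tuning the same thing, the parameter from Pasha's reply below
is passed at job launch; roughly like this (the executable name and -np
value are placeholders here, not our actual job):

    mpirun -np 64 --mca btl_mvapi_ib_timeout 12 ./transpose

The btl_mvapi_ib_retry_count parameter mentioned in the error text at the
end of this message can be set the same way.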

Thanks-
-Neil

> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)" <pasha_at_[hidden]>
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers <devel_at_[hidden]>
> Message-ID: <469C6C64.4040709_at_[hidden]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
> Try increasing the IB timeout parameter: --mca btl_mvapi_ib_timeout 14
> If 14 does not work, try increasing it a little more (16).
>
> Thanks,
> Pasha
>
> Neil Ludban wrote:
> > Hi,
> >
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation. It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node. If anyone wants
> > any additional information or has suggestions to try, please let me
> > know. Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> >
> > -Neil
> >
> >
> >
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> > from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> > mca_btl_mvapi_component_progress] from c069 error polling HP
> > CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> > 0x2ab6590200 to: c078 error polling HP CQ with status
> > VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --------------------------------------------------------------------------
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38):
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> >
> > Note that two mca parameters are involved here:
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry (defaulted to 7, the maximum value).
> >
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). The
> > actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
> > See InfiniBand Spec 1.2 (12.7.34) for more details.
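
To put numbers on the formula quoted above, here is a quick sketch in C
(the helper name is just for illustration); 10 is the default, 12 is what
finally let our job finish, and 14/16 are Pasha's suggestions:

    #include <stdio.h>

    /* Local ACK timeout per InfiniBand Spec 1.2 (12.7.34):
     * actual timeout = 4.096 microseconds * 2^exponent,
     * where the exponent is the btl_mvapi_ib_timeout MCA parameter. */
    static double ib_ack_timeout_usec(unsigned int exponent)
    {
        return 4.096 * (double)(1u << exponent);
    }

    int main(void)
    {
        unsigned int exponents[] = { 10, 12, 14, 16 };
        for (size_t i = 0; i < sizeof(exponents) / sizeof(exponents[0]); i++) {
            printf("btl_mvapi_ib_timeout = %2u  ->  %10.1f us (%6.1f ms)\n",
                   exponents[i],
                   ib_ack_timeout_usec(exponents[i]),
                   ib_ack_timeout_usec(exponents[i]) / 1000.0);
        }
        return 0;
    }

This prints roughly 4.2 ms for the default of 10, 16.8 ms for 12, 67.1 ms
for 14, and 268.4 ms for 16, which is consistent with the larger values
giving each retry enough time to succeed.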