Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] openib btl - fatal errors don't abort the job
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-09-03 08:14:38


On Sep 1, 2010, at 4:47 PM, Steve Wise wrote:

> I was wondering what the logic is behind allowing an MPI job to continue in the presence of a fatal QP error?

It's a feature...?

> Note the "will try to continue" sentence:
>
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
> Local host: escher
> MPI process PID: 19136
> Error number: 1 (IBV_EVENT_QP_FATAL)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
>
> Due to other bugs I'm chasing, I get these sorts of errors, and I notice that the job just hangs even though the connections have been disconnected, the QPs flushed, and all pending WRs completed with status == FLUSH.

Would it be better to make it a fatal error? (I'm thinking probably "yes")

Feel free to submit a patch...
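
As a rough illustration of the behavior change being discussed, here is a minimal sketch of an event policy that treats IBV_EVENT_QP_FATAL as fatal instead of trying to continue. It is not the openib BTL code: the names sketch_event, sketch_action, sketch_classify_event and sketch_handle_event are hypothetical, and the enum is a local stand-in so the sketch compiles without <infiniband/verbs.h>. Only the value 1 for QP_FATAL comes from the error message quoted above; the other value is an assumption.

```c
#include <stdio.h>

/* Hypothetical stand-ins for libibverbs async event types, so this
 * sketch builds without <infiniband/verbs.h>. The value 1 for QP_FATAL
 * matches the "Error number: 1" line in the quoted message; the value
 * used for COMM_EST is an assumption. */
enum sketch_event {
    SKETCH_EVENT_QP_FATAL = 1,
    SKETCH_EVENT_COMM_EST = 4   /* informational event, assumed value */
};

enum sketch_action { SKETCH_CONTINUE = 0, SKETCH_ABORT = 1 };

/* Policy: a fatal QP error aborts; anything else keeps running. */
static enum sketch_action sketch_classify_event(int event_type)
{
    switch (event_type) {
    case SKETCH_EVENT_QP_FATAL:
        return SKETCH_ABORT;
    default:
        return SKETCH_CONTINUE;
    }
}

/* Report the event and return nonzero if the caller should abort the job. */
static int sketch_handle_event(int event_type)
{
    if (sketch_classify_event(event_type) == SKETCH_ABORT) {
        fprintf(stderr, "async event %d is fatal; aborting the job\n",
                event_type);
        return 1;
    }
    fprintf(stderr, "async event %d; will try to continue\n", event_type);
    return 0;
}
```

In a real async event handler, the caller would presumably invoke whatever abort path the job uses when sketch_handle_event() returns nonzero, rather than leaving the processes to hang on flushed QPs.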

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/