Open MPI Development Mailing List Archives

Subject: [OMPI devel] openib btl - fatal errors don't abort the job
From: Steve Wise (swise_at_[hidden])
Date: 2010-09-01 16:47:28

I was wondering what the logic is behind allowing an MPI job to continue
after a fatal QP error.

Note the "will try to continue" sentence:

The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

   Local host: escher
   MPI process PID: 19136
   Error number: 1 (IBV_EVENT_QP_FATAL)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.

Due to other bugs I'm chasing, I get these sorts of errors, and I notice
that the job just hangs even though the connections have been disconnected,
the QPs flushed, and all pending work requests (WRs) completed with
status == FLUSH.
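
As a rough illustration of the alternative being asked about (this is not
Open MPI's actual code), the policy "abort on a fatal QP event instead of
trying to continue" can be sketched as a small decision function. Every
name below is hypothetical and defined locally so the sketch compiles
without <infiniband/verbs.h>; the only value taken from the message above
is that IBV_EVENT_QP_FATAL is reported as error number 1.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for async event codes. Only the value 1 for the
 * fatal QP event comes from the error message quoted above; the other
 * value is an arbitrary placeholder for "any other event". */
enum sketch_event {
    SKETCH_EVENT_QP_FATAL = 1,
    SKETCH_EVENT_OTHER    = 100
};

/* What the (hypothetical) event handler tells the caller to do. */
enum sketch_action {
    SKETCH_CONTINUE,
    SKETCH_ABORT
};

/* A fatal QP event means the queue pair is unusable, so this policy
 * treats it as unrecoverable. */
bool sketch_event_is_fatal(enum sketch_event ev)
{
    return ev == SKETCH_EVENT_QP_FATAL;
}

/* Policy under discussion: abort the job on a fatal event rather than
 * printing a warning and continuing, which can leave the job hung. */
enum sketch_action sketch_on_async_event(enum sketch_event ev)
{
    return sketch_event_is_fatal(ev) ? SKETCH_ABORT : SKETCH_CONTINUE;
}
```

In a real handler, the SKETCH_ABORT branch is where the process would
report the error and terminate the job; how Open MPI would do that is
left out here because it is not described in this message.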