After taking a look at how epoll is implemented in the Linux kernel, I
can say with 100% certainty that BLCR will not restore the epoll fd
correctly. I hope to fix that eventually, but I have too many other
things on my plate to address it now.
Since I cannot promise how soon BLCR may be able to resolve this
problem, I suggest that Josh continue exploring the alternatives. At
least setting "opal_event_include" to "poll" appears to work. It is not
clear to me whether the "select" problem is related to BLCR or not.
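For reference, a minimal sketch of selecting the "poll" backend, assuming
the usual Open MPI convention that an MCA parameter can be set through an
environment variable named with the OMPI_MCA_ prefix. The application
name ./noop in the comment is hypothetical.

```shell
# Ask libevent (via OPAL) to use poll instead of the default backend.
# Assumes the standard OMPI_MCA_<param> environment-variable convention.
export OMPI_MCA_opal_event_include=poll

# Equivalent per-run form on the command line (./noop is a placeholder):
#   mpirun --mca opal_event_include poll -np 2 ./noop
```

The environment variable applies to every later run in that shell, while
the command-line form affects only the one job.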
I am guessing that I don't get a say as to whether the BLCR/epoll
problems should delay the libevent merge, but I trust the rest of you to
determine what is in the best interest of OMPI.
-Paul
Josh Hursey wrote:
> I have some more data from the field.
>
> Leaving "opal_event_include" unset (Default) BLCR would give me the
> following error when trying to restart a 2 process 'noop' MPI
> application:
> ----------------------------
> shell$ ompi-restart ompi_global_snapshot_8587.ckpt
> Restart failed: Bad file descriptor
> Restart failed: Bad file descriptor
> shell$
> ----------------------------
[snip]
--
Paul H. Hargrove PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900