Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: [OMPI devel] mpool errors fatal
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-13 21:05:52

I just made a change in the mpool base memory hook callback; please see:

In short, I made the error that Lenny discovered (which turned out to
be an ob1 issue, not a memory hooks issue) a fatal error rather than
just a call to opal_output(). So if this error ever happens again,
it'll definitely show up in MTT via a bunch of failed tests (rather
than someone happening to notice some opal_output() messages in the
middle of a run).

I made the error fatal by calling _exit(), though -- quite
ungraceful. The problem is that this is a void-returning callback in
the middle of the memory allocator; there's no way to pass an error up
higher for better handling. Other options include:

1. We could set a global variable, but then we'd have to notice that
global error at some point later -- there's no real guarantee when
exactly that would happen.
2. We could set a zero-time event to fire that would do some better
cleanup/error handling, but that might (will?) call malloc()
(remember: we're in a callback from the memory allocator, so calling
malloc() may do Bad Things).
3. ...?
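
For concreteness, option 1 might look roughly like the sketch below. This is an illustration only, not code from the commit: the flag and both function names are hypothetical, and the `failed` parameter stands in for whatever condition the real callback detects. It also shows the weakness described above: nothing happens until someone calls the check.

```c
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag, not an Open MPI symbol. Setting it is a plain
 * store, so the callback never has to allocate memory. */
static volatile sig_atomic_t mpool_release_failed = 0;

/* Stand-in for the void-returning memory hook callback: it cannot
 * return an error, so it only records that something went wrong. */
void release_callback_sketch(void *buf, size_t len, bool failed)
{
    (void) buf;
    (void) len;
    if (failed) {
        mpool_release_failed = 1;
    }
}

/* Meant to be called later from some safe point outside the allocator
 * (e.g., a progress loop). Returns true and clears the flag if an
 * error was recorded since the last check. */
bool check_deferred_error(void)
{
    if (mpool_release_failed) {
        mpool_release_failed = 0;
        return true;
    }
    return false;
}
```

The gap between the callback setting the flag and the next call to check_deferred_error() is unbounded, which is exactly the "no real guarantee" problem.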

However, I think that if this situation arises, we're in a bad place
anyway -- perhaps the sanest thing to do is just exit immediately.
"Better" error handling might have us exit a bit more cleanly (e.g.,
do some cleanup first), but _exit() will get the job done. And then
ORTE takes over and kills the rest of the job.

*** Note that the old code was calling opal_output() to print the
message, which might (will?) call malloc() anyway, so Bad Things could
well have happened. Meaning that the message may not have actually
gotten printed out -- yoinks. So the "print the message" code had to
be updated anyway. I think the only controversial point in this
commit is that I called _exit().
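
As a rough illustration of the "print without malloc(), then _exit()" idea (not the actual code from the commit; write_all and fatal_from_memory_hook are hypothetical names), a sketch using only write(2) might look like:

```c
#include <string.h>
#include <unistd.h>

/* Write a NUL-terminated message straight to a file descriptor.
 * write(2) bypasses stdio, so there are no buffers to allocate.
 * Returns the number of bytes actually written. */
size_t write_all(int fd, const char *msg)
{
    size_t total = 0;
    size_t len = strlen(msg);   /* strlen() does not allocate */

    while (total < len) {
        ssize_t n = write(fd, msg + total, len - total);
        if (n <= 0) {
            break;              /* give up on any error; we are dying anyway */
        }
        total += (size_t) n;
    }
    return total;
}

/* Report a fatal error from inside an allocator callback and
 * terminate immediately. _exit() skips atexit() handlers and stdio
 * flushing, so it will not re-enter the allocator. */
void fatal_from_memory_hook(const char *msg)
{
    write_all(STDERR_FILENO, msg);
    _exit(1);
}
```

The message has to be a fixed string (or be formatted into a static buffer ahead of time), since printf-style formatting at this point could itself end up in malloc().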

Comments? Or is calling _exit() the least sucky thing to do here?

Jeff Squyres
Cisco Systems