On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
> I seem to recall that you have an IB-based cluster, right?
> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:
> - you run the job on a single server
> - openib disqualifies itself because you're running on a single server
> - openib then goes to finalize/close itself
> - but openib didn't fully initialize itself (because it disqualified itself early in the initialization process), and something in the finalization process didn't take that into account
> Nathan -- is that anywhere close to correct?
Nope. udcm_module_finalize is being called because there was an error
setting up the udcm state. See btl_openib_connect_udcm.c:476. The
opal_list_t destructor is getting an assert failure. Probably because
the constructor wasn't called. I can rearrange the constructors to be
called first but there appears to be a deeper issue with the user's
system: udcm_module_init should not be failing! It creates a couple of
CQs, allocates a small number of registered buffers and starts
monitoring the fd for the completion channel. All these things are also
done in the setup of the openib btl itself. Keep in mind that the openib
btl will not disqualify itself when running on a single server. Openib may be
used to communicate on-node and is needed for the dynamics case.
The user might try adding -mca btl_base_verbose 100 to shed some
light on what the real issue is.
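To illustrate the constructor-ordering fix mentioned above, here is a minimal
sketch. It does not use the real OPAL object system or the actual udcm code;
every name in it (toy_list_t, toy_module_init, toy_module_finalize, the magic
value) is hypothetical. It only shows the idea: construct the list objects
before any step that can fail, so that a finalize called from the error path
always sees constructed objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_LIST_MAGIC 0xC0FFEEu

/* Hypothetical stand-in for a list object with a constructor/destructor. */
typedef struct {
    unsigned magic;   /* set by the constructor, checked by the destructor */
    int length;
} toy_list_t;

static void toy_list_construct(toy_list_t *l)
{
    l->magic = TOY_LIST_MAGIC;
    l->length = 0;
}

static void toy_list_destruct(toy_list_t *l)
{
    /* Mimics a debug-build check: destructing an object that was never
     * constructed trips this assert. */
    assert(l->magic == TOY_LIST_MAGIC);
    l->magic = 0;
}

/* Hypothetical module state: one list plus a flag for later resources. */
typedef struct {
    toy_list_t pending;
    bool resources_ready;
} toy_module_t;

static void toy_module_finalize(toy_module_t *m)
{
    if (m->resources_ready) {
        m->resources_ready = false;   /* release resources here */
    }
    toy_list_destruct(&m->pending);   /* safe: always constructed in init */
}

static int toy_module_init(toy_module_t *m, bool simulate_failure)
{
    memset(m, 0, sizeof(*m));

    /* Construct objects first, before anything that can fail. */
    toy_list_construct(&m->pending);

    if (simulate_failure) {           /* stands in for a failed setup step */
        toy_module_finalize(m);       /* error path reuses finalize */
        return -1;
    }

    m->resources_ready = true;
    return 0;
}
```

In this sketch, calling toy_module_init(&m, true) returns -1 without tripping
the assert. If toy_list_construct were moved below the failure check, the same
call would abort in toy_list_destruct in a debug build.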
BTW, I no longer monitor the user mailing list. If something needs my
attention, forward it to me directly.