Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?
On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:
> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified itself early in the initialization process), and something in the finalization process didn't take that into account
>> Nathan -- is that anywhere close to correct?
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure. Probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers and starts
> monitoring the fd for the completion channel. All these things are also
> done in the setup of the openib btl itself. Keep in mind that the openib
> btl will not disqualify itself when running on a single server. Openib may be
> used to communicate on-node and is needed for the dynamics case.
> The user might try adding -mca btl_base_verbose 100 to shed some
> light on what the real issue is.
> BTW, I no longer monitor the user mailing list. If something needs my
> attention, forward it to me directly.
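
For anyone following along, here is a minimal C sketch of the ordering problem Nathan describes. It is not the Open MPI code: every name in it (toy_list_t, toy_module_init_late, toy_module_init_early, TOY_MAGIC, and so on) is made up for illustration, and the magic-number check only stands in for the assertion that a debug build's list destructor would hit. The "early" variant shows the rearrangement he mentions, assuming it amounts to constructing the list before any step that can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAGIC 0xC0FFEEu   /* hypothetical marker written by the constructor */

typedef struct {
    unsigned magic;           /* TOY_MAGIC once constructed, 0 otherwise */
    int length;
} toy_list_t;

typedef struct {
    toy_list_t pending;
    bool finalize_clean;      /* did teardown see a properly constructed list? */
} toy_module_t;

static void toy_list_construct(toy_list_t *list)
{
    assert(list != NULL);
    list->magic = TOY_MAGIC;
    list->length = 0;
}

/* Returns false where a debug build would fail an assert instead. */
static bool toy_list_destruct(toy_list_t *list)
{
    if (list->magic != TOY_MAGIC) {
        return false;         /* destructor ran on a never-constructed object */
    }
    list->magic = 0;
    return true;
}

static void toy_module_finalize(toy_module_t *module)
{
    module->finalize_clean = toy_list_destruct(&module->pending);
}

/* Problem ordering: resource setup can fail before the list is constructed. */
static int toy_module_init_late(toy_module_t *module, bool setup_ok)
{
    memset(module, 0, sizeof(*module));
    if (!setup_ok) {
        toy_module_finalize(module);   /* error path tears down a raw list */
        return -1;
    }
    toy_list_construct(&module->pending);
    return 0;
}

/* Rearranged ordering: construct first, so the error path is always safe. */
static int toy_module_init_early(toy_module_t *module, bool setup_ok)
{
    memset(module, 0, sizeof(*module));
    toy_list_construct(&module->pending);
    if (!setup_ok) {
        toy_module_finalize(module);
        return -1;
    }
    return 0;
}
```

Calling toy_module_init_late(&m, false) leaves m.finalize_clean false, while toy_module_init_early(&m, false) leaves it true. Note that this only addresses the assert failure during cleanup. It says nothing about why the setup step fails in the first place, which is what the verbose output should tell us.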