Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openib segfaults with Torque
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-06-10 14:06:54


Jeff/Nathan,

I ran the following with my debug build of Open MPI 1.8.1, after opening a terminal on a compute node with "qsub -l nodes=2 -I":

        mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt

Output and backtrace are attached. Let me know if I can provide anything else.

Thanks for looking into this,
Greg

-----Original Message-----
From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff Squyres (jsquyres)
Sent: Tuesday, June 10, 2014 10:31 AM
To: Nathan Hjelm
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Greg:

Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?

On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>>
>> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:
>>
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single
>> server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified
>> itself early in the initialization process), and something in the
>> finalization process didn't take that into account
>>
>> Nathan -- is that anywhere close to correct?
>
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure. Probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers and starts
> monitoring the fd for the completion channel. All these things are
> also done in the setup of the openib btl itself. Keep in mind that the
> openib btl will not disqualify itself when running on a single server.
> Openib may be used to communicate on-node and is needed for the dynamics case.
>
> The user might try adding -mca btl_base_verbose 100 to shed some light
> on what the real issue is.
>
> BTW, I no longer monitor the user mailing list. If something needs my
> attention, forward it to me directly.
>
> -Nathan
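
The failure mode Nathan describes (a finalize routine destructing a list whose constructor never ran, because init failed early) can be illustrated with a minimal, self-contained C sketch. This is not the Open MPI source: every name below (sketch_list_t, sketch_module_init, and so on) is hypothetical, and the "magic" field only loosely imitates the kind of debug-build sanity check that would trip an assert. It shows the rearrangement he mentions, constructing objects before any step that can fail, so that finalize is safe on every error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical marker value; stands in for a debug-build object check. */
#define SKETCH_MAGIC 0x5ca1ab1eU

/* Hypothetical list type: only tracks whether it was constructed. */
typedef struct {
    unsigned magic;
    int length;
} sketch_list_t;

static void sketch_list_construct(sketch_list_t *list)
{
    list->magic = SKETCH_MAGIC;
    list->length = 0;
}

static void sketch_list_destruct(sketch_list_t *list)
{
    /* Destructing a never-constructed list fails here in a debug build. */
    assert(list->magic == SKETCH_MAGIC);
    list->magic = 0;
}

/* Hypothetical module state: one list plus a flag for other resources. */
typedef struct {
    sketch_list_t pending;
    bool resources_ready;
} sketch_module_t;

static int sketch_module_init(sketch_module_t *mod, bool simulate_failure)
{
    memset(mod, 0, sizeof(*mod));

    /* Construct objects first, before anything that can fail. */
    sketch_list_construct(&mod->pending);

    /* Stand-in for resource setup that may fail (assumption). */
    if (simulate_failure) {
        return -1;
    }
    mod->resources_ready = true;
    return 0;
}

static void sketch_module_finalize(sketch_module_t *mod)
{
    /* Release only what was actually acquired. */
    if (mod->resources_ready) {
        mod->resources_ready = false;
    }
    /* Safe even after a failed init, because construction came first. */
    sketch_list_destruct(&mod->pending);
}

/* Returns 1 if init failed and cleanup still completed, 0 otherwise. */
int sketch_demo(void)
{
    sketch_module_t mod;
    if (sketch_module_init(&mod, true) != 0) {
        sketch_module_finalize(&mod);
        return 1;
    }
    sketch_module_finalize(&mod);
    return 0;
}
```

If the sketch_list_construct call were moved below the simulated failure, the same sketch_demo would hit the assert in sketch_list_destruct. Note that this ordering only makes the error path safe; it does not explain why the setup step fails in the first place, which is the deeper question raised above.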

--
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/