
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openib segfaults with Torque
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-06-10 14:06:54


Jeff/Nathan,

I ran the following with my debug build of Open MPI 1.8.1, after opening an interactive session on a compute node with "qsub -l nodes 2 -I":

        mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt

Output and backtrace are attached. Let me know if I can provide anything else.

Thanks for looking into this,
Greg

-----Original Message-----
From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff Squyres (jsquyres)
Sent: Tuesday, June 10, 2014 10:31 AM
To: Nathan Hjelm
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Greg:

Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?

On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>>
>> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue. That is:
>>
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified
>>   itself early in the initialization process), and something in the
>>   finalization process didn't take that into account
>>
>> Nathan -- is that anywhere close to correct?
>
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure, probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first, but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers, and starts
> monitoring the fd for the completion channel. All these things are
> also done in the setup of the openib btl itself. Keep in mind that the
> openib btl will not disqualify itself when running on a single server.
> Openib may be used to communicate on-node and is needed for the dynamics case.
>
> The user might try adding -mca btl_base_verbose 100 to shed some light
> on what the real issue is.
>
> BTW, I no longer monitor the user mailing list. If something needs my
> attention, forward it to me directly.
>
> -Nathan
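
To illustrate the failure mode Nathan describes (a destructor running on a list whose constructor was never called), here is a minimal, self-contained C sketch. It does not use the real OPAL object system: list_t, module_t, LIST_MAGIC, and every function name below are hypothetical stand-ins, and the magic-number check only imitates the kind of assertion a debug build might perform. It is a sketch of the ordering problem under those assumptions, not the actual udcm code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical marker written by the constructor; not an Open MPI value. */
#define LIST_MAGIC 0x5eedf00dU

typedef struct {
    unsigned magic;  /* equals LIST_MAGIC only after list_construct() */
    int length;
} list_t;

typedef struct {
    list_t pending;  /* stand-in for a list owned by the module */
} module_t;

static void list_construct(list_t *l)
{
    assert(l != NULL);
    l->magic = LIST_MAGIC;
    l->length = 0;
}

/* Returns false where a debug build would trip an assertion. */
static bool list_destruct(list_t *l)
{
    if (l->magic != LIST_MAGIC) {
        return false;  /* destructor ran on a never-constructed list */
    }
    l->magic = 0;
    return true;
}

static bool module_finalize(module_t *m)
{
    return list_destruct(&m->pending);
}

/* Buggy ordering: setup can fail before the list has been constructed. */
static bool module_init_buggy(module_t *m, bool setup_fails, bool *finalize_ok)
{
    memset(m, 0, sizeof(*m));
    if (setup_fails) {
        *finalize_ok = module_finalize(m);
        return false;
    }
    list_construct(&m->pending);
    return true;
}

/* Rearranged ordering: construct first, so cleanup is always safe. */
static bool module_init_fixed(module_t *m, bool setup_fails, bool *finalize_ok)
{
    memset(m, 0, sizeof(*m));
    list_construct(&m->pending);
    if (setup_fails) {
        *finalize_ok = module_finalize(m);
        return false;
    }
    return true;
}
```

With setup_fails set to true, module_init_buggy reports finalize_ok as false (the simulated assert), while module_init_fixed reports it as true. Note that reordering only makes the cleanup path safe; it does not explain why the setup step failed in the first place, which is the deeper question raised above.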

--
Jeff Squyres
jsquyres_at_[hidden]