Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] openib unloaded before last mem dereg
From: Steve Wise (swise_at_[hidden])
Date: 2013-01-28 21:06:47


On 1/28/2013 7:32 PM, Ralph Castain wrote:
> Out of curiosity, could you tell us how you configured OMPI?

./configure --enable-debug --enable-mpirun-prefix-by-default --prefix=/usr/mpi/gcc/openmpi-1.6.4rc2-dbg

>
> On Jan 28, 2013, at 12:46 PM, Steve Wise <swise_at_[hidden]> wrote:
>
>> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>>> On Jan 28, 2013, at 11:55 AM, Steve Wise <swise_at_[hidden]> wrote:
>>>
>>>> Do you know if the rdmacm CPC is really being used for your connection setup (vs. other CPCs supported by IB)? Cuz iWARP only supports rdmacm. Maybe that's the difference?
>>> Dunno for certain, but I expect it is using the OOB cm since I didn't direct it to do anything different. Like I said, I suspect the problem is that the cluster doesn't have iWARP on it.
>> Definitely, or it could be that the different CPC used for iWARP vs IB is tickling the issue.
>>
>>>> Steve.
>>>>
>>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>>> Nope - still works just fine. I didn't receive that warning at all, and it ran to completion without problem.
>>>>>
>>>>> I suspect the problem is that the system I can use just isn't configured like yours, and so I can't trigger the problem. Afraid I can't be of help after all... :-(
>>>>>
>>>>>
>>>>> On Jan 28, 2013, at 11:25 AM, Steve Wise <swise_at_[hidden]> wrote:
>>>>>
>>>>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>>>>>>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>>>>>>>
>>>>>>> Can you try it with a 1.6.4 tarball and see if you still see the problem? Could be someone already fixed it.
>>>>>> I still hit it on 1.6.4rc2.
>>>>>>
>>>>>> Note iWARP != IB, so you may not have this issue on IB systems for various reasons. Did you use the same mpirun line? Namely, using this:
>>>>>>
>>>>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>>>>>
>>>>>> (adjusted to your network config).
>>>>>>
>>>>>> Because if I don't use ipaddr_include, then I don't see this issue on my setup.
>>>>>>
>>>>>> Also, did you see these logged:
>>>>>>
>>>>>> Right after starting the job:
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>>>> support) will be disabled for this port.
>>>>>>
>>>>>> Local host: hpc-hn1.ogc.int
>>>>>> Local device: cxgb4_0
>>>>>> Local port: 2
>>>>>> CPCs attempted: oob, xoob, rdmacm
>>>>>> --------------------------------------------------------------------------
>>>>>> ...
>>>>>>
>>>>>> At the end of the job:
>>>>>>
>>>>>> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>>>
>>>>>>
>>>>>> I think these are benign, but prolly indicate a bug: the mpirun is restricting the job to use port 1 only, so the CPCs shouldn't be attempting port 2...
>>>>>>
>>>>>> Steve.
>>>>>>
>>>>>>
>>>>>>> On Jan 28, 2013, at 10:03 AM, Steve Wise <swise_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>>>>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <swise_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I'm tracking an issue I see in openmpi-1.6.3. Running this command on my Chelsio iWARP/RDMA setup causes a seg fault every time:
>>>>>>>>>>>
>>>>>>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>>>>>>>>>>>
>>>>>>>>>>> The segfault is during finalization, and I've debugged this to the point where I see a call to dereg_mem() after the openib btl is unloaded via dlclose(). dereg_mem() dereferences a function pointer to call the btl-specific dereg function, in this case openib_dereg_mr(). However, since that btl has already been unloaded, the deref causes a seg fault. Happens every time with the above mpi job.
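(For illustration only, not OMPI code: a minimal standalone sketch of that failure mode, assuming a hypothetical libexample.so exporting example_dereg() as stand-ins for the openib BTL component and openib_dereg_mr(). Once the .so is dlclose()'d, calling through a cached function pointer into it jumps into unmapped memory and typically segfaults, which matches the crash described above.)

    /* dangling.c: call through a function pointer after the shared object
     * that provides it has been unloaded.
     * Build: gcc -o dangling dangling.c -ldl
     * "libexample.so" / "example_dereg" are made-up stand-ins for the
     * openib BTL component and openib_dereg_mr(). */
    #include <dlfcn.h>
    #include <stdio.h>

    typedef int (*dereg_fn_t)(void *reg);

    int main(void)
    {
        void *handle = dlopen("./libexample.so", RTLD_NOW);
        if (!handle) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

        /* Cache a pointer into the module, the way each registration keeps a
         * pointer to its BTL's dereg function. */
        dereg_fn_t dereg = (dereg_fn_t) dlsym(handle, "example_dereg");
        if (!dereg) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

        dlclose(handle);   /* the module's text is typically unmapped here... */

        dereg(NULL);       /* ...so this call lands in unmapped memory and
                            * usually segfaults, like dereg_mem() calling
                            * openib_dereg_mr() after the btl is dlclose()'d. */
        return 0;
    }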
>>>>>>>>>>>
>>>>>>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg fault, and I don't see a call to dereg_mem() after the openib btl is unloaded. That's all well and good. :) But I'd like to get this fix pushed into 1.6 since that is the current stable release.
>>>>>>>>>>>
>>>>>>>>>>> Question: Can someone point me to the fix in 1.7?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Steve.
>>>>>>>>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called, which unloads the openib btl. Then, further down in ompi_mpi_finalize(), mca_mpool_base_close() is called, which ends up calling dereg_mem(), which seg faults trying to call into the unloaded openib btl.
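(In outline, simplified from the description above rather than taken from the 1.6 source, the teardown ordering at fault looks like:

    ompi_mpi_finalize()
      -> mca_pml_base_close()     closes the PML and dlclose()'s the openib BTL component
      -> mca_mpool_base_close()   releases leftover registrations via dereg_mem(), which calls
                                  through a stored pointer to openib_dereg_mr() in the
                                  already-unloaded component -> segfault

which is consistent with the observation above that 1.7 no longer shows a dereg_mem() call after the openib btl is unloaded.)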
>>>>>>>>>>
>>>>>>>>> That definitely sounds like a bug
>>>>>>>>>
>>>>>>>>>> Anybody have thoughts? Anybody care? :)
>>>>>>>>> I care! It needs to be fixed - I'll take a look. Probably something that forgot to be cmr'd.
>>>>>>>> Great! If you want me to try out a fix or gather more debug info, just holler.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Steve.
>>>>>>>>