Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r26077 (fwd)
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2012-03-01 18:46:43


I can confirm that neither leak is causing my IMB hang. Unless there is another frag leak somewhere (I haven't found one), the lockup was simply due to running out of registered memory. So I see no need to push for a 1.4.6 unless a BTL other than ugni hits the bug.

Setting an rcache limit doesn't eliminate the hang. I will continue to investigate next week.

-Nathan

On Thu, 1 Mar 2012, Jeffrey Squyres wrote:

> ...or in 1.5.5.
>
> How soon will you be able to tell if it fixes some hangs?
>
>
> On Mar 1, 2012, at 10:56 AM, Nathan Hjelm wrote:
>
>> Found a pretty nasty frag leak (and a minor one) in ob1 (see commit below). If this fix addresses some hangs we are seeing on InfiniBand, LANL might want a 1.4.6 rolled (or a faster rollout for 1.6.0).
>>
>> -Nathan
>>
>> ---------- Forwarded message ----------
>> Date: Thu, 1 Mar 2012 08:53:39 -0700
>> From: hjelmn_at_[hidden]
>> Reply-To: devel_at_[hidden]
>> To: svn_at_[hidden]
>> Subject: [OMPI svn] svn:open-mpi r26077
>>
>> Author: hjelmn
>> Date: 2012-03-01 10:53:39 EST (Thu, 01 Mar 2012)
>> New Revision: 26077
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/26077
>>
>> Log:
>> ob1: fix two fragment leaks
>> - MAJOR! get src descriptor leaks if mca_bml_base_send fails
>> - minor. descriptor leaked in mca_pml_send_request_start_copy if the btl returns OMPI_ERR_RESOURCE_BUSY.
>> Text files modified:
>> trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c | 27 ++++++++++++++++-----------
>> 1 files changed, 16 insertions(+), 11 deletions(-)
>>
>> Modified: trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c
>> ==============================================================================
>> --- trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c (original)
>> +++ trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c 2012-03-01 10:53:39 EST (Thu, 01 Mar 2012)
>> @@ -1,3 +1,4 @@
>> +/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
>> /*
>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> * University Research and Technology
>> @@ -12,6 +13,8 @@
>> * Copyright (c) 2008 UT-Battelle, LLC. All rights reserved.
>> * Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
>> * Copyright (c) 2012 NVIDIA Corporation. All rights reserved.
>> + * Copyright (c) 2012 Los Alamos National Security, LLC. All rights
>> + * reserved.
>> * $COPYRIGHT$
>> *
>> * Additional copyrights may follow
>> @@ -546,15 +549,14 @@
>> }
>> return OMPI_SUCCESS;
>> }
>> - switch(OPAL_SOS_GET_ERROR_CODE(rc)) {
>> - case OMPI_ERR_RESOURCE_BUSY:
>> - /* No more resources. Allow the upper level to queue the send */
>> - rc = OMPI_ERR_OUT_OF_RESOURCE;
>> - break;
>> - default:
>> - mca_bml_base_free(bml_btl, des);
>> - break;
>> +
>> + if (OMPI_ERR_RESOURCE_BUSY == OPAL_SOS_GET_ERROR_CODE(rc)) {
>> + /* No more resources. Allow the upper level to queue the send */
>> + rc = OMPI_ERR_OUT_OF_RESOURCE;
>> }
>> +
>> + mca_bml_base_free (bml_btl, des);
>> +
>> return rc;
>> }
>>
>> @@ -631,7 +633,7 @@
>> * operation is achieved.
>> */
>>
>> - mca_btl_base_descriptor_t* des;
>> + mca_btl_base_descriptor_t *des, *src = NULL;
>> mca_btl_base_segment_t* segment;
>> mca_pml_ob1_hdr_t* hdr;
>> bool need_local_cb = false;
>> @@ -640,7 +642,6 @@
>> bml_btl = sendreq->req_rdma[0].bml_btl;
>> if((sendreq->req_rdma_cnt == 1) && (bml_btl->btl_flags & (MCA_BTL_FLAGS_GET | MCA_BTL_FLAGS_CUDA_GET))) {
>> mca_mpool_base_registration_t* reg = sendreq->req_rdma[0].btl_reg;
>> - mca_btl_base_descriptor_t* src;
>> size_t i;
>> size_t old_position = sendreq->req_send.req_base.req_convertor.bConverted;
>>
>> @@ -781,6 +782,10 @@
>> return OMPI_SUCCESS;
>> }
>> mca_bml_base_free(bml_btl, des);
>> + if (NULL != src) {
>> + mca_bml_base_free (bml_btl, src);
>> + }
>> +
>> return rc;
>> }
>>
>> @@ -1144,7 +1149,7 @@
>> 0,
>> &frag->rdma_length,
>> MCA_BTL_DES_FLAGS_BTL_OWNERSHIP |
>> - MCA_BTL_DES_FLAGS_PUT,
>> + MCA_BTL_DES_FLAGS_PUT,
>> &des );
>>
>> if( OPAL_UNLIKELY(NULL == des) ) {
>> _______________________________________________
>> svn mailing list
>> svn_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
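
For readers following the r26077 log message, the two leaks can be illustrated with a small stand-alone sketch. This is not the ob1 code: every name below (sketch_alloc, sketch_free, sketch_send, the SKETCH_* codes, start_copy_leaky, start_copy_fixed, start_rdma_fixed) is hypothetical, and the mock "send" is assumed to take ownership of the descriptor only on success, which is a simplification. It only shows the shape of the fix: free the descriptor on every failure path, and initialize the optional src descriptor to NULL so it can be released on the common error exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; none of this is Open MPI API. */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_RESOURCE_BUSY = -1,
    SKETCH_ERR_OUT_OF_RESOURCE = -2,
    SKETCH_ERR_OTHER = -3
};

typedef struct sketch_des { int unused; } sketch_des_t;

static int outstanding = 0;   /* descriptors allocated but not yet freed */

static sketch_des_t *sketch_alloc(void)
{
    sketch_des_t *d = malloc(sizeof *d);
    if (NULL != d) {
        outstanding++;
    }
    return d;
}

static void sketch_free(sketch_des_t *d)
{
    assert(NULL != d && outstanding > 0);
    free(d);
    outstanding--;
}

/* Mock send: returns send_rc. Assumption: on success the transport owns
 * the descriptor and releases it; on failure the caller still owns it. */
static int sketch_send(sketch_des_t *d, int send_rc)
{
    if (SKETCH_SUCCESS == send_rc) {
        sketch_free(d);
    }
    return send_rc;
}

int sketch_outstanding(void) { return outstanding; }

/* Shape of the code before the fix: the BUSY case skips the free. */
int start_copy_leaky(int send_rc)
{
    sketch_des_t *des = sketch_alloc();
    if (NULL == des) return SKETCH_ERR_OUT_OF_RESOURCE;

    int rc = sketch_send(des, send_rc);
    if (SKETCH_SUCCESS == rc) return rc;

    switch (rc) {
    case SKETCH_ERR_RESOURCE_BUSY:
        rc = SKETCH_ERR_OUT_OF_RESOURCE;   /* des is never freed here */
        break;
    default:
        sketch_free(des);
        break;
    }
    return rc;
}

/* Shape of the code after the fix: remap the error, then always free. */
int start_copy_fixed(int send_rc)
{
    sketch_des_t *des = sketch_alloc();
    if (NULL == des) return SKETCH_ERR_OUT_OF_RESOURCE;

    int rc = sketch_send(des, send_rc);
    if (SKETCH_SUCCESS == rc) return rc;

    if (SKETCH_ERR_RESOURCE_BUSY == rc) {
        rc = SKETCH_ERR_OUT_OF_RESOURCE;
    }
    sketch_free(des);
    return rc;
}

/* Second fix: src exists only on the "get" path, so it starts as NULL
 * and is released on the shared error exit when it was allocated. */
int start_rdma_fixed(int use_get, int send_rc)
{
    sketch_des_t *des, *src = NULL;

    if (use_get) {
        src = sketch_alloc();
        if (NULL == src) return SKETCH_ERR_OUT_OF_RESOURCE;
    }
    des = sketch_alloc();
    if (NULL == des) {
        if (NULL != src) sketch_free(src);
        return SKETCH_ERR_OUT_OF_RESOURCE;
    }

    int rc = sketch_send(des, send_rc);
    if (SKETCH_SUCCESS == rc) {
        /* Simplification: release src immediately instead of on completion. */
        if (NULL != src) sketch_free(src);
        return rc;
    }

    sketch_free(des);
    if (NULL != src) {
        sketch_free(src);
    }
    return rc;
}
```

Calling start_copy_leaky(SKETCH_ERR_RESOURCE_BUSY) leaves one descriptor outstanding per call, while the fixed variants leave the counter unchanged on every path. A counter like sketch_outstanding() is one cheap way to catch this kind of leak in a test before it shows up as exhausted registered memory.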