Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IBCM error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-14 07:05:23


Right about when Brad and I discovered that issue, I ran out of time.
This made IBCM more-or-less unusable for many installations -- we were
kinda hoping for an OpenFabrics fix...

On Jul 13, 2008, at 12:43 PM, Pavel Shamis (Pasha) wrote:

> Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897
>
> Are there any other known IBCM issues?
>
> Regards,
> Pasha
>
> Jeff Squyres wrote:
>> I think you said opposite things: Lenny's command line did not
>> specifically ask for ibcm, but it was used anyway. Lenny -- did
>> you explicitly request it somewhere else (e.g., env var or MCA
>> param file)?
>>
>> I suspect that you did not; I suspect (without looking at the code
>> again) that ibcm tried to select itself and failed on the
>> ibcm_listen() call, so it fell back to oob. This might have to be
>> another workaround in OMPI, perhaps something like this:
>>
>>     if (ibcm_listen() fails)
>>         if (ibcm explicitly requested)
>>             print_warning()
>>         fail to use ibcm
>>
>> Has this been filed as a bug at openfabrics.org? I don't think
>> that I filed it when Brad and I were testing on RoadRunner -- it
>> would probably be good if someone filed it.
>>
>>
>>
>> On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:
>>
>>> Pasha is right, I didn't disable it.
>>>
>>> On 7/13/08, Pavel Shamis (Pasha) <pasha_at_[hidden]> wrote:
>>> Jeff Squyres wrote:
>>> Brad and I did some scale testing of IBCM and saw this error
>>> sometimes. It seemed to happen with higher frequency when you
>>> increased the number of processes on a single node.
>>>
>>> I talked to Sean Hefty about it, but we never figured out a
>>> definitive cause or solution. My best guess is that there is
>>> something wonky about multiple processes simultaneously
>>> interacting with the IBCM kernel driver from userspace; but I
>>> don't know jack about kernel stuff, so that's a total SWAG.
>>>
>>> Thanks for reminding me of this issue; I admit that I had
>>> forgotten about it. :-( Pasha -- should IBCM not be the default?
>>> It is not the default. I guess Lenny configured it explicitly,
>>> didn't he?
>>>
>>> Pasha.
>>>
>>>
>>>
>>>
>>>
>>> On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:
>>>
>>> Hi,
>>>
>>> I am getting this error sometimes.
>>>
>>> /home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile /home/USERS/lenny/TESTS/COMPILERS/hostfile /home/USERS/lenny/TESTS/COMPILERS/hello
>>> [witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] failed to ib_cm_listen 10 times: rc=-1, errno=22
>>> Hello world! I'm 0 of 100 on witch2
>>>
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>>
>>
>>
>

-- 
Jeff Squyres
Cisco Systems
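
The workaround sketched in the quoted pseudocode above (if the listen call
fails, warn only when ibcm was explicitly requested, then decline to use
ibcm so that another connect method such as oob is picked) might look
roughly like the following C. This is only a sketch under assumptions: the
names ibcm_query_sketch, cpc_status, listen_fn_t and the stub listeners are
hypothetical and are not Open MPI's or OpenFabrics' actual API. The retry
count is a parameter; the error message in the thread reports 10 attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result codes; not Open MPI's real constants. */
enum cpc_status { CPC_OK = 0, CPC_NOT_AVAILABLE = 1 };

/* Hypothetical listen hook: returns 0 on success, -1 on failure. */
typedef int (*listen_fn_t)(void);

/*
 * Try to bring up an ibcm-style connect method.
 * If every listen attempt fails, print a warning only when the user
 * explicitly requested the method, then report it as unavailable so
 * the caller can fall back to another method (e.g. oob).
 */
static enum cpc_status ibcm_query_sketch(listen_fn_t listen_fn,
                                         bool explicitly_requested,
                                         int max_attempts)
{
    assert(listen_fn != NULL);

    for (int i = 0; i < max_attempts; ++i) {
        if (listen_fn() == 0) {
            return CPC_OK;          /* listen succeeded; method is usable */
        }
    }

    if (explicitly_requested) {
        fprintf(stderr,
                "warning: ibcm was requested but listen failed %d times; "
                "not using ibcm\n", max_attempts);
    }
    return CPC_NOT_AVAILABLE;       /* silent fallback if not requested */
}

/* Stub listeners standing in for the real listen call. */
static int listen_always_fails(void) { return -1; }
static int listen_succeeds(void)     { return 0; }
```

With these stubs, ibcm_query_sketch(listen_always_fails, false, 10) returns
CPC_NOT_AVAILABLE without printing anything, which is the quiet fallback
suggested for the case where ibcm merely tried to select itself.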