
Open MPI Development Mailing List Archives


This page is part of a frozen web archive of this mailing list; no new mails have been added to it since July 2016.

Subject: Re: [OMPI devel] IBCM error
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-14 07:05:23


Right around when Brad and I discovered that issue, I ran out of time.
The issue made IBCM more or less unusable for many installations; we were
kind of hoping for an OpenFabrics fix...

On Jul 13, 2008, at 12:43 PM, Pavel Shamis (Pasha) wrote:

> Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897
>
> Are there any other known IBCM issues?
>
> Regards,
> Pasha
>
> Jeff Squyres wrote:
>> I think you said opposite things: Lenny's command line did not
>> specifically ask for ibcm, but it was used anyway. Lenny -- did
>> you explicitly request it somewhere else (e.g., env var or MCA
>> param file)?
>>
>> I suspect that you did not; I suspect (without looking at the code
>> again) that ibcm tried to select itself and failed on the
>> ibcm_listen() call, so it fell back to oob. This might call for
>> another workaround in OMPI, perhaps something like this:
>>
>> if (ibcm_listen() fails)
>>     if (ibcm explicitly requested)
>>         print_warning()
>>     fail to use ibcm
>>
>> Has this been filed as a bug at openfabrics.org? I don't think
>> that I filed it when Brad and I were testing on RoadRunner -- it
>> would probably be good if someone filed it.
>>
>>
>>
>> On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:
>>
>>> Pasha is right, I didn't disable it.
>>>
>>> On 7/13/08, Pavel Shamis (Pasha) <pasha_at_[hidden]> wrote:
>>> Jeff Squyres wrote:
>>> Brad and I did some scale testing of IBCM and saw this error
>>> sometimes. It seemed to happen with higher frequency when you
>>> increased the number of processes on a single node.
>>>
>>> I talked to Sean Hefty about it, but we never figured out a
>>> definitive cause or solution. My best guess is that there is
>>> something wonky about multiple processes simultaneously
>>> interacting with the IBCM kernel driver from userspace; but I
>>> don't know jack about kernel stuff, so that's a total SWAG.
>>>
>>> Thanks for reminding me of this issue; I admit that I had
>>> forgotten about it. :-( Pasha -- should IBCM not be the default?
>>>
>>> It is not the default. I guess Lenny configured it explicitly, didn't he?
>>>
>>> Pasha.
>>>
>>> On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:
>>>
>>> Hi,
>>>
>>> I am getting this error sometimes.
>>>
>>> /home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile /home/USERS/lenny/TESTS/COMPILERS/hostfile /home/USERS/lenny/TESTS/COMPILERS/hello
>>> [witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] failed to ib_cm_listen 10 times: rc=-1, errno=22
>>> Hello world! I'm 0 of 100 on witch2
>>>
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>

-- 
Jeff Squyres
Cisco Systems