Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] btl_openib_cpc_include rdmacm questions
From: Brock Palen (brockp_at_[hidden])
Date: 2011-05-18 10:25:49


Well, I have a new wrench to throw into this situation.
We had a power failure at our datacenter that took down our entire system: nodes, switch, and SM.
Now I am unable to reproduce the error with oob, default ibflags, etc.

Does this shed any light on the issue? It also makes the issue hard to debug, since I can no longer reproduce it.

Any thoughts? Am I overlooking something?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp_at_[hidden]
(734)936-1985

On May 17, 2011, at 2:18 PM, Brock Palen wrote:

> Sorry, typo: 314, not 313.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp_at_[hidden]
> (734)936-1985
>
>
>
> On May 17, 2011, at 2:02 PM, Brock Palen wrote:
>
>> Thanks, I thought of looking at ompi_info after I sent that note, sigh.
>>
>> In my synthetic benchmarks, SEND_INPLACE appears to improve performance for larger messages compared with regular SEND. It also appears that SEND_INPLACE still allows our code to run.
>>
>> We are working on getting devs access to our system and code.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp_at_[hidden]
>> (734)936-1985
>>
>>
>>
>> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
>>
>>> Here is the output of the "ompi_info --param btl openib":
>>>
>>> MCA btl: parameter "btl_openib_flags" (current value: <306>, data
>>> source: default value)
>>> BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>>> SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags
>>> only used by the "dr" PML (ignored by others): ACK=16,
>>> CHECKSUM=32, RDMA_COMPLETION=128; flags only used by the "bfo"
>>> PML (ignored by others): FAILOVER_SUPPORT=512)
>>>
>>> So the flag value 305 means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of these flags are useless in the current version of Open MPI (DR is not supported), so the only part that really matters is SEND | HETEROGENEOUS_RDMA.
>>>
>>> If you want to enable the send protocol, try SEND | SEND_INPLACE (9) first; if that does not work, downgrade to SEND (1).
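
The bit arithmetic in the message above can be checked with a short sketch. The bit values are copied from the quoted ompi_info output; the helper names `decode_flags` and `encode_flags` are hypothetical and are not part of Open MPI.

```python
# Bit values as listed in the quoted "ompi_info --param btl openib" output.
BTL_FLAGS = {
    "SEND": 1,
    "PUT": 2,
    "GET": 4,
    "SEND_INPLACE": 8,
    "ACK": 16,                # used only by the "dr" PML
    "CHECKSUM": 32,           # used only by the "dr" PML
    "RDMA_MATCHED": 64,
    "RDMA_COMPLETION": 128,   # used only by the "dr" PML
    "HETEROGENEOUS_RDMA": 256,
    "FAILOVER_SUPPORT": 512,  # used only by the "bfo" PML
}


def decode_flags(value):
    """Hypothetical helper: names of the bits set in value, lowest bit first."""
    ordered = sorted(BTL_FLAGS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if value & bit]


def encode_flags(*names):
    """Hypothetical helper: OR the named bits together into one integer."""
    value = 0
    for name in names:
        value |= BTL_FLAGS[name]
    return value


if __name__ == "__main__":
    # 306 is the default shown by ompi_info, 305 is the value tried in this
    # thread, and 9 and 1 are the two values suggested for the send protocol.
    for value in (306, 305, 9, 1):
        print(value, "=", " | ".join(decode_flags(value)))
```

Decoded this way, 305 and 306 differ only in the two lowest bits: 305 sets SEND where the 306 default sets PUT.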
>>>
>>> george.
>>>
>>> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
>>>
>>>>
>>>> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Just out of curiosity - what happens when you add the following MCA option to your openib runs?
>>>>>>
>>>>>> -mca btl_openib_flags 305
>>>>>
>>>>> You, sir, found the magic combination.
>>>>
>>>> :-) - cool.
>>>>
>>>> Developers - does this smell like a registered memory availability hang?
>>>>
>>>>> I verified this lets IMB and CRASH progress past their lockup points.
>>>>> I will have a user test this.
>>>>
>>>> Please let us know what you find.
>>>>
>>>>> Is this an ok option to put in our environment? What does 305 mean?
>>>>
>>>> There may be a performance hit associated with this configuration, but if it lets your users run, then I don't see a problem with adding it to your environment.
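
For the question of putting the option "in our environment", a minimal sketch follows. It relies on Open MPI reading MCA parameters from environment variables named `OMPI_MCA_<parameter>`, and on `mpirun` accepting `--mca <parameter> <value>`. The helper names, the process count, and the benchmark path `./IMB-MPI1` are placeholders for illustration, not values taken from the site discussed here.

```python
import os


def mca_env(param, value, base_env=None):
    """Hypothetical helper: copy an environment and add one MCA parameter.

    Open MPI picks up MCA parameters from variables named OMPI_MCA_<param>.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_MCA_" + param] = str(value)
    return env


def mpirun_argv(param, value, nprocs, program):
    """Hypothetical helper: the equivalent explicit mpirun command line."""
    return ["mpirun", "--mca", param, str(value), "-np", str(nprocs), program]


if __name__ == "__main__":
    # Site-wide style: export the variable once, then launch jobs as usual.
    env = mca_env("btl_openib_flags", 305, base_env={})
    print(env)

    # Per-job style: pass the same parameter on the command line.
    # "./IMB-MPI1" and 32 processes are placeholders.
    print(" ".join(mpirun_argv("btl_openib_flags", 305, 32, "./IMB-MPI1")))
```

Either form could be handed to `subprocess.run` to launch a job; the environment-variable form is the one that fits a site-wide login script or module file.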
>>>>
>>>> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on SEND.
>>>>
>>>> OpenFabrics gurus - please correct me if I'm wrong :-).
>>>>
>>>> Samuel Gutierrez
>>>> Los Alamos National Laboratory
>>>>
>>>>
>>>>>
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> Center for Advanced Computing
>>>>> brockp_at_[hidden]
>>>>> (734)936-1985
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Samuel Gutierrez
>>>>>> Los Alamos National Laboratory
>>>>>>
>>>>>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>>>>>>
>>>>>>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>>>>>>>
>>>>>>>> Jeff Squyres <jsquyres_at_[hidden]> writes:
>>>>>>>>
>>>>>>>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>>>>>>>>
>>>>>>>>>> We can reproduce it with IMB. We could provide access, but we'd have to
>>>>>>>>>> negotiate with the owners of the relevant nodes to give you interactive
>>>>>>>>>> access to them. Maybe Brock's would be more accessible? (If you
>>>>>>>>>> contact me, I may not be able to respond for a few days.)
>>>>>>>>>
>>>>>>>>> Brock has replied off-list that he, too, is able to reliably reproduce the issue with IMB, and is working to get access for us. Many thanks for your offer; let's see where Brock's access takes us.
>>>>>>>>
>>>>>>>> Good. Let me know if we could be useful
>>>>>>>>
>>>>>>>>>>> -- we have not closed this issue,
>>>>>>>>>>
>>>>>>>>>> Which issue? I couldn't find a relevant-looking one.
>>>>>>>>>
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>>>>>>>>
>>>>>>>> Thanks. In case it's useful info, it hangs for me with 1.5.3 & np=32 on
>>>>>>>> connectx with more than one collective I can't recall.
>>>>>>>
>>>>>>> Extra data point: that ticket said it ran with mpi_preconnect_mpi 1. That doesn't help here; both my production code (CRASH) and IMB still hang.
>>>>>>>
>>>>>>>
>>>>>>> Brock Palen
>>>>>>> www.umich.edu/~brockp
>>>>>>> Center for Advanced Computing
>>>>>>> brockp_at_[hidden]
>>>>>>> (734)936-1985
>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Excuse the typping -- I have a broken wrist
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> George Bosilca
>>> Research Assistant Professor
>>> Innovative Computing Laboratory
>>> Department of Electrical Engineering and Computer Science
>>> University of Tennessee, Knoxville
>>> http://web.eecs.utk.edu/~bosilca/