Open MPI User's Mailing List Archives

From: Mike Houston (mhouston_at_[hidden])
Date: 2007-03-27 13:44:55


Well, mpich2 and mvapich2 are working smoothly for my app. mpich2 over
GigE is also giving ~2X the performance of openmpi in the cases where
openmpi works. After the paper deadline, I'll attempt to package up
a simple test case and send it to the list.

Thanks!

-Mike

Mike Houston wrote:
> Sadly, I've just hit this problem again, so I'll have to find another
> MPI implementation as I have a paper deadline quickly approaching.
>
> I'm using single threads now, but I had very similar issues when using
> multiple threads, issuing send/recv on one thread while waiting on a
> posted MPI_Recv on another. The issue actually seems to be with
> MPI_Gets: I can do heavy MPI_Puts and things seem okay, but as soon
> as I have a similar communication pattern with MPI_Gets, things get
> unstable.
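
A minimal sketch of the contrast being described, for reference (the names,
sizes, and the passive-target lock/unlock epoch are assumptions, not taken
from Mike's code): MPI_Put only pushes data to the target, while MPI_Get
appears to require the target's progress engine to send a reply back, which
is the path that misbehaves in the backtrace quoted further down.

/* Hypothetical sketch of the access pattern described above: the same
 * passive-target epoch, once with MPI_Put (reported stable) and once
 * with MPI_Get (reported unstable).  Names and sizes are illustrative. */
#include <mpi.h>

void one_sided_pattern(MPI_Win win, int target, double *local, int n)
{
    /* Heavy MPI_Put traffic: data flows one way, origin -> target. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);

    /* Same pattern with MPI_Get: the target's progress engine now has
     * to send a reply fragment back to the origin. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);
}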
>
> -Mike
>
> Brian Barrett wrote:
>
>> Mike -
>>
>> In Open MPI 1.2, one-sided is implemented over point-to-point, so I
>> would expect it to be slower. This may or may not be addressed in a
>> future version of Open MPI (I would guess so, but don't want to
>> commit to it). Were you using multiple threads? If so, how?
>>
>> On the good news front, I think your call stack looked similar to what I
>> was seeing, so hopefully I can make some progress on a real solution.
>>
>> Brian
>>
>> On Mar 20, 2007, at 8:54 PM, Mike Houston wrote:
>>
>>
>>
>>> Well, I've managed to get a working solution, but I'm not sure how
>>> I got
>>> there. I built a test case that looked like a nice simple version of
>>> what I was trying to do and it worked, so I moved the test code
>>> into my
>>> implementation and, lo and behold, it works. I must have been doing
>>> something a little funky in the original pass, likely causing a stack
>>> smash somewhere or trying to do a get/put out of bounds.
>>>
>>> If I have any more problems, I'll let y'all know. I've tested pretty
>>> heavy usage up to 128 MPI processes across 16 nodes and things seem to
>>> be behaving. I did notice that single-sided transfers seem to be a
>>> little slower than explicit send/recv, at least on GigE. Once I do
>>> some
>>> more testing, I'll bring things up on IB and see how things are going.
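
For reference, a rough sketch of how that kind of comparison might be timed
(the message size, iteration count, and the fence-synchronized epoch are
illustrative assumptions, not Mike's actual benchmark; it assumes each rank
created win over an N-double buffer):

/* Rough timing sketch (hypothetical sizes/counts): compare an explicit
 * send/recv exchange with a fence-synchronized MPI_Get of the same payload. */
#include <mpi.h>
#include <stdio.h>

#define N     (1 << 20)   /* 1M doubles, illustrative */
#define ITERS 100

void compare(int rank, double *buf, MPI_Win win)
{
    double t0, t_p2p, t_osc;

    /* Explicit two-sided exchange: rank 1 sends, rank 0 receives. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0)
            MPI_Recv(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else if (rank == 1)
            MPI_Send(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    t_p2p = MPI_Wtime() - t0;

    /* One-sided: rank 0 pulls the same payload out of rank 1's window. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Get(buf, N, MPI_DOUBLE, 1, 0, N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
    }
    t_osc = MPI_Wtime() - t0;

    if (rank == 0)
        printf("send/recv: %.3f s   MPI_Get+fence: %.3f s\n", t_p2p, t_osc);
}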
>>>
>>> -Mike
>>>
>>> Mike Houston wrote:
>>>
>>>
>>>> Brian Barrett wrote:
>>>>
>>>>
>>>>
>>>>> On Mar 20, 2007, at 3:15 PM, Mike Houston wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> If I only do gets/puts, things seem to be working correctly with
>>>>>> version 1.2. However, if I have a posted IRecv on the target node
>>>>>> and issue an MPI_Get against that target, MPI_Test on the posted
>>>>>> IRecv causes a segfault:
>>>>>>
>>>>>> Anyone have suggestions? Sadly, I need to have IRecvs posted. I'll
>>>>>> attempt to find a workaround, but it looks like the posted IRecv is
>>>>>> getting all the data of the MPI_Get from the other node. It's like
>>>>>> the message tagging is getting ignored. I've never tried posting
>>>>>> two different IRecvs with different message tags either...
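
A stripped-down sketch of the scenario being described (the tag, sizes, and
the passive-target lock/unlock epoch are assumptions; lock/unlock is only
suggested by the ompi_osc_pt2pt_passive_unlock frame in the backtrace quoted
further down):

/* Hypothetical repro sketch: rank 0 (the target) keeps an MPI_Irecv
 * posted and polls it with MPI_Test while rank 1 does an MPI_Get out
 * of rank 0's window.  Tags, sizes, and names are illustrative. */
#include <mpi.h>

#define N        1024
#define DATA_TAG 42      /* tag for the posted receive, unrelated to the Get */

int main(int argc, char **argv)
{
    int rank, flag = 0;
    double winbuf[N], getbuf[N], recvbuf[N];
    MPI_Win win;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(winbuf, sizeof(winbuf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Target: a receive is posted for later traffic, and we spin
         * on MPI_Test -- the call that segfaults in the report above. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, 1, DATA_TAG, MPI_COMM_WORLD, &req);
        while (!flag)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* Origin: pull data out of rank 0's window ... */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(getbuf, N, MPI_DOUBLE, 0, 0, N, MPI_DOUBLE, win);
        MPI_Win_unlock(0, win);
        /* ... then satisfy the posted receive so rank 0 can finish. */
        MPI_Send(getbuf, N, MPI_DOUBLE, 0, DATA_TAG, MPI_COMM_WORLD);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Whether this exact toy triggers the crash is unclear from the thread; it is
only meant to illustrate the interaction between the polled MPI_Test and the
incoming Get.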
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Hi Mike -
>>>>>
>>>>> I've spent some time this afternoon looking at the problem and have
>>>>> some ideas on what could be happening. I don't think it's a data
>>>>> mismatch (the data intended for the IRecv getting delivered to the
>>>>> Get), but more a problem with the call to MPI_Test perturbing the
>>>>> progress flow of the one-sided engine. I can see one or two places
>>>>> where it's possible this could happen, although I'm having trouble
>>>>> replicating the problem with any test case I can write. Is it
>>>>> possible for you to share the code causing the problem (or some
>>>>> small
>>>>> test case)? It would make me feel considerably better if I could
>>>>> really understand the conditions required to end up in a seg fault
>>>>> state.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Brian
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Well, I can give you a Linux x86 binary if that would do it. The
>>>> code is huge as it's part of a much larger system, so there is no
>>>> such thing as a simple case at the moment, and the code is in pieces
>>>> and largely unrunnable now with all the hacking...
>>>>
>>>> I basically have one thread spinning on an MPI_Test on a posted IRecv
>>>> while being used as the target of the MPI_Get. I'll see if I can
>>>> hack
>>>> together a simple version that breaks late tonight. I've just played
>>>> with posting a send to that IRecv, issuing the MPI_Get,
>>>> handshaking and
>>>> then posting another IRecv and the MPI_Test continues to eat it,
>>>> but in
>>>> a memcpy:
>>>>
>>>> #0  0x001c068c in memcpy () from /lib/libc.so.6
>>>> #1  0x00e412d9 in ompi_convertor_pack (pConv=0x83c1198, iov=0xa0, out_size=0xaffc1fd8, max_data=0xaffc1fdc) at convertor.c:254
>>>> #2  0x00ea265d in ompi_osc_pt2pt_replyreq_send (module=0x856e668, replyreq=0x83c1180) at osc_pt2pt_data_move.c:411
>>>> #3  0x00ea0ebe in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x8573380) at osc_pt2pt_component.c:582
>>>> #4  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
>>>> #5  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
>>>> #6  0x00ea59e5 in ompi_osc_pt2pt_passive_unlock (module=0x856e668, origin=1, count=1) at osc_pt2pt_sync.c:60
>>>> #7  0x00ea0cd2 in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x856f300) at osc_pt2pt_component.c:688
>>>> #8  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
>>>> #9  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
>>>> #10 0x00e33f05 in ompi_request_test (rptr=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at request/req_test.c:82
>>>> #11 0x00e61770 in PMPI_Test (request=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at ptest.c:52
>>>>
>>>> -Mike