Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] OMPI 1.3 problems
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-08-04 19:58:48


I see one difference, and it probably does lead to Terry's cited
ticket. I always run -mca btl ^sm since I'm only testing
functionality, not performance.

Give that a try and see if it completes. If so, then the problem
probably is related to the ticket cited by Terry. Otherwise, we'll
have to consider other options.
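
As a rough sketch, assuming the same process count and test binary as in
Greg's command line quoted below, the suggested run would look something
like this. The variable names are only illustrative, and the ^ is kept in
single quotes because some shells treat it specially:

```shell
# Sketch only: NP and APP mirror the quoted "mpirun -np 5 ./shallow";
# the variable names are illustrative, not part of Open MPI.
NP=5
APP=./shallow
BTL_EXCLUDE='^sm'   # leading ^ = use every BTL component except those listed

CMD="mpirun -np $NP -mca btl $BTL_EXCLUDE $APP"
echo "$CMD"

# Launch only if an mpirun is on PATH and the test binary has been built.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

If the run completes with sm excluded, that would point toward the shared
memory path rather than the application itself.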

Ralph

On Aug 4, 2008, at 5:50 PM, Greg Watson wrote:

> Configuring with ./configure --prefix=/usr/local/openmpi-1.3-devel
> --with-platform=contrib/platform/lanl/macosx-dynamic --disable-io-romio
>
> Recompiling the app, then running with mpirun -np 5 ./shallow
>
> All processes show R+ as their status. If I attach gdb to a worker I
> get the following stack trace:
>
> (gdb) where
> #0 0x9045e58a in swtch_pri ()
> #1 0x904ccbc1 in sched_yield ()
> #2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220
> #3 0x004bb0bc in opal_condition_wait ()
> #4 0x004bca5c in ompi_request_wait_completion ()
> #5 0x004bc92a in mca_pml_ob1_send ()
> #6 0x003cdcab in MPI_Send ()
> #7 0x0000453f in send_updated_ds (res_type=0x5040, jstart=8, jend=11, ds=0xbfff85b0, indx=57, master_id=0) at worker.c:214
> #8 0x0000444d in worker () at worker.c:185
> #9 0x00002e0b in main (argc=1, argv=0xbffff0b8) at main.c:90
>
> The master process shows:
>
> (gdb) where
> #0 0x9045e58a in swtch_pri ()
> #1 0x904ccbc1 in sched_yield ()
> #2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220
> #3 0x004ba8bb in opal_condition_wait ()
> #4 0x004ba6e4 in ompi_request_wait_completion ()
> #5 0x004ba589 in mca_pml_ob1_recv ()
> #6 0x003c80aa in MPI_Recv ()
> #7 0x0000354c in update_global_ds (res_type=0x5040, indx=57, ds=0xbfffd068) at main.c:257
> #8 0x00003334 in main (argc=1, argv=0xbffff0b8) at main.c:195
>
> Seems to be stuck in communication.
>
> Greg
>
> On Aug 4, 2008, at 6:12 PM, Ralph Castain wrote:
>
>> Can you tell us how you are configuring and your command line? As I
>> said, I'm having no problem running your code on my Mac w/10.5,
>> both PowerPC and Intel.
>>
>> Ralph
>>
>> On Aug 4, 2008, at 3:10 PM, Greg Watson wrote:
>>
>>> Yes, the application does sends/receives. No, it doesn't seem to be
>>> getting past MPI_Init.
>>>
>>> I've reinstalled from a completely new 1.3 branch. Still hangs.
>>>
>>> Greg
>>>
>>> On Aug 4, 2008, at 4:45 PM, Terry Dontje wrote:
>>>
>>>> Are you doing any communications? Have you gotten past MPI_Init?
>>>> Could your issue be related to the following ticket?
>>>>
>>>> https://svn.open-mpi.org/trac/ompi/ticket/1378
>>>>
>>>>
>>>> --td
>>>> Greg Watson wrote:
>>>>> I'm seeing the same behavior on trunk as 1.3. The program just hangs.
>>>>>
>>>>> Greg
>>>>>
>>>>> On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote:
>>>>>
>>>>>> Well, I unfortunately cannot test this right now Greg - the 1.3
>>>>>> branch won't build due to a problem with the man page installation
>>>>>> script. The fix is in the trunk, but hasn't migrated across yet.
>>>>>>
>>>>>> :-//
>>>>>>
>>>>>> My guess is that you are caught on some stage where the hanging bugs
>>>>>> hadn't been fixed, but you cannot update to the current head of the
>>>>>> 1.3 branch as it won't compile. All I can suggest is shifting to the
>>>>>> trunk (which definitely works) for now as the man page fix should
>>>>>> migrate soon.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>> On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Depending upon the r-level, there was a problem for a while with the
>>>>>>> system hanging that was caused by a couple of completely unrelated
>>>>>>> issues. I believe these have been fixed now - at least, it is fixed
>>>>>>> on the trunk for me under that same system. I'll check 1.3 now - it
>>>>>>> could be that some commits are missing over there.
>>>>>>>
>>>>>>>
>>>>>>> On Aug 4, 2008, at 12:06 PM, Greg Watson wrote:
>>>>>>>
>>>>>>>> I have a fairly simple test program that runs fine under 1.2 on
>>>>>>>> MacOS X 10.5. When I recompile and run it under 1.3 (head of 1.3
>>>>>>>> branch) it just hangs.
>>>>>>>>
>>>>>>>> They are both built using
>>>>>>>> --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've
>>>>>>>> added --disable-io-romio.
>>>>>>>>
>>>>>>>> Any suggestions?
>>>>>>>>
>>>>>>>> Greg
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel