Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc: mpirun hangs on ia64
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-01-22 17:22:44


On Wed, Jan 22, 2014 at 1:59 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Huh - afraid I can't see anything wrong so far. All looks normal and then
> it just hangs. Any chance you can "gdb" to the proc and see where it is
> stuck?
>

Ralph,

The gstack output below looks like one thread is spinning on an atomic of
some sort.
Running gstack repeatedly 100 times yields the following "histogram" of the
top frame of Thread 1:

     47 opal_atomic_lifo_push > opal_atomic_cmpset_ptr > opal_atomic_cmpset_acq_64
     19 opal_atomic_lifo_push > opal_atomic_cmpset_ptr
      6 opal_atomic_lifo_push > opal_atomic_wmb
     28 opal_atomic_lifo_push

A spin in a lifo push is not consistent (in my experience) with the
possibility that the other thread has failed to post some event. So, the
problem is probably in the atomics or lifo code, though "make check" passes
just fine.
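
For readers unfamiliar with the pattern, here is a minimal sketch of a
generic compare-and-swap LIFO push, written with C11 atomics. It is not the
Open MPI implementation: the names lifo_t, lifo_item_t, lifo_init, lifo_push
and lifo_pop_unsafe are hypothetical, and the real opal_atomic_lifo_push may
differ in detail. It only shows why a push can spin: the retry loop exits
only when the compare-and-swap reports success.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types for illustration; not Open MPI's data structures. */
typedef struct lifo_item {
    struct lifo_item *next;
} lifo_item_t;

typedef struct {
    _Atomic(lifo_item_t *) head;
} lifo_t;

static void lifo_init(lifo_t *lifo)
{
    atomic_init(&lifo->head, NULL);
}

/* Push retries until the compare-and-swap on the head pointer succeeds. */
static void lifo_push(lifo_t *lifo, lifo_item_t *item)
{
    assert(item != NULL);
    lifo_item_t *old_head =
        atomic_load_explicit(&lifo->head, memory_order_relaxed);
    do {
        /* Link the new item in front of the head we last observed. */
        item->next = old_head;
        /* On failure, old_head is refreshed with the current head. */
    } while (!atomic_compare_exchange_weak_explicit(
                 &lifo->head, &old_head, item,
                 memory_order_release, memory_order_relaxed));
}

/* Single-threaded pop for demonstration only; a concurrent pop would
 * also need protection against the ABA problem. */
static lifo_item_t *lifo_pop_unsafe(lifo_t *lifo)
{
    lifo_item_t *top =
        atomic_load_explicit(&lifo->head, memory_order_acquire);
    if (top != NULL) {
        atomic_store_explicit(&lifo->head, top->next, memory_order_relaxed);
    }
    return top;
}
```

With no other thread touching the list, the loop in this sketch should exit
on the first or second attempt, so a push that never returns would point at
the compare-and-swap primitive rather than at contention.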

My ia64 asm is a bit rusty, but I'll give a quick look if/when I can.
I've implemented a lock-free LIFO for ia64 in the past and so have some
idea what I am looking at/for.
However, with my access window closing under 10 minutes from now, anything
more than source inspection will need to wait until tomorrow.

-Paul

$ gstack 21094
Thread 2 (Thread 0x20000000016bf200 (LWP 21095)):
#0 0xa000000000010721 in __kernel_syscall_via_break ()
#1 0x20000000005a00d0 in poll () from /lib/libc.so.6.1
#2 0x2000000000a0c3e0 in poll_dispatch () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libopen-pal.so.6
#3 0x20000000009e5e90 in opal_libevent2021_event_base_loop () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libopen-pal.so.6
#4 0x20000000006bd8a0 in orte_progress_thread_engine () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libopen-rte.so.7
#5 0x20000000003dc310 in start_thread () from /lib/libpthread.so.0
#6 0x20000000005b49a0 in __clone2 () from /lib/libc.so.6.1
#7 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x20000000000566a0 (LWP 21094)):
#0 0x20000000000973f2 in opal_atomic_cmpset_acq_64 () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#1 0x2000000000097350 in opal_atomic_cmpset_ptr () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#2 0x20000000000995d0 in opal_atomic_lifo_push () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#3 0x2000000000099030 in ompi_free_list_grow () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#4 0x200000000009a2a0 in ompi_rb_tree_init () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#5 0x200000000029ec10 in mca_mpool_base_tree_init () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#6 0x2000000000299380 in mca_mpool_base_open () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#7 0x200000000098fd80 in mca_base_framework_open () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libopen-pal.so.6
#8 0x200000000010d6b0 in ompi_mpi_init () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#9 0x20000000001b3460 in PMPI_Init () from /eng/home/PHHargrove/OMPI/openmpi-1.7-latest-linux-ia64/INST/lib/libmpi.so.1
#10 0x4000000000000c00 in main ()

>
> On Jan 22, 2014, at 11:39 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> Ralph,
>
> Attached is the requested output with the addition of "-mca
> grpcomm_base_verbose 5".
> I have also attached a 2nd output with the further addition of "-mca
> oob_tcp_if_include lo" to ensure that this is not related to the firewall
> issues I've seen on other hosts.
>
> I have use of this host until 14:30 PST today, and then lose it for 12
> hours.
> So, tests of the next tarball won't start until after 2:30am - which
> probably means 10am.
>
> -Paul
>
>
> On Wed, Jan 22, 2014 at 7:34 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Weird - everything looks completely normal. Can you add -mca
>> grpcomm_base_verbose 5 to your cmd line?
>>
>>
>> On Jan 22, 2014, at 1:38 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> Following-up as promised:
>>
>> Output from an --enable-debug build is attached.
>>
>> -Paul
>>
>>
>> On Tue, Jan 21, 2014 at 11:25 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>>> Yes, this is familiar. See:
>>> http://www.open-mpi.org/community/lists/devel/2013/11/13182.php
>>>
>>> If I understand correctly, the thread ended with:
>>>
>>> On 03 December 2013, Sylvestre Ledru wrote:
>>>
>>>> FYI, Debian has stopped supporting ia64 for its next release....
>>>> So, I stopped working on that issue.
>>>
>>>
>>> Well, I have access to a Linux/IA64 system and my trials with
>>> openmpi-1.7.4rc2r30361 appear to hang, much as Sylvestre had reported w/
>>> 1.6.5.
>>>
>>> I am attaching output from a build w/o --enable-debug for the command:
>>> $ mpirun -mca plm_base_verbose 5 -mca ras_base_verbose 5 -mca
>>> rmaps_base_verbose 5 -mca ess_base_verbose 5 -np 1 ./ring_c
>>>
>>> I will follow-up with an --enable-debug build when possible.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <log.txt.bz2>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <log.txt.bz2><log-incl-lo.txt.bz2>
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900