Correction: the line below should be:

gmake run FILE=p2p_c

 

From: devel [mailto:devel-bounces@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Tuesday, September 03, 2013 4:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

 

I just retried and I still get errors with the latest trunk. (29112).  If I back up to r29057, then everything is fine.  In addition, I can reproduce this on two different clusters.

Can you try running the entire intel test suite and see if that works?  Maybe a different test will fail for you.

 

    cd ompi-tests/trunk/intel_tests/src

    gmake run FILE=cuda_c

 

You need to modify the Makefile in intel_tests to make it do the right thing.  I am trying to figure out what I should do next.  As I said, I get a variety of different failures; maybe I should collect them up and see what they mean.  This failure has me dead in the water with the trunk.

 

 

 

From: devel [mailto:devel-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:41 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

 

Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single hiccup.

 

Try a fresh checkout - let's make sure you don't have some old cruft lying around.

 

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandevaart@nvidia.com> wrote:

 

I am running a debug build.  Here is my configure line:

 

../configure --enable-debug --enable-shared --disable-static --prefix=/home/rolf/ompi-trunk-29061/64 --with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt --enable-orterun-prefix-by-default --disable-io-romio --enable-picky

 

The test program is from the intel test suite in our ompi-tests repository.

 

Run with at least np=4.  The more np, the better.

 

 

From: devel [mailto:devel-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 3:22 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

 

Also, send me your test code - maybe that is required to trigger it

 

On Sep 3, 2013, at 12:19 PM, Ralph Castain <rhc@open-mpi.org> wrote:



Dang - I just finished running it on odin without a problem. Are you seeing this with a debug or optimized build?

 

 

On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandevaart@nvidia.com> wrote:



Yes, it fails on the current trunk (r29112).  That is what started me on the journey to figure out when things went wrong.  It was working up until r29058.

 

From: devel [mailto:devel-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

 

Are you all the way up to the current trunk? There have been a few typo fixes since the original commit.

 

I'm not familiar with the OOB connect code in openib. The OOB itself isn't using free lists, so I suspect it is something up in the OOB connect code itself. I'll take a look and see if something leaps out at me - it seems to be working fine on IU's odin cluster, which is the only IB-based system I can access.

 

 

On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandevaart@nvidia.com> wrote:




As mentioned in the weekly conference call, I am seeing some strange errors when using the openib BTL.  I have narrowed down the changeset that broke things to the ORTE async code.

 

 

Changeset 29057 does not have these issues.  I do not have a very good characterization of the failures; they are not consistent.  Sometimes the tests pass, and sometimes the stack trace is different.  They seem to happen more with larger np, like np=4 and above.

 

The first failure mode is a segmentation violation, and it always seems to be that we are trying to pop something off a free list.  But the upper parts of the stack trace can vary.  This is with the trunk version 29061.

Ralph, any thoughts on where we go from here?

 

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c

MPITEST info  (0): Starting:  MPI_Irecv_comm:   

[compute-0-4:04752] *** Process received signal ***

[compute-0-4:04752] Signal: Segmentation fault (11)

[compute-0-4:04752] Signal code: Address not mapped (1)

[compute-0-4:04752] Failing at address: 0x28

--------------------------------------------------------------------------

mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------

[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752

GNU gdb Fedora (6.8-27.el5)

Copyright (C) 2008 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

Core was generated by `MPI_Irecv_comm_c'.

Program terminated with signal 11, Segmentation fault.

[New process 4753]

[New process 4756]

[New process 4755]

[New process 4754]

[New process 4752]

#0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111

111             lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;

(gdb) where

#0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111

#1  0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228

#2  0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361

#3  0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)

    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405

#4  0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)

    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494

#5  0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432

#6  0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)

    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245

#7  0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, buffer=0x40ea8f80, tag=102, cbdata=0x0)

    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858

#8  0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)

    at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172

#9  0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, activeq=0x58aa5b0)

    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367

#10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437

#11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, flags=1)

    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645

#12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at ../../orte/runtime/orte_init.c:180

#13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0

#14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6

(gdb)

 




_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

 


 

 
