Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI fails when launched with srun using openib btl.
From: Victor Kocheganov (victor.kocheganov_at_[hidden])
Date: 2013-09-20 09:48:37


My HEAD is at the following git revision:

commit 4c282fe5bc8a4143a8c6ac5c0f8d4af591277f6f
Author: Ralph Castain <rhc_at_[hidden]>
Date: Sun Sep 15 15:33:51 2013 +0000

Maybe there is a difference in PMI? I have PMI-1 on this machine.
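
For anyone trying to reproduce this, the environment setup from step 2 of the quoted message might look roughly like the sketch below. The PREFIX variable and the bin/ and lib/ subdirectories are assumptions (the report only says that PATH and LD_LIBRARY_PATH were exported); only the OMPI_MCA_btl line is taken from the report itself.

```shell
# Assumption: PREFIX is the --prefix passed to configure ($PWD/install when
# run from the source tree); adjust it to the actual install location.
PREFIX="${PREFIX:-$PWD/install}"

# Assumption: binaries live in bin/ and libraries in lib/ under the prefix.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# From the report: restrict the BTL components to self and openib.
export OMPI_MCA_btl=self,openib
```

With these set, `which mpicc` should resolve to the binary under that prefix before step 3 is run.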

On Fri, Sep 20, 2013 at 5:37 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> What revision level are you at? I just checked and it worked fine for me
>
> On Sep 20, 2013, at 2:33 AM, Victor Kocheganov <
> victor.kocheganov_at_[hidden]> wrote:
>
> Hi folks!
>
> I am trying to launch the *Open MPI master branch* with srun (a simple
> send/recv program, see the attachment) using the *openib* BTL, but
> unfortunately I get a *segfault*.
>
> Below is my workflow.
> 1) I configured ompi/master with the following line:
>
> ./autogen.sh && ./configure --prefix=$PWD/install --with-openib \
>     --with-pmi && make -j3 && make install -j3
>
> 2) exported the OMPI_MCA_btl variable (along with PATH and LD_LIBRARY_PATH):
>
> export OMPI_MCA_btl=self,openib
>
> 3) and launched with the following line:
>
> mpicc ~/usefull_tests/mpi_init.c && srun -n 2 ./a.out
>
>
> Eventually I get the following error:
>
> srun: error: mir6: task 1: Segmentation fault (core dumped)
> srun: Terminating job step 17309.2
>
>
> with the following backtrace:
>
> #0 0x00007f856c47b1d0 in ?? ()
> #1 <signal handler called>
> #2 0x00007f856d12d721 in rml_recv_cb (status=0, process_name=0x2027c50, buffer=0x7f857084ed10, tag=102, cbdata=0x0) at connect/btl_openib_connect_oob.c:823
> #3 0x00007f857553ffb0 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x2027b80) at base/rml_base_msg_handlers.c:172
> #4 0x00007f857522a6c6 in event_process_active_single_queue (base=0x1ed6c60, activeq=0x1ec9210) at event.c:1367
> #5 0x00007f857522a93e in event_process_active (base=0x1ed6c60) at event.c:1437
> #6 0x00007f857522afbc in opal_libevent2021_event_base_loop (base=0x1ed6c60, flags=1) at event.c:1645
> #7 0x00007f85754ccc19 in orte_progress_thread_engine (obj=0x7f857577cf20) at runtime/orte_init.c:180
> #8 0x0000003b5a6077f1 in start_thread () from /lib64/libpthread.so.0
> #9 0x0000003b59ee570d in clone () from /lib64/libc.so.6
>
>
>
> Can anybody please help me find the reason for this failure?
>
> P.S. I am using Red Hat Enterprise Linux Server release 6.2 with InfiniBand
> cards.
>
> Thanks in advance,
> Victor Kocheganov.
> <mpi_test.c>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel