
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-08-10 10:04:08


Hi,

Here is the gdb backtrace from a core file generated by a segmentation fault that occurred during a collective call (using the openib btl):

#0 0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
#2 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
#3 0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
#4 0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
#5 0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
#6 0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
#7 0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
#8 0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
#9 0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20)
    at coll_tuned_allreduce.c:223
#10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20)
    at coll_tuned_decision_fixed.c:63
#11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
#12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
#13 0x0000000004058be8 in FEMTown::Domain::align (itf=
            {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>})
    at interface.cpp:371
#14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
#15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
#16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
#17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950)
    at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
#18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
#19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
#20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800,
    defcount=2, closure=0x0) at Python/ceval.c:2968
#21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
#22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958,
    defcount=6, closure=0x0) at Python/ceval.c:2968
#23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
#24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8,
    defcount=3, closure=0x0) at Python/ceval.c:2968
#25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
#26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0,
    defcount=0, closure=0x0) at Python/ceval.c:2968
#27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
#28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668,
    defcount=1, closure=0x0) at Python/ceval.c:2968
#29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
#30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
    closure=0x0) at Python/ceval.c:2968
#31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
#32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
#33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
#34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
#35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
#36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
(gdb) f 1
#1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
2881 reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
Current language: auto; currently c
(gdb)
#1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
2881 reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
(gdb) l
2876
2877     if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
2878         /* call registered callback */
2879         mca_btl_active_message_callback_t* reg;
2880         reg = mca_btl_base_active_message_trigger + hdr->tag;
2881         reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
2882         if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
2883             cqp = (hdr->credits >> 11) & 0x0f;
2884             hdr->credits &= 0x87ff;
2885         } else {
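
A note on reading this trace: frame #0 sits at address 0x0 and its caller is the indirect call on line 2881, which is what one would expect if reg->cbfunc were NULL, i.e. if the incoming fragment carried an hdr->tag with no registered callback (or a corrupted header). That is only an inference from the backtrace, not something confirmed here. The C sketch below illustrates the same dispatch pattern with a guard added. All sketch_* names, the table size, and the simplified callback signature are hypothetical and are not the Open MPI definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the types seen in the gdb listing.
 * The real Open MPI definitions differ; this only mirrors the pattern. */
#define SKETCH_TAG_MAX 256

typedef void (*sketch_cbfunc_t)(unsigned char tag, void *des, void *cbdata);

typedef struct {
    sketch_cbfunc_t cbfunc; /* stays NULL until something registers the tag */
    void *cbdata;
} sketch_callback_t;

/* Static storage is zero-initialized, so every entry starts unregistered. */
static sketch_callback_t sketch_trigger[SKETCH_TAG_MAX];

/* Register a callback for a tag (an unsigned char always fits the table). */
static void sketch_register(unsigned char tag, sketch_cbfunc_t fn, void *cbdata)
{
    assert(fn != NULL);
    sketch_trigger[tag].cbfunc = fn;
    sketch_trigger[tag].cbdata = cbdata;
}

/* Dispatch by tag, as on line 2881 of the listing, but refuse to jump
 * through a NULL pointer. Returns 0 if a callback ran, -1 otherwise. */
static int sketch_dispatch(unsigned char tag, void *des)
{
    sketch_callback_t *reg = sketch_trigger + tag;

    if (NULL == reg->cbfunc) {
        fprintf(stderr, "no callback registered for tag %u\n", (unsigned)tag);
        return -1;
    }
    reg->cbfunc(tag, des, reg->cbdata);
    return 0;
}

/* Example callback: counts how many times it was invoked via cbdata. */
static void sketch_count_cb(unsigned char tag, void *des, void *cbdata)
{
    (void)tag;
    (void)des;
    ++*(int *)cbdata;
}
```

For example, sketch_dispatch(42, NULL) returns -1 until sketch_register(42, sketch_count_cb, &counter) has been called; after that it returns 0 and increments counter. Without the NULL check, the first call would jump to address 0, which is the kind of failure frame #0 shows.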

Regards,
Eloi

On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> Hi Edgar,
>
> The only difference I could observe was that the segmentation fault
> sometimes appeared later during the parallel computation.
>
> I'm running out of ideas here. I wish I could use "--mca coll tuned"
> with "--mca btl self,sm,tcp" so that I could check that the issue is not
> somehow limited to the tuned collective routines.
>
> Thanks,
> Eloi
>
> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> > On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > > hi edgar,
> > >
> > > thanks for the tip, I'm going to try this option as well. the
> > > segmentation fault I'm observing has indeed always happened during a
> > > collective communication. does this option basically switch all
> > > collective communication to basic mode?
> > >
> > > sorry for my ignorance, but what's an NCA?
> >
> > sorry, I meant to type HCA (InfiniBand networking card)
> >
> > Thanks
> > Edgar
> >
> > > thanks,
> > > éloi
> > >
> > > On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> > >> you could try first to use the algorithms in the basic module, e.g.
> > >>
> > >> mpirun -np x --mca coll basic ./mytest
> > >>
> > >> and see whether this makes a difference. I sometimes used to observe a
> > >> (similar?) problem in the openib btl triggered from the tuned
> > >> collective component, in cases where the OFED libraries were installed
> > >> but no NCA was found on a node. It used to work with the basic
> > >> component, however.
> > >>
> > >> Thanks
> > >> Edgar
> > >>
> > >> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > >>> hi Rolf,
> > >>>
> > >>> unfortunately, i couldn't get rid of that annoying segmentation fault
> > >>> when selecting another bcast algorithm. i'm now going to replace
> > >>> MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv)
> > >>> and see if that helps.
> > >>>
> > >>> regards,
> > >>> éloi
> > >>>
> > >>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> > >>>> Hi Rolf,
> > >>>>
> > >>>> thanks for your input. You're right, I missed the
> > >>>> coll_tuned_use_dynamic_rules option.
> > >>>>
> > >>>> I'll check whether the segmentation fault disappears when using the
> > >>>> basic linear bcast algorithm with the command line you
> > >>>> provided.
> > >>>>
> > >>>> Regards,
> > >>>> Eloi
> > >>>>
> > >>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > >>>>> Hi Eloi:
> > >>>>> To select the different bcast algorithms, you need to add an extra
> > >>>>> mca parameter that tells the library to use dynamic selection.
> > >>>>> --mca coll_tuned_use_dynamic_rules 1
> > >>>>>
> > >>>>> One way to make sure you are typing this in correctly is to use it
> > >>>>> with ompi_info. Do the following:
> > >>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > >>>>>
> > >>>>> You should see lots of output with all the different algorithms
> > >>>>> that can be selected for the various collectives.
> > >>>>> Therefore, you need this:
> > >>>>>
> > >>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
> > >>>>> coll_tuned_bcast_algorithm 1
> > >>>>>
> > >>>>> Rolf
> > >>>>>
> > >>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allows
> > >>>>>> switching to the basic linear algorithm. However, whatever the
> > >>>>>> algorithm used, the segmentation fault remains.
> > >>>>>>
> > >>>>>> Could anyone give some advice on ways to diagnose the issue
> > >>>>>> I'm facing?
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Eloi
> > >>>>>>
> > >>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> I'm focusing on the MPI_Bcast routine that seems to randomly
> > >>>>>>> segfault when using the openib btl. I'd like to know if there is
> > >>>>>>> any way to make OpenMPI switch to a different algorithm than the
> > >>>>>>> default one being selected for MPI_Bcast.
> > >>>>>>>
> > >>>>>>> Thanks for your help,
> > >>>>>>> Eloi
> > >>>>>>>
> > >>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I'm observing a random segmentation fault during an internode
> > >>>>>>>> parallel computation involving the openib btl and OpenMPI-1.4.2
> > >>>>>>>> (the same issue can be observed with OpenMPI-1.3.3).
> > >>>>>>>>
> > >>>>>>>> mpirun (Open MPI) 1.4.2
> > >>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> > >>>>>>>> [pbn08:02624] *** Process received signal ***
> > >>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> > >>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> > >>>>>>>> [pbn08:02624] Failing at address: (nil)
> > >>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > >>>>>>>> [pbn08:02624] *** End of error message ***
> > >>>>>>>> sh: line 1: 2624 Segmentation fault
> > >>>>>>>>
> > >>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > >>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > >>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > >>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > >>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> > >>>>>>>> '--parallel=domain'
> > >>>>>>>>
> > >>>>>>>> If I choose not to use the openib btl (by using --mca btl
> > >>>>>>>> self,sm,tcp on the command line, for instance), I don't
> > >>>>>>>> encounter any problem and the parallel computation runs
> > >>>>>>>> flawlessly.
> > >>>>>>>>
> > >>>>>>>> I would like some help to:
> > >>>>>>>> - diagnose the issue I'm facing with the openib btl
> > >>>>>>>> - understand why this issue is observed only when using the
> > >>>>>>>> openib btl and not when using self,sm,tcp
> > >>>>>>>>
> > >>>>>>>> Any help would be very much appreciated.
> > >>>>>>>>
> > >>>>>>>> The output of ompi_info and the configure scripts of OpenMPI
> > >>>>>>>> are attached to this email, along with some information on the
> > >>>>>>>> InfiniBand drivers.
> > >>>>>>>>
> > >>>>>>>> Here is the command line used when launching a parallel
> > >>>>>>>> computation using infiniband:
> > >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>>>>>> --mca btl openib,sm,self,tcp --display-map --verbose --version
> > >>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>>>>>>>
> > >>>>>>>> and the command line used if not using infiniband:
> > >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> > >>>>>>>> --mca btl self,sm,tcp --display-map --verbose --version
> > >>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Eloi
> > >>>>>>
> > >>>>>> _______________________________________________
> > >>>>>> users mailing list
> > >>>>>> users_at_[hidden]
> > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959