
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-09-27 10:44:02


Terry,

Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).
I'm displaying the value of frag->hdr->tag here.
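
A quick note on why a zero tag is suspicious at all: with MCA_BTL_TAG_PML
defined as 0x40 (see the pml_ob1_hdr.h / btl.h values quoted below), every
tag the ob1 PML can generate lies in the 0x41-0x49 range, so hdr->tag == 0
can only come from a corrupted or never-initialized header. A minimal
standalone sketch of that sanity check (the constants are copied from the
quoted headers; the helper function is only my illustration, not code from
the Open MPI tree):

  #include <stdio.h>
  #include <stdint.h>

  #define MCA_BTL_TAG_PML             0x40
  #define MCA_PML_OB1_HDR_TYPE_MATCH  (MCA_BTL_TAG_PML + 1)   /* 0x41, lowest ob1 tag  */
  #define MCA_PML_OB1_HDR_TYPE_FIN    (MCA_BTL_TAG_PML + 9)   /* 0x49, highest ob1 tag */

  /* returns 1 if 'tag' could have been produced by the ob1 PML on the send side */
  static int tag_is_valid_ob1(uint8_t tag)
  {
      return tag >= MCA_PML_OB1_HDR_TYPE_MATCH && tag <= MCA_PML_OB1_HDR_TYPE_FIN;
  }

  int main(void)
  {
      uint8_t seen = 0;   /* the value observed in the failing receive path */
      printf("tag=0x%02x -> %s\n", seen,
             tag_is_valid_ob1(seen) ? "valid ob1 tag" : "corrupted/uninitialized header");
      return 0;
  }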

Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> Eloi, sorry, can you print out frag->hdr->tag?
>
> Unfortunately from your last email I think it will still all have
> non-zero values.
> If that ends up being the case then there must be something odd with the
> descriptor pointer to the fragment.
>
> --td
>
> Eloi Gaudry wrote:
> > Terry,
> >
> > Please find enclosed the requested check outputs (using -output-filename
> > stdout.tag.null option).
> >
> > For information, Nysal in his first message referred to
> > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong
> > on the receiving side:
> > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
> >
> > #define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4)
> >
> > #define MCA_PML_OB1_HDR_TYPE_NACK (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET (MCA_BTL_TAG_PML + 7)
> >
> > #define MCA_PML_OB1_HDR_TYPE_PUT (MCA_BTL_TAG_PML + 8)
> >
> > #define MCA_PML_OB1_HDR_TYPE_FIN (MCA_BTL_TAG_PML + 9)
> > and in ompi/mca/btl/btl.h
> > #define MCA_BTL_TAG_PML 0x40
> >
> > Eloi
> >
> > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> >> I am thinking of checking the value of *frag->hdr right before the return
> >> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
> >> It is line 548 in the trunk:
> >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
> >>
> >> --td
> >>
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>>
> >>> Do you have any patch that I could apply to be able to do so ? I'm
> >>> remotely working on a cluster (with a terminal) and I cannot use any
> >>> parallel debugger or sequential debugger (with a call to xterm...). I
> >>> can track frag->hdr->tag value in
> >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
> >>> SEND/RDMA_WRITE case, but this is all I can think of alone.
> >>>
> >>> You'll find a stacktrace (receive side) in this thread (10th or 11th
> >>> message) but it might be pointless.
> >>>
> >>> Regards,
> >>> Eloi
> >>>
> >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
> >>>> So it sounds like coalescing is not your issue and that the problem
> >>>> has something to do with the queue sizes. It would be helpful if we
> >>>> could detect the hdr->tag == 0 issue on the sending side and get at
> >>>> least a stack trace. There is something really odd going on here.
> >>>>
> >>>> --td
> >>>>
> >>>> Eloi Gaudry wrote:
> >>>>> Hi Terry,
> >>>>>
> >>>>> I'm sorry to say that I might have missed a point here.
> >>>>>
> >>>>> I've lately been relaunching all previously failing computations with
> >>>>> the message coalescing feature being switched off, and I saw the same
> >>>>> hdr->tag=0 error several times, always during a collective call
> >>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
> >>>>> soon as I switched to the peer queue option I was previously using
> >>>>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
> >>>>> --mca btl_openib_use_message_coalescing 0), all computations ran
> >>>>> flawlessly.
> >>>>>
> >>>>> As for the reproducer, I've already tried to write something but I
> >>>>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
> >>>>>
> >>>>> Eloi
> >>>>>
> >>>>> On 24/09/2010 18:37, Terry Dontje wrote:
> >>>>>> Eloi Gaudry wrote:
> >>>>>>> Terry,
> >>>>>>>
> >>>>>>> You were right, the error indeed seems to come from the message
> >>>>>>> coalescing feature. If I turn it off using the "--mca
> >>>>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the
> >>>>>>> "hdr->tag=0" error.
> >>>>>>>
> >>>>>>> There are some trac requests associated with very similar errors
> >>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they
> >>>>>>> are all closed (except
> >>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be
> >>>>>>> related), aren't they? What would you suggest, Terry?
> >>>>>>
> >>>>>> Interesting, though it looks to me like the segv in ticket 2352
> >>>>>> would have happened on the send side instead of the receive side
> >>>>>> like you have. As to what to do next it would be really nice to
> >>>>>> have some sort of reproducer that we can try and debug what is
> >>>>>> really going on. The only other thing to do without a reproducer
> >>>>>> is to inspect the code on the send side to figure out what might
> >>>>>> make it generate a 0 hdr->tag. Or maybe instrument the send side
> >>>>>> to stop when it is about ready to send a 0 hdr->tag and see if we
> >>>>>> can see how the code got there.
> >>>>>>
> >>>>>> I might have some cycles to look at this Monday.
> >>>>>>
> >>>>>> --td
> >>>>>>
> >>>>>>> Eloi
> >>>>>>>
> >>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> >>>>>>>> Eloi Gaudry wrote:
> >>>>>>>>> Terry,
> >>>>>>>>>
> >>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
> >>>>>>>>>
> >>>>>>>>> The reason why is quite simple. I've been reading this thread again
> >>>>>>>>> and again to understand the meaning of btl_openib_receive_queues,
> >>>>>>>>> and I can't figure out why the default values seem to induce the
> >>>>>>>>> hdr->tag=0 issue
> >>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
> >>>>>>>>
> >>>>>>>> Yeah, the size of the fragments and number of them really should
> >>>>>>>> not cause this issue. So I too am a little perplexed about it.
> >>>>>>>>
> >>>>>>>>> Do you think that the default shared receive queue parameters
> >>>>>>>>> are erroneous for this specific Mellanox card ? Any help on
> >>>>>>>>> finding the proper parameters would actually be much
> >>>>>>>>> appreciated.
> >>>>>>>>
> >>>>>>>> I don't necessarily think it is the queue size for a specific card
> >>>>>>>> but more so the handling of the queues by the BTL when using
> >>>>>>>> certain sizes. At least that is one gut feel I have.
> >>>>>>>>
> >>>>>>>> In my mind the tag being 0 is either something below OMPI is
> >>>>>>>> polluting the data fragment or OMPI's internal protocol is somehow
> >>>>>>>> getting messed up. I can imagine (no empirical data here)
> >>>>>>>> the queue sizes could change how the OMPI protocol sets things
> >>>>>>>> up. Another thing may be the coalescing feature in the openib BTL
> >>>>>>>> which tries to gang multiple messages into one packet when
> >>>>>>>> resources are running low. I can see where changing the queue
> >>>>>>>> sizes might affect the coalescing. So, it might be interesting to
> >>>>>>>> turn off the coalescing. You can do that by setting "--mca
> >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
> >>>>>>>>
> >>>>>>>> If that doesn't solve the issue then obviously there must be
> >>>>>>>> something else going on :-).
> >>>>>>>>
> >>>>>>>> Note, the reason I am interested in this is I am seeing a similar
> >>>>>>>> error condition (hdr->tag == 0) on a development system. Though
> >>>>>>>> my failing case fails with np=8 using the connectivity test
> >>>>>>>> program, which is mainly point-to-point, and there is not a
> >>>>>>>> significant amount of data transfer going on either.
> >>>>>>>>
> >>>>>>>> --td
> >>>>>>>>
> >>>>>>>>> Eloi
> >>>>>>>>>
> >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
> >>>>>>>>>> That is interesting. So does the number of processes affect
> >>>>>>>>>> your runs at all? The times I've seen hdr->tag be 0 it has
> >>>>>>>>>> usually been due to protocol issues. The tag should never be 0.
> >>>>>>>>>> Have you tried receive_queue settings other than the
> >>>>>>>>>> default and the one you mention?
> >>>>>>>>>>
> >>>>>>>>>> I wonder whether a combination of the two receive queues
> >>>>>>>>>> causes a failure or not. Something like
> >>>>>>>>>>
> >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
> >>>>>>>>>>
> >>>>>>>>>> I am wondering if it is the first queuing definition causing the
> >>>>>>>>>> issue or possibly the SRQ defined in the default.
> >>>>>>>>>>
> >>>>>>>>>> --td
> >>>>>>>>>>
> >>>>>>>>>> Eloi Gaudry wrote:
> >>>>>>>>>>> Hi Terry,
> >>>>>>>>>>>
> >>>>>>>>>>> The messages being sent/received can be of any size, but the
> >>>>>>>>>>> error seems to happen more often with small messages (such as an
> >>>>>>>>>>> int being broadcast or allreduced). The failing communication
> >>>>>>>>>>> differs from one run to another, but some spots are more likely
> >>>>>>>>>>> to fail than others. And as far as I know, the failures are
> >>>>>>>>>>> always located next to a small-message communication (an int
> >>>>>>>>>>> being broadcast, for instance). Other typical message sizes are
> >>>>>>>>>>> >10k but can be very much larger.
> >>>>>>>>>>>
> >>>>>>>>>>> I've been checking the HCA being used; it's from Mellanox (with
> >>>>>>>>>>> vendor_part_id=26428). There is no receive_queues parameter
> >>>>>>>>>>> associated with it in the device parameters file either:
> >>>>>>>>>>>
> >>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini
> >>>>>>>>>>> [...]
> >>>>>>>>>>>
> >>>>>>>>>>> # A.k.a. ConnectX
> >>>>>>>>>>> [Mellanox Hermon]
> >>>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> >>>>>>>>>>> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> >>>>>>>>>>> use_eager_rdma = 1
> >>>>>>>>>>> mtu = 2048
> >>>>>>>>>>> max_inline_data = 128
> >>>>>>>>>>>
> >>>>>>>>>>> [..]
> >>>>>>>>>>>
> >>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
> >>>>>>>>>>>
> >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> >>>>>>>>>>>
> >>>>>>>>>>> I was wondering if these parameters (automatically computed at
> >>>>>>>>>>> openib btl init, as far as I understood) were incorrect in some
> >>>>>>>>>>> way, so I plugged in other values:
> >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
> >>>>>>>>>>> when encountering a different issue). Since then, I haven't
> >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0 in
> >>>>>>>>>>> btl_openib_component.c:2881) yet.
> >>>>>>>>>>>
> >>>>>>>>>>> Eloi
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
> >>>>>>>>>>>
> >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
> >>>>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what
> >>>>>>>>>>>> size of job it is? Does it always fail on the same bcast, or
> >>>>>>>>>>>> same process?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Eloi Gaudry wrote:
> >>>>>>>>>>>>> Hi Nysal,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for your suggestions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm now able to get the checksum computed and redirected to
> >>>>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5"
> >>>>>>>>>>>>> option, you were right). I haven't been able to observe the
> >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
> >>>>>>>>>>>>> csum), but I'll let you know when I do.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I've got two other questions, which may be related to the
> >>>>>>>>>>>>> error observed:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by
> >>>>>>>>>>>>> OpenMPI somehow depend on the btl being used (i.e. if I'm
> >>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects as
> >>>>>>>>>>>>> with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2/ the segfaults only appear during an MPI collective call,
> >>>>>>>>>>>>> with very small messages (one int being broadcast, for
> >>>>>>>>>>>>> instance); I followed the guidelines given at
> >>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
> >>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
> >>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the
> >>>>>>>>>>>>> segfault remains. Does the openib btl handle very small
> >>>>>>>>>>>>> messages differently (even with eager_rdma
> >>>>>>>>>>>>> deactivated) than tcp?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma?
> >>>>>>>>>>>> If so, then that would possibly be one difference between the
> >>>>>>>>>>>> openib btl and tcp, aside from the actual protocol used.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> is there a way to make sure that large messages and small
> >>>>>>>>>>>>> messages are handled the same way ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do you mean so they all look like eager messages? How large
> >>>>>>>>>>>> are the messages we are talking about here: 1K, 1M, or 10M?
> >>>>>>>>>>>>
> >>>>>>>>>>>> --td
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> >>>>>>>>>>>>>> Hi Eloi,
> >>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while
> >>>>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to
> >>>>>>>>>>>>>> the command line. This will print the checksum details for
> >>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didn't
> >>>>>>>>>>>>>> catch anything because the BTL failed. The checksum
> >>>>>>>>>>>>>> verification is done in the PML, which the BTL calls via a
> >>>>>>>>>>>>>> callback function. In your case the PML callback is never
> >>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling
> >>>>>>>>>>>>>> checksum tracing also might not be of much use. Is it the
> >>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what is the
> >>>>>>>>>>>>>> message size? I'm not sure what could be the problem at
> >>>>>>>>>>>>>> this moment. I'm afraid you will have to debug the BTL to
> >>>>>>>>>>>>>> find out more.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --Nysal
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
> >>>>>>>>>>>>>>> Hi Nysal,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks for your response.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've been unable so far to write a test case that could
> >>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
> >>>>>>>>>>>>>>> Actually, I'm only observing this issue when running an
> >>>>>>>>>>>>>>> internode computation involving infiniband hardware from
> >>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
> >>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI
> >>>>>>>>>>>>>>> use performed during a parallel computation and I couldn't
> >>>>>>>>>>>>>>> find any error so far. The fact that the very
> >>>>>>>>>>>>>>> same parallel computation runs flawlessly when using tcp
> >>>>>>>>>>>>>>> (and disabling openib support) might seem to indicate that
> >>>>>>>>>>>>>>> the issue is somewhere located inside the
> >>>>>>>>>>>>>>> openib btl or at the hardware/driver level.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't
> >>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the
> >>>>>>>>>>>>>>> segfault occurs). Any suggestion?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
> >>>>>>>>>>>>>>>> Hi Eloi,
> >>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire
> >>>>>>>>>>>>>>>> email thread, but do you have a test case which can
> >>>>>>>>>>>>>>>> reproduce this error? Without that it will be difficult to
> >>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not work for an
> >>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce it on an IB
> >>>>>>>>>>>>>>>> system. There is also a PML called csum; you can use it
> >>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the MPI messages
> >>>>>>>>>>>>>>>> and verify them at the receiver side for any data
> >>>>>>>>>>>>>>>> corruption. You can try using it to see if it is able to
> >>>>>>>>>>>>>>>> catch anything.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>>> --Nysal
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
> >>>>>>>>>>>>>>>>> Hi Nysal,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
> >>>>>>>>>>>>>>>>> chance to look at this error.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Eloi Gaudry
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Free Field Technologies
> >>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
> >>>>>>>>>>>>>>>>> Company Phone: +32 10 487 959
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>>>>>>>>>> From: Eloi Gaudry <eg_at_[hidden]>
> >>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>
> >>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
> >>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using openib btl
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at
> >>>>>>>>>>>>>>>>> this issue.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>> Hi Jeff,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10
> >>>>>>>>>>>>>>>>>> --mca btl openib,self --display-map --verbose
> >>>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
> >>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> >>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
> >>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
> >>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> >>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
> >>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch
> >>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using 1.4.2 with the
> >>>>>>>>>>>>>>>>>>>>> suppression file), "Use of uninitialized bytes" and
> >>>>>>>>>>>>>>>>>>>>> "Conditional jump depending on uninitialized bytes"
> >>>>>>>>>>>>>>>>>>>>> in different ompi routines. Some of them are located
> >>>>>>>>>>>>>>>>>>>>> in btl_openib_component.c. I'll send you an output of
> >>>>>>>>>>>>>>>>>>>>> valgrind shortly.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected --
> >>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its
> >>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware of them (and
> >>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
> >>>>>>>>>>>>>>>>>>>> uninitialized).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> would it help if I use the upcoming 1.5 version of
> >>>>>>>>>>>>>>>>>>> openmpi? I read that a huge effort has been done to
> >>>>>>>>>>>>>>>>>>> clean up the valgrind output, but maybe this doesn't
> >>>>>>>>>>>>>>>>>>> concern this btl (for the reasons you mentioned).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Another question: you said that the callback function
> >>>>>>>>>>>>>>>>>>>>> pointer should never be 0. But can the tag be null
> >>>>>>>>>>>>>>>>>>>>> (hdr->tag)?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I was worrying that its value could not be null.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build
> >>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Thanks for your help,
> >>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
> >>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer
> >>>>>>>>>>>>>>>>>>>>>> should never be 0.
> >>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption
> >>>>>>>>>>>>>>>>>>>>>> is occurring.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack
> >>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through python, but
> >>>>>>>>>>>>>>>>>>>>>> can you run this application through valgrind, or
> >>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger?
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the function parameters:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
> >>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
> >>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
> >>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
> >>>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
> >>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_send_length = 1048576,
> >>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864,
> >>>>>>>>>>>>>>>>>>>>>>> btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
> >>>>>>>>>>>>>>>>>>>>>>> btl_flags = 310, btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>, btl_register = 0,
> >>>>>>>>>>>>>>>>>>>>>>> btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb <mca_btl_base_dump>, btl_mpool = 0xf3f4110,
> >>>>>>>>>>>>>>>>>>>>>>> btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
> >>>>>>>>>>>>>>>>>>>>>>> btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
> >>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
> >>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
> >>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
> >>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during a
> >>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault observed during a collective call (using openib):
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> >>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
> >>>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> >>>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> >>>>>>>>>>>>>>>>>>>>>>>>>> #2 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
> >>>>>>>>>>>>>>>>>>>>>>>>>> #3 0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
> >>>>>>>>>>>>>>>>>>>>>>>>>> #4 0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
> >>>>>>>>>>>>>>>>>>>>>>>>>> #5 0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
> >>>>>>>>>>>>>>>>>>>>>>>>>> #6 0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
> >>>>>>>>>>>>>>>>>>>>>>>>>> #7 0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> >>>>>>>>>>>>>>>>>>>>>>>>>> #8 0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> >>>>>>>>>>>>>>>>>>>>>>>>>> #9 0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
> >>>>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
> >>>>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> >>>>>>>>>>>>>>>>>>>>>>>>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
> >>>>>>>>>>>>>>>>>>>>>>>>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
> >>>>>>>>>>>>>>>>>>>>>>>>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
> >>>>>>>>>>>>>>>>>>>>>>>>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
> >>>>>>>>>>>>>>>>>>>>>>>>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> >>>>>>>>>>>>>>>>>>>>>>>>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> >>>>>>>>>>>>>>>>>>>>>>>>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> >>>>>>>>>>>>>>>>>>>>>>>>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
> >>>>>>>>>>>>>>>>>>>>>>>>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>>>>>>>>>>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>>>>>>>>>>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>>>>>>>>>>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>>>>>>>>>>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
> >>>>>>>>>>>>>>>>>>>>>>>>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
> >>>>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> >>>>>>>>>>>>>>>>>>>>>>>>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
> >>>>>>>>>>>>>>>>>>>>>>>>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
> >>>>>>>>>>>>>>>>>>>>>>>>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
> >>>>>>>>>>>>>>>>>>>>>>>>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
> >>>>>>>>>>>>>>>>>>>>>>>>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
> >>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) f 1
> >>>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> >>>>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
> >>>>>>>>>>>>>>>>>>>>>>>>>> (gdb)
> >>>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> >>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2877        if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2878            /* call registered callback */
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2879            mca_btl_active_message_callback_t* reg;
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2880            reg = mca_btl_base_active_message_trigger + hdr->tag;
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2882            if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2883                cqp = (hdr->credits >> 11) & 0x0f;
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2884                hdr->credits &= 0x87ff;
> >>>>>>>>>>>>>>>>>>>>>>>>>> 2885            } else {
> >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observed was that the
> >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later
> >>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could use
> >>>>>>>>>>>>>>>>>>>>>>>>>>> the "--mca coll tuned" with "--mca self,sm,tcp" so
> >>>>>>>>>>>>>>>>>>>>>>>>>>> that I could check that the issue is not somehow
> >>>>>>>>>>>>>>>>>>>>>>>>>>> limited to the tuned collective routines.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> as well. The segmentation fault I'm observing
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> always happened during a collective
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> communication indeed... it basically switches
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> all collective communication to basic mode, right?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA ?
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> networking card)
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in the basic module,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> e.g.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used to observe
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sometimes a (similar ?) problem in the openib
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> component, in cases where the ofed libraries
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> were installed but no NCA was found on a node.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> It used to work however with the basic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> component.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I couldn't get rid of that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. I'm now going to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive implementation
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (using MPI_Send and MPI_Recv) and see if that
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> helps.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if the segmentation fault
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when using the basic linear bcast
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm with the proper command line you
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> provided.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca parameter
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use dynamic
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in correctly is to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the following:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives. Therefore,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed me to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway, whatever the algorithm used, the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diagnose the issue I'm facing?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly segfault when
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd like to know
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if there is any way to make OpenMPI
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to a different algorithm than the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> default one being selected for MPI_Bcast.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during an internode parallel
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation involving the openib btl
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3).
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain'
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl self,sm,tcp on the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance), I don't
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the parallel
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to understand why this issue is observed only when using
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   the openib btl and not when using self,sm,tcp
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed with this email, and some
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> information on the infiniband drivers
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a parallel computation
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map --verbose --version --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose --version --mca
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>> users mailing list
> >>>>>>>>>>>>> users_at_[hidden]
> >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
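
PS: the "naive MPI_Bcast replacement" I mentioned earlier in the thread
(root sends with MPI_Send, everyone else posts a matching MPI_Recv) is just
a linear loop along these lines. This is a sketch only, with no error
handling or non-contiguous datatype support, not the code actually used in
our application:

  #include <mpi.h>

  /* Linear broadcast built only on point-to-point calls: the root sends the
     buffer to every other rank; the others post a matching receive. */
  static int naive_bcast(void *buf, int count, MPI_Datatype dtype,
                         int root, MPI_Comm comm)
  {
      int rank, size, peer;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      if (rank == root) {
          for (peer = 0; peer < size; ++peer)
              if (peer != root)
                  MPI_Send(buf, count, dtype, peer, 1234 /* tag */, comm);
      } else {
          MPI_Recv(buf, count, dtype, root, 1234 /* tag */, comm,
                   MPI_STATUS_IGNORE);
      }
      return MPI_SUCCESS;
  }

  int main(int argc, char **argv)
  {
      int value = 0, rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0)
          value = 42;                      /* payload to broadcast */
      naive_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }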

-- 
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959