Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Eloi Gaudry (eg_at_[hidden])
Date: 2012-01-31 04:34:29


Hi,

I would just like to give you an update on this issue.
Since we upgraded to OpenMPI-1.4.4, we cannot reproduce it anymore.

Regards,
Eloi

On 09/29/2010 06:01 AM, Nysal Jan wrote:
> Hi Eloi,
> We discussed this issue during the weekly developer meeting & there
> were no further suggestions, apart from checking the driver and
> firmware levels. The consensus was that it would be better if you
> could take this up directly with your IB vendor.
>
> Regards
> --Nysal
>
> On Mon, Sep 27, 2010 at 8:14 PM, Eloi Gaudry <eg_at_[hidden]
> <mailto:eg_at_[hidden]>> wrote:
>
> Terry,
>
> Please find enclosed the requested check outputs (using
> -output-filename stdout.tag.null option).
> I'm displaying frag->hdr->tag here.
>
> Eloi
>
> On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> > Eloi, sorry can you print out frag->hdr->tag?
> >
> > Unfortunately from your last email I think it will still all have
> > non-zero values.
> > If that ends up being the case then there must be something odd
> with the
> > descriptor pointer to the fragment.
> >
> > --td
> >
> > Eloi Gaudry wrote:
> > > Terry,
> > >
> > > Please find enclosed the requested check outputs (using
> -output-filename
> > > stdout.tag.null option).
> > >
> > > For information, Nysal in his first message referred to
> > > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was
> > > wrong on the receiving side:
> > > #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> > > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > > #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > > #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> > > and in ompi/mca/btl/btl.h:
> > > #define MCA_BTL_TAG_PML 0x40
> > >
> > > Eloi
> > >
> > > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> > >> I am thinking of checking the value of *frag->hdr right before the
> > >> return in the post_send function in
> > >> ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk:
> > >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
> > >>
> > >> --td
> > >>
> > >> Eloi Gaudry wrote:
> > >>> Hi Terry,
> > >>>
> > >>> Do you have any patch that I could apply to be able to do so
> ? I'm
> > >>> remotely working on a cluster (with a terminal) and I cannot
> use any
> > >>> parallel debugger or sequential debugger (with a call to
> xterm...). I
> > >>> can track frag->hdr->tag value in
> > >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
> > >>> SEND/RDMA_WRITE case, but this is all I can think of alone.
> > >>>
> > >>> You'll find a stacktrace (receive side) in this thread (10th
> or 11th
> > >>> message) but it might be pointless.
> > >>>
> > >>> Regards,
> > >>> Eloi
> > >>>
> > >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
> > >>>> So it sounds like coalescing is not your issue and that the
> problem
> > >>>> has something to do with the queue sizes. It would be
> helpful if we
> > >>>> could detect the hdr->tag == 0 issue on the sending side
> and get at
> > >>>> least a stack trace. There is something really odd going
> on here.
> > >>>>
> > >>>> --td
> > >>>>
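A minimal sketch of the kind of send-side check discussed above, placed right before the fragment is posted (for instance near the end of post_send() in ompi/mca/btl/openib/btl_openib_endpoint.h mentioned earlier); the variable names are assumptions and would have to be adapted to the local scope:

    /* Abort as soon as a zero tag is about to go out, so the sending
     * stack trace ends up in the core file.  frag->hdr->tag follows the
     * naming used elsewhere in this thread. */
    if (OPAL_UNLIKELY(0 == frag->hdr->tag)) {
        opal_output(0, "openib btl: posting a fragment with hdr->tag == 0");
        abort();  /* produce a core dump with the send-side stack */
    }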
> > >>>> Eloi Gaudry wrote:
> > >>>>> Hi Terry,
> > >>>>>
> > >>>>> I'm sorry to say that I might have missed a point here.
> > >>>>>
> > >>>>> I've lately been relaunching all previously failing
> computations with
> > >>>>> the message coalescing feature being switched off, and I
> saw the same
> > >>>>> hdr->tag=0 error several times, always during a collective
> call
> > >>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Bcast, so far). And as
> > >>>>> soon as I switched to the peer queue option I was
> previously using
> > >>>>> (--mca btl_openib_receive_queues P,65536,256,192,128
> instead of using
> > >>>>> --mca btl_openib_use_message_coalescing 0), all
> computations ran
> > >>>>> flawlessly.
> > >>>>>
> > >>>>> As for the reproducer, I've already tried to write
> something but I
> > >>>>> haven't succeeded so far at reproducing the hdr->tag=0
> issue with it.
> > >>>>>
> > >>>>> Eloi
> > >>>>>
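Spelled out, the two workarounds compared above would look roughly like this on an mpirun command line (process count, host file, btl list and application name are placeholders):

    # per-peer receive queue only, instead of the default PP+SRQ mix
    mpirun -np 16 --hostfile host.list \
        --mca btl openib,sm,self \
        --mca btl_openib_receive_queues P,65536,256,192,128 \
        ./my_app

    # keep the default queues, but disable message coalescing in the openib btl
    mpirun -np 16 --hostfile host.list \
        --mca btl openib,sm,self \
        --mca btl_openib_use_message_coalescing 0 \
        ./my_app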
> > >>>>> On 24/09/2010 18:37, Terry Dontje wrote:
> > >>>>>> Eloi Gaudry wrote:
> > >>>>>>> Terry,
> > >>>>>>>
> > >>>>>>> You were right, the error indeed seems to come from the
> message
> > >>>>>>> coalescing feature. If I turn it off using the "--mca
> > >>>>>>> btl_openib_use_message_coalescing 0", I'm not able to
> observe the
> > >>>>>>> "hdr->tag=0" error.
> > >>>>>>>
> > >>>>>>> There are some trac requests associated to very similar
> error
> > >>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing)
> but they
> > >>>>>>> are all closed (except
> > >>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be
> > >>>>>>> related), aren't they ? What would you suggest Terry ?
> > >>>>>>
> > >>>>>> Interesting, though it looks to me like the segv in
> ticket 2352
> > >>>>>> would have happened on the send side instead of the
> receive side
> > >>>>>> like you have. As to what to do next it would be really
> nice to
> > >>>>>> have some sort of reproducer that we can try and debug
> what is
> > >>>>>> really going on. The only other thing to do without a
> reproducer
> > >>>>>> is to inspect the code on the send side to figure out
> what might
> > >>>>>> make it generate at 0 hdr->tag. Or maybe instrument the
> send side
> > >>>>>> to stop when it is about ready to send a 0 hdr->tag and
> see if we
> > >>>>>> can see how the code got there.
> > >>>>>>
> > >>>>>> I might have some cycles to look at this Monday.
> > >>>>>>
> > >>>>>> --td
> > >>>>>>
> > >>>>>>> Eloi
> > >>>>>>>
> > >>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> > >>>>>>>> Eloi Gaudry wrote:
> > >>>>>>>>> Terry,
> > >>>>>>>>>
> > >>>>>>>>> No, I haven't tried any other values than
> P,65536,256,192,128
> > >>>>>>>>> yet.
> > >>>>>>>>>
> > >>>>>>>>> The reason why is quite simple. I've been reading and reading
> > >>>>>>>>> this thread again to understand the meaning of
> > >>>>>>>>> btl_openib_receive_queues, and I can't figure out why the default
> > >>>>>>>>> values seem to induce the hdr->tag=0 issue
> > >>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
> > >>>>>>>>
> > >>>>>>>> Yeah, the size of the fragments and number of them
> really should
> > >>>>>>>> not cause this issue. So I too am a little perplexed
> about it.
> > >>>>>>>>
> > >>>>>>>>> Do you think that the default shared receive queue parameters
> > >>>>>>>>> are erroneous for this specific Mellanox card? Any
> help on
> > >>>>>>>>> finding the proper parameters would actually be much
> > >>>>>>>>> appreciated.
> > >>>>>>>>
> > >>>>>>>> I don't necessarily think it is the queue size for a
> specific card
> > >>>>>>>> but more so the handling of the queues by the BTL when
> using
> > >>>>>>>> certain sizes. At least that is one gut feel I have.
> > >>>>>>>>
> > >>>>>>>> In my mind the tag being 0 means either something below OMPI is
> > >>>>>>>> polluting the data fragment, or OMPI's internal protocol is
> > >>>>>>>> somehow getting messed up. I can imagine (no empirical data here)
> > >>>>>>>> the queue sizes could change how the OMPI protocol sets
> things
> > >>>>>>>> up. Another thing may be the coalescing feature in the
> openib BTL
> > >>>>>>>> which tries to gang multiple messages into one packet when
> > >>>>>>>> resources are running low. I can see where changing
> the queue
> > >>>>>>>> sizes might affect the coalescing. So, it might be
> interesting to
> > >>>>>>>> turn off the coalescing. You can do that by setting "--mca
> > >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
> > >>>>>>>>
> > >>>>>>>> If that doesn't solve the issue then obviously there
> must be
> > >>>>>>>> something else going on :-).
> > >>>>>>>>
> > >>>>>>>> Note, the reason I am interested in this is I am seeing
> a similar
> > >>>>>>>> error condition (hdr->tag == 0) on a development
> system. Though
> > >>>>>>>> my failing case fails with np=8 using the connectivity test
> > >>>>>>>> program, which is mainly point-to-point, and there is not a
> > >>>>>>>> significant amount of data transfer going on either.
> > >>>>>>>>
> > >>>>>>>> --td
> > >>>>>>>>
> > >>>>>>>>> Eloi
> > >>>>>>>>>
> > >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
> > >>>>>>>>>> That is interesting. So does the number of processes affect
> > >>>>>>>>>> your runs at all? The times I've seen hdr->tag be 0 have
> > >>>>>>>>>> usually been due to protocol issues. The tag should never be
> > >>>>>>>>>> 0. Have you tried receive_queue settings other than the
> > >>>>>>>>>> default and the one you mention?
> > >>>>>>>>>>
> > >>>>>>>>>> I wonder if you did a combination of the two receive
> queues
> > >>>>>>>>>> causes a failure or not. Something like
> > >>>>>>>>>>
> > >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
> > >>>>>>>>>>
> > >>>>>>>>>> I am wondering if it is the first queuing definition
> causing the
> > >>>>>>>>>> issue or possibly the SRQ defined in the default.
> > >>>>>>>>>>
> > >>>>>>>>>> --td
> > >>>>>>>>>>
> > >>>>>>>>>> Eloi Gaudry wrote:
> > >>>>>>>>>>> Hi Terry,
> > >>>>>>>>>>>
> > >>>>>>>>>>> The messages being sent/received can be of any size, but the
> > >>>>>>>>>>> error seems to happen more often with small messages (such as
> > >>>>>>>>>>> an int being broadcast or allreduced). The failing
> > >>>>>>>>>>> communication differs from one run to another, but some spots
> > >>>>>>>>>>> are more likely to fail than others. And as far as I know,
> > >>>>>>>>>>> they are always located next to a small-message communication
> > >>>>>>>>>>> (an int being broadcast, for instance). Other typical message
> > >>>>>>>>>>> sizes are >10k but can be very much larger.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I've been checking the HCA being used; it's from Mellanox
> > >>>>>>>>>>> (with vendor_part_id=26428). There are no receive_queues
> > >>>>>>>>>>> parameters associated with it.
> > >>>>>>>>>>>
> > >>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini
> > >>>>>>>>>>> [...]
> > >>>>>>>>>>> # A.k.a. ConnectX
> > >>>>>>>>>>> [Mellanox Hermon]
> > >>>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> > >>>>>>>>>>> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> > >>>>>>>>>>> use_eager_rdma = 1
> > >>>>>>>>>>> mtu = 2048
> > >>>>>>>>>>> max_inline_data = 128
> > >>>>>>>>>>> [...]
> > >>>>>>>>>>>
> > >>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> > >>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
> > >>>>>>>>>>>
> > >>>>>>>>>>> I was wondering if these parameters (automatically computed at
> > >>>>>>>>>>> openib btl init, as far as I understand) were somehow
> > >>>>>>>>>>> incorrect, so I plugged in other values:
> > >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
> > >>>>>>>>>>> when encountering a different issue). Since then, I haven't
> > >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0 in
> > >>>>>>>>>>> btl_openib_component.c:2881) yet.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Eloi
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje
> wrote:
> > >>>>>>>>>>>> Eloi, I am curious about your problem. Can you
> tell me what
> > >>>>>>>>>>>> size of job it is? Does it always fail on the same
> bcast, or
> > >>>>>>>>>>>> same process?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Eloi Gaudry wrote:
> > >>>>>>>>>>>>> Hi Nysal,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for your suggestions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'm now able to get the checksum computed and
> redirected to
> > >>>>>>>>>>>>> stdout, thanks (I forgot the "-mca
> pml_base_verbose 5"
> > >>>>>>>>>>>>> option, you were right). I haven't been able to
> observe the
> > >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when
> using pml
> > >>>>>>>>>>>>> csum) but I 'll let you know when I am.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I've got two others question, which may be related
> to the
> > >>>>>>>>>>>>> error observed:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm objects that can be
> > >>>>>>>>>>>>> handled by OpenMPI somehow depend on the btl being used
> > >>>>>>>>>>>>> (i.e., if I'm using openib, may I use the same number of
> > >>>>>>>>>>>>> MPI_Comm objects as with tcp)? Is there something like
> > >>>>>>>>>>>>> MPI_COMM_MAX in OpenMPI?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2/ the segfaults only appear during an MPI collective call,
> > >>>>>>>>>>>>> with very small messages (one int being broadcast, for
> > >>>>>>>>>>>>> instance); I followed the guidelines given at
> > >>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
> > >>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
> > >>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the
> > >>>>>>>>>>>>> segfaults remain. Does the openib btl handle very small
> > >>>>>>>>>>>>> messages differently than tcp (even with eager_rdma
> > >>>>>>>>>>>>> deactivated)?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Others on the list: does coalescing happen with
> non-eager_rdma?
> > >>>>>>>>>>>> If so then that would possibly be one difference
> between the
> > >>>>>>>>>>>> openib btl and tcp aside from the actual protocol used.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> is there a way to make sure that large messages
> and small
> > >>>>>>>>>>>>> messages are handled the same way ?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Do you mean so they all look like eager messages?
> How large
> > >>>>>>>>>>>> of messages are we talking about here 1K, 1M or 10M?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --td
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
> > >>>>>>>>>>>>>> Hi Eloi,
> > >>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug)
> and while
> > >>>>>>>>>>>>>> running with the csum PML add "-mca
> pml_base_verbose 5" to
> > >>>>>>>>>>>>>> the command line. This will print the checksum
> details for
> > >>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didn't
> > >>>>>>>>>>>>>> catch anything because the BTL failed. The checksum
> > >>>>>>>>>>>>>> verification is done in the PML, which the BTL
> calls via a
> > >>>>>>>>>>>>>> callback function. In your case the PML callback
> is never
> > >>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling
> > >>>>>>>>>>>>>> checksum tracing also might not be of much use.
> Is it the
> > >>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what
> is the
> > >>>>>>>>>>>>>> message size? I'm not sure what could be the
> problem at
> > >>>>>>>>>>>>>> this moment. I'm afraid you will have to debug
> the BTL to
> > >>>>>>>>>>>>>> find out more.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> --Nysal
> > >>>>>>>>>>>>>>
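For reference, a run with the csum PML and the verbose output Nysal describes would look roughly like this, assuming OpenMPI was configured with --enable-debug (process count, host file, btl list and application name are placeholders):

    mpirun -np 16 --hostfile host.list \
        --mca btl openib,sm,self \
        --mca pml csum \
        --mca pml_base_verbose 5 \
        ./my_app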
> > >>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry
> <eg_at_[hidden] <mailto:eg_at_[hidden]>> wrote:
> > >>>>>>>>>>>>>>> Hi Nysal,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> thanks for your response.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I've been unable so far to write a test case
> that could
> > >>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
> > >>>>>>>>>>>>>>> Actually, I'm only observing this issue when
> running an
> > >>>>>>>>>>>>>>> internode computation involving infiniband
> hardware from
> > >>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
> > >>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I checked, double-checked, and rechecked again
> every MPI
> > >>>>>>>>>>>>>>> use performed during a parallel computation and
> I couldn't
> > >>>>>>>>>>>>>>> find any error so far. The fact that the very
> > >>>>>>>>>>>>>>> same parallel computation runs flawlessly when
> using tcp
> > >>>>>>>>>>>>>>> (and disabling openib support) might seem to
> indicate that
> > >>>>>>>>>>>>>>> the issue is somewhere located inside the
> > >>>>>>>>>>>>>>> openib btl or at the hardware/driver level.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I
> haven't
> > >>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the
> > >>>>>>>>>>>>>>> segfaults occurs). Any suggestion ?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan
> wrote:
> > >>>>>>>>>>>>>>>> Hi Eloi,
> > >>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read
> the entire
> > >>>>>>>>>>>>>>>> email thread, but do you have a test case which can
> > >>>>>>>>>>>>>>>> reproduce this error? Without that it will be
> difficult to
> > >>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not
> work for an
> > >>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce
> it on an IB
> > >>>>>>>>>>>>>>>> system. There is also a PML called csum, you
> can use it
> > >>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the
> MPI messages
> > >>>>>>>>>>>>>>>> and verify it at the receiver side for any data
> > >>>>>>>>>>>>>>>> corruption. You can try using it to see if it
> is able
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> catch anything.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>>>>> --Nysal
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry
> <eg_at_[hidden] <mailto:eg_at_[hidden]>> wrote:
> > >>>>>>>>>>>>>>>>> Hi Nysal,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if
> you had a
> > >>>>>>>>>>>>>>>>> chance to look
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> at
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> this error.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Eloi Gaudry
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Free Field Technologies
> > >>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
> > >>>>>>>>>>>>>>>>> Company Phone: +32 10 487 959
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
> > >>>>>>>>>>>>>>>>> From: Eloi Gaudry <eg_at_[hidden] <mailto:eg_at_[hidden]>>
> > >>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]
> <mailto:users_at_[hidden]>>
> > >>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
> > >>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault
> when using
> > >>>>>>>>>>>>>>>>> openib btl Hi,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to
> have a look at
> > >>>>>>>>>>>>>>>>> this issue.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi
> Gaudry wrote:
> > >>>>>>>>>>>>>>>>>> Hi Jeff,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Please find enclosed the output
> (valgrind.out.gz) from
> > >>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
> > >>>>>>>>>>>>>>>>>> pbn11,pbn10 --mca
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> btl
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> openib,self --display-map --verbose --mca
> > >>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
> btl_openib_want_fork_support 0
> > >>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind
> > >>>>>>>>>>>>>>>>>> --tool=memcheck
> > >>>>>>>>>>>>>>>>>>
> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/o
> > >>>>>>>>>>>>>>>>>> pen mp i- valgrind.supp
> > >>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
> > >>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi
> Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff
> Squyres wrote:
> > >>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry
> wrote:
> > >>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind
> but it
> > >>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there
> is a bunch
> > >>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1.4.2
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> with the suppression file), "Use of
> uninitialized
> > >>>>>>>>>>>>>>>>>>>>> bytes" and "Conditional jump depending on
> > >>>>>>>>>>>>>>>>>>>>> uninitialized bytes" in
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> different
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> ompi
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> routines. Some of them are located in
> > >>>>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an
> output of
> > >>>>>>>>>>>>>>>>>>>>> valgrind shortly.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be
> expected --
> > >>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some
> of its
> > >>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware
> of them (and
> > >>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
> > >>>>>>>>>>>>>>>>>>>> uninitialized).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> would it help if I used the upcoming 1.5 version of
> > >>>>>>>>>>>>>>>>>>> openmpi? I read that a huge effort has been made to
> > >>>>>>>>>>>>>>>>>>> clean up the valgrind output, but maybe this doesn't
> > >>>>>>>>>>>>>>>>>>> concern this btl (for the reasons you mentioned).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Another question, you said that the
> callback function
> > >>>>>>>>>>>>>>>>>>>>> pointer
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> should
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> never be 0. But can the tag be null
> (hdr->tag) ?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an
> integer.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I was wondering whether its value was allowed to be zero.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to
> build
> > >>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks for your help,
> > >>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
> > >>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function
> pointer
> > >>>>>>>>>>>>>>>>>>>>>> should never
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 0.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory
> corruption
> > >>>>>>>>>>>>>>>>>>>>>> is occurring.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because
> the stack
> > >>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through
> python, but
> > >>>>>>>>>>>>>>>>>>>>>> can you run this application through
> valgrind, or
> > >>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger?
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values
> of the
> > >>>>>>>>>>>>>>>>>>>>>>> function
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> parameters:
> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
> > >>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
> > >>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_eager_limit =
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 12288,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288,
> btl_max_send_size =
> > >>>>>>>>>>>>>>>>>>>>>>> 65536, btl_rdma_pipeline_send_length =
> 1048576,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> btl_min_rdma_pipeline_size
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> = 1060864, btl_exclusivity = 1024,
> btl_latency =
> > >>>>>>>>>>>>>>>>>>>>>>> 10, btl_bandwidth = 800, btl_flags = 310,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_add_procs =
> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb8ee47<mca_btl_openib_add_procs>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_del_procs =
> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb90156<mca_btl_openib_del_procs>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_register = 0, btl_finalize =
> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb93186<mca_btl_openib_finalize>,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> btl_alloc
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> =
> 0x2b341eb90a3e<mca_btl_openib_alloc>, btl_free
> > >>>>>>>>>>>>>>>>>>>>>>> = 0x2b341eb91400<mca_btl_openib_free>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_src =
> > >>>>>>>>>>>>>>>>>>>>>>>
> 0x2b341eb91813<mca_btl_openib_prepare_src>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> =
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> 0x2b341eb91f2e<mca_btl_openib_prepare_dst>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_send =
> 0x2b341eb94517<mca_btl_openib_send>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_sendi =
> 0x2b341eb9340d<mca_btl_openib_sendi>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_put =
> 0x2b341eb94660<mca_btl_openib_put>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_get =
> 0x2b341eb94c4e<mca_btl_openib_get>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_dump =
> 0x2b341acd45cb<mca_btl_base_dump>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_mpool = 0xf3f4110,
> btl_register_error =
> > >>>>>>>>>>>>>>>>>>>>>>>
> 0x2b341eb90565<mca_btl_openib_register_error_cb>,
> > >>>>>>>>>>>>>>>>>>>>>>> btl_ft_event
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> =
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb952e7<mca_btl_openib_ft_event>}
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
> > >>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
> > >>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
> > >>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi
> Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file
> generated during
> > >>>>>>>>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> segmentation
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call
> (using
> > >>>>>>>>>>>>>>>>>>>>>>>> openib):
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
> > >>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
> > >>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
> > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
> > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2
> 0x00002aedbc4e25e2
> > >>>>>>>>>>>>>>>>>>>>>>>> in handle_wc (device=0x19024ac0, cq=0,
> > >>>>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3
> 0x00002aedbc4e2e9d
> > >>>>>>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> poll_device
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> #4
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> (device=0x19024ac0)
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component_progress () at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> #6
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at
> > >>>>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7
> 0x00002aedb859f497
> > >>>>>>>>>>>>>>>>>>>>>>>> in opal_condition_wait (c=0x2aedb888ccc0,
> > >>>>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at
> > >>>>>>>>>>>>>>>>>>>>>>>> ../opal/threads/condition.h:99 #8
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in
> > >>>>>>>>>>>>>>>>>>>>>>>> ompi_request_default_wait_all
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> (count=2,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9
> 0x00002aedbd7559ad in
> > >>>>>>>>>>>>>>>>>>>>>>>>
> ompi_coll_tuned_allreduce_intra_recursivedoubling
> > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
> > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> coll_tuned_allreduce.c:223
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in
> > >>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed
> > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
> > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
> > >>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63
> > >>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
> > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> op=0x6787a20,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce
> (sendbuf=0x7ffff279d444,
> > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
> > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> op=0x6787a20,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004058be8 in
> FEMTown::Domain::align (itf=
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Int
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> er fa ce>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr =
> 0x7ffff279d620, ptr_ =
> > >>>>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ =
> 0x6}}},<No data
> > >>>>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in
> > >>>>>>>>>>>>>>>>>>>>>>>>
> FEMTown::Field::detail::align_itfs_and_neighbhors
> > >>>>>>>>>>>>>>>>>>>>>>>> (dim=2,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> set={px
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ =
> 0x2f279d640}},
> > >>>>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at
> check.cpp:63 #15
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 0x00000000040cbfa8
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements
> (set={px =
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> =
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}},
> check_info=@0x7ffff279d7f0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0,
> > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
> > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in
> > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object*
> (*)(_object*,
> > >>>>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object>
> > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0,
> po1=0x2aaab0765050,
> > >>>>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at
> > >>>>>>>>>>>>>>>>>>>>>>>>
> /home/qa/svntop/femtown/modules/main/py/exception.
> > >>>>>>>>>>>>>>>>>>>>>>>> hp p: 463
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> #18
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in
> PyField_align_elements_ewrap
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> (self=0x0,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
> > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90,
> throwflag=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value
> optimized out>,
> > >>>>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3,
> > >>>>>>>>>>>>>>>>>>>>>>>> argcount=1, kws=0x19ace4a0, kwcount=2,
> > >>>>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2,
> closure=0x0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
> > >>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in
> PyEval_EvalFrameEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value
> optimized out>) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22
> 0x000000000440aae9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120,
> > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
> locals=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x7, argcount=1,
> > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19acc418, kwcount=3,
> defs=0x2aaab759e958,
> > >>>>>>>>>>>>>>>>>>>>>>>> defcount=6, closure=0x0) at
> Python/ceval.c:2968
> > >>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in
> PyEval_EvalFrameEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value
> optimized out>) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #24
> 0x000000000440aae9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738,
> > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
> locals=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x6, argcount=1,
> > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19abd328, kwcount=5,
> defs=0x2aaab891b7e8,
> > >>>>>>>>>>>>>>>>>>>>>>>> defcount=3, closure=0x0) at
> Python/ceval.c:2968
> > >>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in
> PyEval_EvalFrameEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value
> optimized out>) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26
> 0x000000000440aae9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198,
> > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
> locals=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0xb, argcount=1,
> > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89df0, kwcount=10, defs=0x0,
> defcount=0,
> > >>>>>>>>>>>>>>>>>>>>>>>> closure=0x0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27
> 0x0000000004408f58 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value
> optimized out>) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28
> 0x000000000440aae9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288,
> > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
> locals=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x1, argcount=0,
> > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89330, kwcount=0,
> defs=0x2aaab8b66668,
> > >>>>>>>>>>>>>>>>>>>>>>>> defcount=1, closure=0x0) at
> Python/ceval.c:2968
> > >>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in
> PyEval_EvalFrameEx
> > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value
> optimized out>) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30
> 0x000000000440aae9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738,
> > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
> locals=<value
> > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x0, argcount=0,
> kws=0x0,
> > >>>>>>>>>>>>>>>>>>>>>>>> kwcount=0, defs=0x0, defcount=0,
> closure=0x0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
> > >>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode
> > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0,
> locals=0x190d9700) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32
> 0x000000000442853c in
> > >>>>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8
> > >>>>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value
> optimized
> > >>>>>>>>>>>>>>>>>>>>>>>> out>, globals=0x192213d0,
> locals=0x192213d0,
> > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at Python/pythonrun.c:1335 #33
> > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004429690 in
> PyRun_SimpleStringFlags
> > >>>>>>>>>>>>>>>>>>>>>>>> (command=0x192fd3d8 "DIRECT.Actran.main()",
> > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at
> > >>>>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34
> 0x0000000001fa1cf9 in
> > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> (this=0x7ffff279f650)
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> FEMTown::Main::Batch::run
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 0x0000000001f9aa25
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at
> > >>>>>>>>>>>>>>>>>>>>>>>> main.cpp:10 (gdb) f 1 #1
> 0x00002aedbc4e05f4 in
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
> (openib_btl=0x1902f9b0,
> > >>>>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700,
> byte_len=18) at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
> reg->cbfunc(
> > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
> reg->cbdata
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> );
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
> > >>>>>>>>>>>>>>>>>>>>>>>> (gdb)
> > >>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
> > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
> > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
> > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
> reg->cbfunc(
> > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
> reg->cbdata
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> );
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876
> > >>>>>>>>>>>>>>>>>>>>>>>> 2877         if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> > >>>>>>>>>>>>>>>>>>>>>>>> 2878             /* call registered callback */
> > >>>>>>>>>>>>>>>>>>>>>>>> 2879             mca_btl_active_message_callback_t* reg;
> > >>>>>>>>>>>>>>>>>>>>>>>> 2880             reg = mca_btl_base_active_message_trigger + hdr->tag;
> > >>>>>>>>>>>>>>>>>>>>>>>> 2881             reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > >>>>>>>>>>>>>>>>>>>>>>>> 2882             if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> > >>>>>>>>>>>>>>>>>>>>>>>> 2883                 cqp = (hdr->credits >> 11) & 0x0f;
> > >>>>>>>>>>>>>>>>>>>>>>>> 2884                 hdr->credits &= 0x87ff;
> > >>>>>>>>>>>>>>>>>>>>>>>> 2885             } else {
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>
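Based on the listing above, one illustrative way to make the failure more visible while debugging would be to guard the dispatch so that a zero tag is reported instead of calling a NULL callback; this is only a sketch of the idea, not a tested patch:

    mca_btl_active_message_callback_t *reg =
        mca_btl_base_active_message_trigger + hdr->tag;

    if (OPAL_UNLIKELY(0 == hdr->tag || NULL == reg->cbfunc)) {
        /* report the corrupted fragment instead of segfaulting on it */
        opal_output(0, "openib btl: dropping fragment with invalid tag %d",
                    (int) hdr->tag);
    } else {
        reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata);
    }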
> > >>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi
> Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observe was that the
> > >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault sometimes appeared later
> > >>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could use
> > >>>>>>>>>>>>>>>>>>>>>>>>> "--mca coll tuned" with "--mca btl self,sm,tcp" so
> > >>>>>>>>>>>>>>>>>>>>>>>>> that I could check that the issue is not somehow
> > >>>>>>>>>>>>>>>>>>>>>>>>> limited to the tuned collective routines.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24
> Edgar Gabriel wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> as well. The segmentation fault I'm observing
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> indeed always happened during a collective
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> communication... it basically switches all
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> collective communications to basic mode, right?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a
> NCA ?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand
> > >>>>>>>>>>>>>>>>>>>>>>>>>> networking card)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54
> Edgar Gabriel wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the
> algorithms in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> module,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> e.g.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a
> difference. I
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> used to
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> observe
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> sometimes a (similar ?) problem in
> the openib
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> component, in cases where the ofed
> libraries
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> were installed but no NCA was found
> on a node.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> It used to work however with the basic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> component.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I couldn't get rid of that
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. I'm now going to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if that helps.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
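A minimal sketch of such a naive broadcast (the root sends to every other rank with plain point-to-point calls); it illustrates the idea described above and is not the actual replacement code:

    #include <mpi.h>

    static int naive_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
    {
        int rank, size, peer, rc;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            /* root sends the buffer to every other rank, one by one */
            for (peer = 0; peer < size; ++peer) {
                if (peer == root) continue;
                rc = MPI_Send(buf, count, type, peer, 0, comm);
                if (rc != MPI_SUCCESS) return rc;
            }
        } else {
            /* every other rank receives the buffer directly from the root */
            rc = MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
            if (rc != MPI_SUCCESS) return rc;
        }
        return MPI_SUCCESS;
    }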
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53
> Eloi Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check whether the segmentation fault
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when using the basic linear bcast
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm with the proper command line you provided.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast
> algorithms,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca
> parameter
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use
> dynamic
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are
> typing this in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the
> following:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca
> coll_tuned_use_dynamic_rules
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 --param
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> coll
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output
> with all the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be
> selected
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives.
> Therefore,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca
> coll_tuned_use_dynamic_rules 1 --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
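Put together, the two parameters Rolf describes would be passed roughly like this (process count, host file and application name are placeholders):

    mpirun -np 16 --hostfile host.list \
        --mca coll_tuned_use_dynamic_rules 1 \
        --mca coll_tuned_bcast_algorithm 1 \
        ./my_app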
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry
> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed me to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway, whatever the algorithm used, the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diagnose the issue I'm facing?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58
> Eloi Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast
> routine
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly
> segfault when
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> like
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know if there is any way to
> make OpenMPI
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithm than the
> default one
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> being selected for MPI_Bcast.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010
> 11:06:52 Eloi Gaudry wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random
> segmentation
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internode parallel
> computation involving
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> openib
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> btl
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same
> issue can be
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain'
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the
> openib btl
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl
> self,sm,tcp on the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance),
> I don't
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the
> parallel
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to understand why this issue is observed only when
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   using the openib btl and not when using self,sm,tcp
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much
> appreciated.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> information
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as
> well.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used
> when
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> parallel
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile
> host.list --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp
> --display-map
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
> 0 [...]
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if
> not using infiniband:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile
> host.list --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp
> --display-map --verbose
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> --mca
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 0
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...]
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone: +32 10 487 959
>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
*Eloi Gaudry*
Senior Product and Development Engineer -- HPC & IT Manager
Company phone: +32 10 45 12 26 | Direct line: +32 10 49 51 47
Company fax: +32 10 45 46 26 | Email: eloi.gaudry_at_[hidden]
Website: www.fft.be