Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-09-27 11:22:15


Ok, there were no 0-value tags in your files. Are you running this with
eager RDMA disabled? If not, can you set the following options: "-mca
btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca
btl_openib_flags 1"?

thanks,

--td

Eloi Gaudry wrote:
> Terry,
>
> Please find enclosed the requested check outputs (using -output-filename stdout.tag.null option).
> I'm displaying frag->hdr->tag here.
>
> Eloi
>
> On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
>
>> Eloi, sorry can you print out frag->hdr->tag?
>>
>> Unfortunately from your last email I think it will still all have
>> non-zero values.
>> If that ends up being the case then there must be something odd with the
>> descriptor pointer to the fragment.
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>
>>> Terry,
>>>
>>> Please find enclosed the requested check outputs (using -output-filename
>>> stdout.tag.null option).
>>>
>>> For information, Nysal in his first message referred to
>>> ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on
>>> the receiving side:
>>> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
>>> #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
>>> #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
>>> #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
>>> #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
>>> #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
>>> #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
>>> #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
>>> #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
>>> and in ompi/mca/btl/btl.h:
>>> #define MCA_BTL_TAG_PML 0x40
>>>
>>> Eloi
>>>
>>> On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
>>>
>>>> I am thinking of checking the value of *frag->hdr right before the return
>>>> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
>>>> It is line 548 in the trunk:
>>>> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
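>>>>
>>>> Something along these lines, just as a sketch (the exact variable names
>>>> in scope in post_send may differ):
>>>>
>>>>     /* right before the return in post_send() */
>>>>     if (0 == frag->hdr->tag) {
>>>>         opal_output(0, "post_send: frag %p has hdr->tag == 0",
>>>>                     (void*)frag);
>>>>     }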
>>>>
>>>> --td
>>>>
>>>> Eloi Gaudry wrote:
>>>>
>>>>> Hi Terry,
>>>>>
>>>>> Do you have any patch that I could apply to be able to do so ? I'm
>>>>> remotely working on a cluster (with a terminal) and I cannot use any
>>>>> parallel debugger or sequential debugger (with a call to xterm...). I
>>>>> can track frag->hdr->tag value in
>>>>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
>>>>> SEND/RDMA_WRITE case, but this is all I can think of alone.
>>>>>
>>>>> You'll find a stacktrace (receive side) in this thread (10th or 11th
>>>>> message) but it might be pointless.
>>>>>
>>>>> Regards,
>>>>> Eloi
>>>>>
>>>>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>>>>
>>>>>> So it sounds like coalescing is not your issue and that the problem
>>>>>> has something to do with the queue sizes. It would be helpful if we
>>>>>> could detect the hdr->tag == 0 issue on the sending side and get at
>>>>>> least a stack trace. There is something really odd going on here.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>> Eloi Gaudry wrote:
>>>>>>
>>>>>>> Hi Terry,
>>>>>>>
>>>>>>> I'm sorry to say that I might have missed a point here.
>>>>>>>
>>>>>>> I've lately been relaunching all previously failing computations with
>>>>>>> the message coalescing feature being switched off, and I saw the same
>>>>>>> hdr->tag=0 error several times, always during a collective call
>>>>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Bcast, so far). And as
>>>>>>> soon as I switched to the peer queue option I was previously using
>>>>>>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
>>>>>>> --mca btl_openib_use_message_coalescing 0), all computations ran
>>>>>>> flawlessly.
>>>>>>>
>>>>>>> As for the reproducer, I've already tried to write something but I
>>>>>>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>>>>
>>>>>>> Eloi
>>>>>>>
>>>>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>>>>
>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>
>>>>>>>>> Terry,
>>>>>>>>>
>>>>>>>>> You were right, the error indeed seems to come from the message
>>>>>>>>> coalescing feature. If I turn it off using the "--mca
>>>>>>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the
>>>>>>>>> "hdr->tag=0" error.
>>>>>>>>>
>>>>>>>>> There are some trac tickets associated with a very similar error
>>>>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they
>>>>>>>>> are all closed (except
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be
>>>>>>>>> related), aren't they ? What would you suggest Terry ?
>>>>>>>>>
>>>>>>>> Interesting, though it looks to me like the segv in ticket 2352
>>>>>>>> would have happened on the send side instead of the receive side
>>>>>>>> like you have. As to what to do next it would be really nice to
>>>>>>>> have some sort of reproducer that we can try and debug what is
>>>>>>>> really going on. The only other thing to do without a reproducer
>>>>>>>> is to inspect the code on the send side to figure out what might
>>>>>>>> make it generate a 0 hdr->tag. Or maybe instrument the send side
>>>>>>>> to stop when it is about ready to send a 0 hdr->tag and see if we
>>>>>>>> can see how the code got there.
>>>>>>>>
>>>>>>>> I might have some cycles to look at this Monday.
>>>>>>>>
>>>>>>>> --td
>>>>>>>>
>>>>>>>>
>>>>>>>>> Eloi
>>>>>>>>>
>>>>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>>>>>
>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>
>>>>>>>>>>> Terry,
>>>>>>>>>>>
>>>>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128
>>>>>>>>>>> yet.
>>>>>>>>>>>
>>>>>>>>>>> The reason why is quite simple. I've been reading and reading
>>>>>>>>>>> again this thread to understand the btl_openib_receive_queues
>>>>>>>>>>> meaning and I can't figure out why the default values seem to
>>>>>>>>>>> induce the hdr->tag=0 issue
>>>>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>>>>>>
>>>>>>>>>> Yeah, the size of the fragments and number of them really should
>>>>>>>>>> not cause this issue. So I too am a little perplexed about it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Do you think that the default shared receive queue parameters
>>>>>>>>>>> are erroneous for this specific Mellanox card ? Any help on
>>>>>>>>>>> finding the proper parameters would actually be much
>>>>>>>>>>> appreciated.
>>>>>>>>>>>
>>>>>>>>>> I don't necessarily think it is the queue size for a specific card
>>>>>>>>>> but more so the handling of the queues by the BTL when using
>>>>>>>>>> certain sizes. At least that is one gut feel I have.
>>>>>>>>>>
>>>>>>>>>> In my mind, the tag being 0 means either something below OMPI is
>>>>>>>>>> polluting the data fragment or OMPI's internal protocol is somehow
>>>>>>>>>> getting messed up. I can imagine (no empirical data here)
>>>>>>>>>> the queue sizes could change how the OMPI protocol sets things
>>>>>>>>>> up. Another thing may be the coalescing feature in the openib BTL
>>>>>>>>>> which tries to gang multiple messages into one packet when
>>>>>>>>>> resources are running low. I can see where changing the queue
>>>>>>>>>> sizes might affect the coalescing. So, it might be interesting to
>>>>>>>>>> turn off the coalescing. You can do that by setting "--mca
>>>>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
>>>>>>>>>>
>>>>>>>>>> If that doesn't solve the issue then obviously there must be
>>>>>>>>>> something else going on :-).
>>>>>>>>>>
>>>>>>>>>> Note, the reason I am interested in this is I am seeing a similar
>>>>>>>>>> error condition (hdr->tag == 0) on a development system. Though
>>>>>>>>>> my failing case fails with np=8 using the connectivity test
>>>>>>>>>> program, which is mainly point-to-point, and there is not a
>>>>>>>>>> significant amount of data transfer going on either.
>>>>>>>>>>
>>>>>>>>>> --td
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Eloi
>>>>>>>>>>>
>>>>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
>>>>>>>>>>>
>>>>>>>>>>>> That is interesting. So does the number of processes affect
>>>>>>>>>>>> your runs at all? The times I've seen hdr->tag be 0, it has usually
>>>>>>>>>>>> been due to protocol issues. The tag should never be 0. Have
>>>>>>>>>>>> you tried receive_queue settings other than the
>>>>>>>>>>>> default and the one you mention?
>>>>>>>>>>>>
>>>>>>>>>>>> I wonder whether a combination of the two receive queues
>>>>>>>>>>>> causes a failure or not. Something like
>>>>>>>>>>>>
>>>>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
>>>>>>>>>>>>
>>>>>>>>>>>> I am wondering if it is the first queuing definition causing the
>>>>>>>>>>>> issue or possibly the SRQ defined in the default.
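>>>>>>>>>>>>
>>>>>>>>>>>> For illustration only (adjust the np/hostfile/application parts
>>>>>>>>>>>> to your setup), that combination would be passed as:
>>>>>>>>>>>>
>>>>>>>>>>>> mpirun -np <N> --hostfile host.list \
>>>>>>>>>>>>     --mca btl openib,sm,self \
>>>>>>>>>>>>     --mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128 \
>>>>>>>>>>>>     <your_application>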
>>>>>>>>>>>>
>>>>>>>>>>>> --td
>>>>>>>>>>>>
>>>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Terry,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The messages being sent/received can be of any size, but the
>>>>>>>>>>>>> error seems to happen more often with small messages (such as an int
>>>>>>>>>>>>> being broadcast or allreduced). The failing communication
>>>>>>>>>>>>> differs from one run to another, but some spots are more likely
>>>>>>>>>>>>> to fail than others. And as far as I know, they are
>>>>>>>>>>>>> always located next to a small-message communication (an int being
>>>>>>>>>>>>> broadcast, for instance). Other typical
>>>>>>>>>>>>> message sizes are >10k but can be very much larger.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been checking the HCA being used; it's from Mellanox (with
>>>>>>>>>>>>> vendor_part_id=26428). There are no receive_queues parameters
>>>>>>>>>>>>> associated with it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>
>>>>>>>>>>>>> # A.k.a. ConnectX
>>>>>>>>>>>>> [Mellanox Hermon]
>>>>>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
>>>>>>>>>>>>> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
>>>>>>>>>>>>> use_eager_rdma = 1
>>>>>>>>>>>>> mtu = 2048
>>>>>>>>>>>>> max_inline_data = 128
>>>>>>>>>>>>>
>>>>>>>>>>>>> [..]
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
>>>>>>>>>>>>>
>>>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
>>>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
>>>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
>>>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
>>>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was wondering if these parameters (automatically computed at
>>>>>>>>>>>>> openib btl init, from what I understood) were somehow incorrect,
>>>>>>>>>>>>> so I plugged in other values:
>>>>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
>>>>>>>>>>>>> when encountering a different issue). Since then, I haven't
>>>>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0 in
>>>>>>>>>>>>> btl_openib_component.c:2881) yet.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what
>>>>>>>>>>>>>> size of job it is? Does it always fail on the same bcast, or
>>>>>>>>>>>>>> same process?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm now able to get the checksum computed and redirected to
>>>>>>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5"
>>>>>>>>>>>>>>> option, you were right). I haven't been able to observe the
>>>>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
>>>>>>>>>>>>>>> csum) but I'll let you know if I do.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've got two others question, which may be related to the
>>>>>>>>>>>>>>> error observed:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by
>>>>>>>>>>>>>>> OpenMPI somehow depend on the btl being used (i.e. if I'm
>>>>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects as
>>>>>>>>>>>>>>> with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2/ the segfaults only appear during an MPI collective call,
>>>>>>>>>>>>>>> with a very small message (one int being broadcast, for
>>>>>>>>>>>>>>> instance); I followed the guidelines given at
>>>>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
>>>>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
>>>>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the segfaults
>>>>>>>>>>>>>>> remain. Does the openib btl handle very small messages
>>>>>>>>>>>>>>> differently (even with eager_rdma
>>>>>>>>>>>>>>> deactivated) than tcp?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma?
>>>>>>>>>>>>>> If so, then that would possibly be one difference between the
>>>>>>>>>>>>>> openib btl and tcp, aside from the actual protocol used.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is there a way to make sure that large messages and small
>>>>>>>>>>>>>>> messages are handled the same way ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you mean so they all look like eager messages? How large
>>>>>>>>>>>>>> are the messages we are talking about here: 1K, 1M or 10M?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --td
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while
>>>>>>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to
>>>>>>>>>>>>>>>> the command line. This will print the checksum details for
>>>>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didn't
>>>>>>>>>>>>>>>> catch anything because the BTL failed. The checksum
>>>>>>>>>>>>>>>> verification is done in the PML, which the BTL calls via a
>>>>>>>>>>>>>>>> callback function. In your case the PML callback is never
>>>>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling
>>>>>>>>>>>>>>>> checksum tracing also might not be of much use. Is it the
>>>>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what is the
>>>>>>>>>>>>>>>> message size? I'm not sure what could be the problem at
>>>>>>>>>>>>>>>> this moment. I'm afraid you will have to debug the BTL to
>>>>>>>>>>>>>>>> find out more.
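>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example (the paths, hostfile and process count below are
>>>>>>>>>>>>>>>> only placeholders, adapt them to your installation):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ./configure --prefix=<install_dir> --enable-debug && make && make install
>>>>>>>>>>>>>>>> <install_dir>/bin/mpirun -np <N> --hostfile host.list \
>>>>>>>>>>>>>>>>     -mca pml csum -mca pml_base_verbose 5 \
>>>>>>>>>>>>>>>>     -mca btl openib,sm,self <your_application>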
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks for your response.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've been unable so far to write a test case that could
>>>>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
>>>>>>>>>>>>>>>>> Actually, I'm only observing this issue when running an
>>>>>>>>>>>>>>>>> internode computation involving infiniband hardware from
>>>>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
>>>>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI
>>>>>>>>>>>>>>>>> use performed during a parallel computation and I couldn't
>>>>>>>>>>>>>>>>> find any error so far. The fact that the very
>>>>>>>>>>>>>>>>> same parallel computation runs flawlessly when using tcp
>>>>>>>>>>>>>>>>> (and disabling openib support) might seem to indicate that
>>>>>>>>>>>>>>>>> the issue is located somewhere inside the
>>>>>>>>>>>>>>>>> openib btl or at the hardware/driver level.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't
>>>>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the
>>>>>>>>>>>>>>>>> segfault occurs). Any suggestions?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire
>>>>>>>>>>>>>>>>>> email thread, but do you have a test case which can
>>>>>>>>>>>>>>>>>> reproduce this error? Without that it will be difficult to
>>>>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not work for an
>>>>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce it on an IB
>>>>>>>>>>>>>>>>>> system. There is also a PML called csum, you can use it
>>>>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the MPI messages
>>>>>>>>>>>>>>>>>> and verify it at the receiver side for any data
>>>>>>>>>>>>>>>>>> corruption. You can try using it to see if it is able
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> catch anything.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
>>>>>>>>>>>>>>>>>>> chance to look at this error.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Eloi Gaudry
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Free Field Technologies
>>>>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
>>>>>>>>>>>>>>>>>>> Company Phone: +32 10 487 959
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>>>>>> From: Eloi Gaudry <eg_at_[hidden]>
>>>>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
>>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using openib btl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at
>>>>>>>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from
>>>>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
>>>>>>>>>>>>>>>>>>>> pbn11,pbn10 --mca btl openib,self --display-map --verbose
>>>>>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
>>>>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind
>>>>>>>>>>>>>>>>>>>> --tool=memcheck
>>>>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
>>>>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
>>>>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
>>>>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch
>>>>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1.4.2
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> with the suppression file), "Use of uninitialized
>>>>>>>>>>>>>>>>>>>>>>> bytes" and "Conditional jump depending on
>>>>>>>>>>>>>>>>>>>>>>> uninitialized bytes" in
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ompi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> routines. Some of them are located in
>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an output of
>>>>>>>>>>>>>>>>>>>>>>> valgrind shortly.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected --
>>>>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its
>>>>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware of them (and
>>>>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
>>>>>>>>>>>>>>>>>>>>>> uninitialized).
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> would it help if I use the upcoming 1.5 version of
>>>>>>>>>>>>>>>>>>>>> openmpi? I read that a huge effort has been done to clean up
>>>>>>>>>>>>>>>>>>>>> the valgrind output? but maybe this doesn't concern this btl
>>>>>>>>>>>>>>>>>>>>> (for the reasons you mentioned).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Another question, you said that the callback function
>>>>>>>>>>>>>>>>>>>>>>> pointer
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> never be 0. But can the tag be null (hdr->tag) ?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I was worrying that its value could not be null.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build
>>>>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer
>>>>>>>>>>>>>>>>>>>>>>>> should never be 0.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption
>>>>>>>>>>>>>>>>>>>>>>>> is occurring.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack
>>>>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through python, but
>>>>>>>>>>>>>>>>>>>>>>>> can you run this application through valgrind, or
>>>>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the
>>>>>>>>>>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> parameters:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
>>>>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
>>>>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_eager_limit =
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 12288,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size =
>>>>>>>>>>>>>>>>>>>>>>>>> 65536, btl_rdma_pipeline_send_length = 1048576,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> btl_min_rdma_pipeline_size
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> = 1060864, btl_exclusivity = 1024, btl_latency =
>>>>>>>>>>>>>>>>>>>>>>>>> 10, btl_bandwidth = 800, btl_flags = 310,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_add_procs =
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb8ee47<mca_btl_openib_add_procs>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_del_procs =
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb90156<mca_btl_openib_del_procs>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_register = 0, btl_finalize =
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb93186<mca_btl_openib_finalize>,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> btl_alloc
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> = 0x2b341eb90a3e<mca_btl_openib_alloc>, btl_free
>>>>>>>>>>>>>>>>>>>>>>>>> = 0x2b341eb91400<mca_btl_openib_free>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_prepare_src =
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb91813<mca_btl_openib_prepare_src>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb91f2e<mca_btl_openib_prepare_dst>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517<mca_btl_openib_send>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d<mca_btl_openib_sendi>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660<mca_btl_openib_put>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e<mca_btl_openib_get>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb<mca_btl_base_dump>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_mpool = 0xf3f4110, btl_register_error =
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb90565<mca_btl_openib_register_error_cb>,
>>>>>>>>>>>>>>>>>>>>>>>>> btl_ft_event
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 0x2b341eb952e7<mca_btl_openib_ft_event>}
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
>>>>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
>>>>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
>>>>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during
>>>>>>>>>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> segmentation
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call (using
>>>>>>>>>>>>>>>>>>>>>>>>>> openib):
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
>>>>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
>>>>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2 0x00002aedbc4e25e2
>>>>>>>>>>>>>>>>>>>>>>>>>> in handle_wc (device=0x19024ac0, cq=0,
>>>>>>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3 0x00002aedbc4e2e9d
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> poll_device
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> #4
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (device=0x19024ac0)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component_progress () at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> #6
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at
>>>>>>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7 0x00002aedb859f497
>>>>>>>>>>>>>>>>>>>>>>>>>> in opal_condition_wait (c=0x2aedb888ccc0,
>>>>>>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at
>>>>>>>>>>>>>>>>>>>>>>>>>> ../opal/threads/condition.h:99 #8
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in
>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_request_default_wait_all
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (count=2,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9 0x00002aedbd7559ad in
>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling
>>>>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
>>>>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
>>>>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> coll_tuned_allreduce.c:223
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in
>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed
>>>>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
>>>>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
>>>>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63
>>>>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
>>>>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> op=0x6787a20,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444,
>>>>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
>>>>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> op=0x6787a20,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004058be8 in FEMTown::Domain::align (itf=
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Int
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> er fa ce>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ =
>>>>>>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ = 0x6}}},<No data
>>>>>>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in
>>>>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Field::detail::align_itfs_and_neighbhors
>>>>>>>>>>>>>>>>>>>>>>>>>> (dim=2,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> set={px
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}},
>>>>>>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at check.cpp:63 #15
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 0x00000000040cbfa8
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements (set={px =
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0,
>>>>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
>>>>>>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in
>>>>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object* (*)(_object*,
>>>>>>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object>
>>>>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050,
>>>>>>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at
>>>>>>>>>>>>>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception.
>>>>>>>>>>>>>>>>>>>>>>>>>> hp p: 463
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> #18
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in PyField_align_elements_ewrap
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (self=0x0,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
>>>>>>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value optimized out>,
>>>>>>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3,
>>>>>>>>>>>>>>>>>>>>>>>>>> argcount=1, kws=0x19ace4a0, kwcount=2,
>>>>>>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2, closure=0x0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22 0x000000000440aae9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120,
>>>>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x7, argcount=1,
>>>>>>>>>>>>>>>>>>>>>>>>>> kws=0x19acc418, kwcount=3, defs=0x2aaab759e958,
>>>>>>>>>>>>>>>>>>>>>>>>>> defcount=6, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #24 0x000000000440aae9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738,
>>>>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x6, argcount=1,
>>>>>>>>>>>>>>>>>>>>>>>>>> kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8,
>>>>>>>>>>>>>>>>>>>>>>>>>> defcount=3, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26 0x000000000440aae9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198,
>>>>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0xb, argcount=1,
>>>>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0,
>>>>>>>>>>>>>>>>>>>>>>>>>> closure=0x0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27 0x0000000004408f58 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value optimized out>) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28 0x000000000440aae9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288,
>>>>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x1, argcount=0,
>>>>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668,
>>>>>>>>>>>>>>>>>>>>>>>>>> defcount=1, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30 0x000000000440aae9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738,
>>>>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>, locals=<value
>>>>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x0, argcount=0, kws=0x0,
>>>>>>>>>>>>>>>>>>>>>>>>>> kwcount=0, defs=0x0, defcount=0, closure=0x0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode
>>>>>>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32 0x000000000442853c in
>>>>>>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8
>>>>>>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value optimized
>>>>>>>>>>>>>>>>>>>>>>>>>> out>, globals=0x192213d0, locals=0x192213d0,
>>>>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at Python/pythonrun.c:1335 #33
>>>>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004429690 in PyRun_SimpleStringFlags
>>>>>>>>>>>>>>>>>>>>>>>>>> (command=0x192fd3d8 "DIRECT.Actran.main()",
>>>>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at
>>>>>>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34 0x0000000001fa1cf9 in
>>>>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (this=0x7ffff279f650)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> FEMTown::Main::Batch::run
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 0x0000000001f9aa25
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at
>>>>>>>>>>>>>>>>>>>>>>>>>> main.cpp:10 (gdb) f 1 #1 0x00002aedbc4e05f4 in
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming (openib_btl=0x1902f9b0,
>>>>>>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc(
>>>>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des, reg->cbdata
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> );
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
>>>>>>>>>>>>>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
>>>>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
>>>>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881 reg->cbfunc(
>>>>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des, reg->cbdata
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> );
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876
>>>>>>>>>>>>>>>>>>>>>>>>>> 2877 if(OPAL_LIKELY(!(is_credit_msg =
>>>>>>>>>>>>>>>>>>>>>>>>>> is_credit_message(frag)))) { 2878 /*
>>>>>>>>>>>>>>>>>>>>>>>>>> call registered callback */
>>>>>>>>>>>>>>>>>>>>>>>>>> 2879 mca_btl_active_message_callback_t*
>>>>>>>>>>>>>>>>>>>>>>>>>> reg; 2880 reg =
>>>>>>>>>>>>>>>>>>>>>>>>>> mca_btl_base_active_message_trigger + hdr->tag;
>>>>>>>>>>>>>>>>>>>>>>>>>> 2881 reg->cbfunc(&openib_btl->super, hdr->tag,
>>>>>>>>>>>>>>>>>>>>>>>>>> des, reg->cbdata ); 2882
>>>>>>>>>>>>>>>>>>>>>>>>>> if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) { 2883
>>>>>>>>>>>>>>>>>>>>>>>>>> cqp
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> =
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> (hdr->credits>> 11)& 0x0f;
>>>>>>>>>>>>>>>>>>>>>>>>>> 2884 hdr->credits&= 0x87ff;
>>>>>>>>>>>>>>>>>>>>>>>>>> 2885 } else {
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observed was that the
>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later
>>>>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of idea here. I wish I could use
>>>>>>>>>>>>>>>>>>>>>>>>>>> the "--mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> coll
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> tuned" with "--mca self,sm,tcp" so that I could
>>>>>>>>>>>>>>>>>>>>>>>>>>> check that the issue is not somehow limited to
>>>>>>>>>>>>>>>>>>>>>>>>>>> the tuned collective routines.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault i'm observing always
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> happened during a collective communication
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> indeed... does it basically
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> switch
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> collective communication to basic mode, right ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InifinBand
>>>>>>>>>>>>>>>>>>>>>>>>>>>> networking card)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> module,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> used to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> observe
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sometimes a (similar ?) problem in the openib
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> component, in cases where the ofed libraries
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> were installed but no NCA was found on a node.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It used to work however with the basic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> component.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. i'm now going to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I miss
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if I the segmentation fault
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic bcast linear algorithm using the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proper command line you provided.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca parameter
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use dynamic
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 --param
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> coll
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives. Therefore,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
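>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For example (with ./mytest as a placeholder
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for your application), the full command would be:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll_tuned_use_dynamic_rules 1 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     --mca coll_tuned_bcast_algorithm 1 ./mytest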
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway whatever the algorithm used, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does anyone could give some advice on ways
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diagnose
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue I'm facing ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly segfault when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know if there is any way to make OpenMPI
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithm than the default one
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> being selected for MPI_Bcast.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internode parallel computation involving
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> openib
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> btl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/hel
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> p/ [pbn08:02624] *** Process received
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> signal *** [pbn08:02624] Signal:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segmentation fault (11)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> not mapped
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (1)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (nil) [pbn08:02624] [ 0]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0 [0x349540e4c0]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> message
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ***
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/R
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ed Ha tE L\ -5 \/ x 86 _6 4\
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /bin\/actranpy_mp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/Re
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dH at EL -5 /x 86 _ 64 /A c
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tran_11.0.rc2.41872'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3D
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> re al _m 4_ n2 .d a t'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ' '--mem=3200' '--threads=1'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--errorlevel=FATAL' '--t_max=0.1'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl self,sm,tcp on the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance), I don't
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the parallel
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> able: - to diagnose the issue I'm
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> facing with the openib btl - understand
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why this issue is observed only when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> information
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> parallel
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> $NPROCESS --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --mca
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 0
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>>>
>
>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


