
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-09-27 10:29:12


Eloi, sorry, can you print out frag->hdr->tag?

Unfortunately, from your last email I think it will still show all non-zero
values. If that ends up being the case, then there must be something odd with
the descriptor pointer to the fragment.
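
Something as simple as the following, dropped right before the spot you want
to inspect (the return in post_send, or the dispatch in
btl_openib_handle_incoming), would be enough. This is just a sketch against
the 1.4.2 sources, so the exact field names may differ slightly:

    /* debugging sketch only: dump the descriptor and its tag before use */
    opal_output(0, "frag=%p hdr=%p hdr->tag=0x%x",
                (void *) frag, (void *) frag->hdr, (int) frag->hdr->tag);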

--td

Eloi Gaudry wrote:
> Terry,
>
> Please find enclosed the requested check outputs (using -output-filename stdout.tag.null option).
>
> For information, Nysal, in his first message, referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side.
> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> #define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
> #define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
> #define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4)
> #define MCA_PML_OB1_HDR_TYPE_NACK (MCA_BTL_TAG_PML + 5)
> #define MCA_PML_OB1_HDR_TYPE_FRAG (MCA_BTL_TAG_PML + 6)
> #define MCA_PML_OB1_HDR_TYPE_GET (MCA_BTL_TAG_PML + 7)
> #define MCA_PML_OB1_HDR_TYPE_PUT (MCA_BTL_TAG_PML + 8)
> #define MCA_PML_OB1_HDR_TYPE_FIN (MCA_BTL_TAG_PML + 9)
> and in ompi/mca/btl/btl.h
> #define MCA_BTL_TAG_PML 0x40
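>
> So, if I read these defines correctly, an ob1 header should always carry a
> tag between 0x41 (MATCH) and 0x49 (FIN), and never 0. A tiny standalone
> program, just to spell out that arithmetic (it only reuses the defines
> above, it is not OpenMPI code):
>
>     #include <stdio.h>
>
>     #define MCA_BTL_TAG_PML            0x40
>     #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
>     #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
>
>     int main(void)
>     {
>         /* legal ob1 header tags run from MATCH to FIN; 0 is never valid */
>         printf("valid ob1 tags: 0x%02x .. 0x%02x\n",
>                MCA_PML_OB1_HDR_TYPE_MATCH, MCA_PML_OB1_HDR_TYPE_FIN);
>         return 0;
>     }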
>
> Eloi
>
> On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
>
>> I am thinking of checking the value of *frag->hdr right before the return
>> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
>> It is line 548 in the trunk:
>> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>
>>> Hi Terry,
>>>
>>> Do you have any patch that I could apply to be able to do so ? I'm
>>> remotely working on a cluster (with a terminal) and I cannot use any
>>> parallel debugger or sequential debugger (with a call to xterm...). I
>>> can track frag->hdr->tag value in
>>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
>>> SEND/RDMA_WRITE case, but this is all I can think of alone.
>>>
>>> You'll find a stacktrace (receive side) in this thread (10th or 11th
>>> message) but it might be pointless.
>>>
>>> Regards,
>>> Eloi
>>>
>>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>>
>>>> So it sounds like coalescing is not your issue and that the problem has
>>>> something to do with the queue sizes. It would be helpful if we could
>>>> detect the hdr->tag == 0 issue on the sending side and get at least a
>>>> stack trace. There is something really odd going on here.
>>>>
>>>> --td
>>>>
>>>> Eloi Gaudry wrote:
>>>>
>>>>> Hi Terry,
>>>>>
>>>>> I'm sorry to say that I might have missed a point here.
>>>>>
>>>>> I've lately been relaunching all previously failing computations with
>>>>> the message coalescing feature being switched off, and I saw the same
>>>>> hdr->tag=0 error several times, always during a collective call
>>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
>>>>> soon as I switched to the peer queue option I was previously using
>>>>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
>>>>> --mca btl_openib_use_message_coalescing 0), all computations ran
>>>>> flawlessly.
>>>>>
>>>>> As for the reproducer, I've already tried to write something but I
>>>>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>>
>>>>> Eloi
>>>>>
>>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>>
>>>>>> Eloi Gaudry wrote:
>>>>>>
>>>>>>> Terry,
>>>>>>>
>>>>>>> You were right, the error indeed seems to come from the message
>>>>>>> coalescing feature. If I turn it off using the "--mca
>>>>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the
>>>>>>> "hdr->tag=0" error.
>>>>>>>
>>>>>>> There are some trac requests associated with a very similar error
>>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
>>>>>>> all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352
>>>>>>> that might be related), aren't they ? What would you suggest Terry ?
>>>>>>>
>>>>>> Interesting, though it looks to me like the segv in ticket 2352 would
>>>>>> have happened on the send side instead of the receive side like you
>>>>>> have. As to what to do next it would be really nice to have some
>>>>>> sort of reproducer that we can try and debug what is really going
>>>>>> on. The only other thing to do without a reproducer is to inspect
>>>>>> the code on the send side to figure out what might make it generate
>>>>>> a 0 hdr->tag. Or maybe instrument the send side to stop when it is
>>>>>> about ready to send a 0 hdr->tag and see if we can see how the code
>>>>>> got there.
>>>>>>
>>>>>> I might have some cycles to look at this Monday.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>>
>>>>>>> Eloi
>>>>>>>
>>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>>>
>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>
>>>>>>>>> Terry,
>>>>>>>>>
>>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>>>>>
>>>>>>>>> The reason why is quite simple. I've been reading and reading again
>>>>>>>>> this thread to understand the btl_openib_receive_queues meaning and
>>>>>>>>> I can't figure out why the default values seem to induce the
>>>>>>>>> hdr->tag=0 issue
>>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>>>>
>>>>>>>> Yeah, the size of the fragments and number of them really should not
>>>>>>>> cause this issue. So I too am a little perplexed about it.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Do you think that the default shared receive queue parameters are
>>>>>>>>> erroneous for this specific Mellanox card ? Any help on finding the
>>>>>>>>> proper parameters would actually be much appreciated.
>>>>>>>>>
>>>>>>>> I don't necessarily think it is the queue size for a specific card
>>>>>>>> but more so the handling of the queues by the BTL when using
>>>>>>>> certain sizes. At least that is one gut feel I have.
>>>>>>>>
>>>>>>>> In my mind the tag being 0 is either something below OMPI is
>>>>>>>> polluting the data fragment or OMPI's internal protocol is somehow
>>>>>>>> getting messed up. I can imagine (no empirical data here) the
>>>>>>>> queue sizes could change how the OMPI protocol sets things up.
>>>>>>>> Another thing may be the coalescing feature in the openib BTL which
>>>>>>>> tries to gang multiple messages into one packet when resources are
>>>>>>>> running low. I can see where changing the queue sizes might
>>>>>>>> affect the coalescing. So, it might be interesting to turn off the
>>>>>>>> coalescing. You can do that by setting "--mca
>>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
>>>>>>>>
>>>>>>>> If that doesn't solve the issue then obviously there must be
>>>>>>>> something else going on :-).
>>>>>>>>
>>>>>>>> Note, the reason I am interested in this is I am seeing a similar
>>>>>>>> error condition (hdr->tag == 0) on a development system. Though my
>>>>>>>> failing case fails with np=8 using the connectivity test program
>>>>>>>> which is mainly point to point, and there is not a significant
>>>>>>>> amount of data transfer going on either.
>>>>>>>>
>>>>>>>> --td
>>>>>>>>
>>>>>>>>
>>>>>>>>> Eloi
>>>>>>>>>
>>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
>>>>>>>>>
>>>>>>>>>> That is interesting. So does the number of processes affect your
>>>>>>>>>> runs at all? The times I've seen hdr->tag be 0 it has usually been
>>>>>>>>>> due to protocol issues. The tag should never be 0. Have you tried
>>>>>>>>>> receive_queue settings other than the default and the one
>>>>>>>>>> you mention?
>>>>>>>>>>
>>>>>>>>>> I wonder if a combination of the two receive queues causes
>>>>>>>>>> a failure or not. Something like
>>>>>>>>>>
>>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
>>>>>>>>>>
>>>>>>>>>> I am wondering if it is the first queuing definition causing the
>>>>>>>>>> issue or possibly the SRQ defined in the default.
>>>>>>>>>>
>>>>>>>>>> --td
>>>>>>>>>>
>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Terry,
>>>>>>>>>>>
>>>>>>>>>>> The messages being sent/received can be of any size, but the
>>>>>>>>>>> error seems to happen more often with small messages (such as an
>>>>>>>>>>> int being broadcast or allreduced). The failing communication
>>>>>>>>>>> differs from one run to another, but some spots are more likely
>>>>>>>>>>> to fail than others. And as far as I know, they are always
>>>>>>>>>>> located next to a small-message communication (an int being
>>>>>>>>>>> broadcast, for instance). Other typical message sizes are
>>>>>>>>>>> >10k but can be very much larger.
>>>>>>>>>>>
>>>>>>>>>>> I've been checking the hca being used; it's from Mellanox (with
>>>>>>>>>>> vendor_part_id=26428). There are no receive_queues parameters
>>>>>>>>>>> associated with it.
>>>>>>>>>>>
>>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>> # A.k.a. ConnectX
>>>>>>>>>>> [Mellanox Hermon]
>>>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
>>>>>>>>>>> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
>>>>>>>>>>> use_eager_rdma = 1
>>>>>>>>>>> mtu = 2048
>>>>>>>>>>> max_inline_data = 128
>>>>>>>>>>>
>>>>>>>>>>> [..]
>>>>>>>>>>>
>>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
>>>>>>>>>>>
>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
>>>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>>>>>>>>>>>
>>>>>>>>>>> I was wondering if these parameters (automatically computed at
>>>>>>>>>>> openib btl init, as far as I understand) were not incorrect in
>>>>>>>>>>> some way, so I plugged in some other values: "P,65536,256,192,128"
>>>>>>>>>>> (someone on the list used those values when encountering a
>>>>>>>>>>> different issue). Since then, I haven't been able to observe the
>>>>>>>>>>> segfault (occurring as hdr->tag=0 in
>>>>>>>>>>> btl_openib_component.c:2881) yet.
>>>>>>>>>>>
>>>>>>>>>>> Eloi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
>>>>>>>>>>>
>>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what
>>>>>>>>>>>> size of job it is? Does it always fail on the same bcast, or
>>>>>>>>>>>> same process?
>>>>>>>>>>>>
>>>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your suggestions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm now able to get the checksum computed and redirected to
>>>>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5" option,
>>>>>>>>>>>>> you were right). I haven't been able to observe the
>>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
>>>>>>>>>>>>> csum) but I'll let you know when I do.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've got two other questions, which may be related to the error
>>>>>>>>>>>>> observed:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm objects that can be handled by
>>>>>>>>>>>>> OpenMPI somehow depend on the btl being used (i.e. if I'm
>>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects as
>>>>>>>>>>>>> with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2/ the segfaults only appear during an mpi collective call,
>>>>>>>>>>>>> with very small messages (one int being broadcast, for
>>>>>>>>>>>>> instance); i followed the guidelines given at
>>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
>>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
>>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the
>>>>>>>>>>>>> segfaults remain. Does the openib btl handle very small messages
>>>>>>>>>>>>> differently (even with eager_rdma deactivated) than tcp?
>>>>>>>>>>>>>
>>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma?
>>>>>>>>>>>> If so, then that would possibly be one difference between the
>>>>>>>>>>>> openib btl and tcp, aside from the actual protocol used.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> is there a way to make sure that large messages and small
>>>>>>>>>>>>> messages are handled the same way ?
>>>>>>>>>>>>>
>>>>>>>>>>>> Do you mean so they all look like eager messages? How large are
>>>>>>>>>>>> the messages we are talking about here: 1K, 1M or 10M?
>>>>>>>>>>>>
>>>>>>>>>>>> --td
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while
>>>>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to the
>>>>>>>>>>>>>> command line. This will print the checksum details for each
>>>>>>>>>>>>>> fragment sent over the wire. I'm guessing it didn't catch
>>>>>>>>>>>>>> anything because the BTL failed. The checksum verification is
>>>>>>>>>>>>>> done in the PML, which the BTL calls via a callback function.
>>>>>>>>>>>>>> In your case the PML callback is never called because the
>>>>>>>>>>>>>> hdr->tag is invalid. So enabling checksum tracing also might
>>>>>>>>>>>>>> not be of much use. Is it the first Bcast that fails or the
>>>>>>>>>>>>>> nth Bcast and what is the message size? I'm not sure what
>>>>>>>>>>>>>> could be the problem at this moment. I'm afraid you will have
>>>>>>>>>>>>>> to debug the BTL to find out more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks for your response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've been unable so far to write a test case that could
>>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
>>>>>>>>>>>>>>> Actually, I'm only observing this issue when running an
>>>>>>>>>>>>>>> internode computation involving infiniband hardware from
>>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
>>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI use
>>>>>>>>>>>>>>> performed during a parallel computation and I couldn't find
>>>>>>>>>>>>>>> any error so far. The fact that the very
>>>>>>>>>>>>>>> same parallel computation runs flawlessly when using tcp (and
>>>>>>>>>>>>>>> disabling openib support) might seem to indicate that the
>>>>>>>>>>>>>>> issue is somewhere located inside the
>>>>>>>>>>>>>>> openib btl or at the hardware/driver level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't seen
>>>>>>>>>>>>>>> any related messages (when hdr->tag=0 and the segfault
>>>>>>>>>>>>>>> occurs). Any suggestion?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire
>>>>>>>>>>>>>>>> email thread, but do you have a test case which can
>>>>>>>>>>>>>>>> reproduce this error? Without that it will be difficult to
>>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not work for an
>>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce it on an IB
>>>>>>>>>>>>>>>> system. There is also a PML called csum, you can use it via
>>>>>>>>>>>>>>>> "-mca pml csum", which will checksum the MPI messages and
>>>>>>>>>>>>>>>> verify it at the receiver side for any data
>>>>>>>>>>>>>>>> corruption. You can try using it to see if it is able to
>>>>>>>>>>>>>>>> catch anything.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
>>>>>>>>>>>>>>>>> chance to look at this error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Eloi Gaudry
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Free Field Technologies
>>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
>>>>>>>>>>>>>>>>> Company Phone: +32 10 487 959
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>>>> From: Eloi Gaudry <eg_at_[hidden]>
>>>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using
>>>>>>>>>>>>>>>>> openib btl Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at
>>>>>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from
>>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10
>>>>>>>>>>>>>>>>>> --mca btl openib,self --display-map --verbose
>>>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
>>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
>>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
>>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
>>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
>>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch of
>>>>>>>>>>>>>>>>>>>>> "Invalid read" (I'm using 1.4.2 with the suppression
>>>>>>>>>>>>>>>>>>>>> file), "Use of uninitialized bytes" and "Conditional jump
>>>>>>>>>>>>>>>>>>>>> depending on uninitialized bytes" in different ompi
>>>>>>>>>>>>>>>>>>>>> routines. Some of them are located in
>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an output of
>>>>>>>>>>>>>>>>>>>>> valgrind shortly.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected --
>>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its
>>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware of them (and
>>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
>>>>>>>>>>>>>>>>>>>> uninitialized).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> would it help if i use the upcoming 1.5 version of
>>>>>>>>>>>>>>>>>>> openmpi? i read that a huge effort has been done to clean
>>>>>>>>>>>>>>>>>>> up the valgrind output, but maybe that doesn't concern this
>>>>>>>>>>>>>>>>>>> btl (for the reasons you mentioned).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Another question: you said that the callback function
>>>>>>>>>>>>>>>>>>>>> pointer should never be 0. But can the tag be null (hdr->tag)?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was wondering whether its value was allowed to be null.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build
>>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer should
>>>>>>>>>>>>>>>>>>>>>> never be 0.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption
>>>>>>>>>>>>>>>>>>>>>> is occurring.
>>>>>>>>>>>>>>>>>>>>>>
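>>>>>>>>>>>>>>>>>>>>>> The mechanics would be: with hdr->tag == 0 the receive path indexes
>>>>>>>>>>>>>>>>>>>>>> slot 0 of mca_btl_base_active_message_trigger, whose cbfunc was never
>>>>>>>>>>>>>>>>>>>>>> registered, so the call jumps straight to address 0x0. A toy,
>>>>>>>>>>>>>>>>>>>>>> stand-alone illustration of that failure mode (not OpenMPI code):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     #include <stdio.h>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     typedef void (*recv_cb_fn)(void);
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     /* toy stand-in for the callback table: slot 0 is never
>>>>>>>>>>>>>>>>>>>>>>      * registered, so its function pointer stays NULL */
>>>>>>>>>>>>>>>>>>>>>>     static struct { recv_cb_fn cbfunc; void *cbdata; } trigger[256];
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>     int main(void)
>>>>>>>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>>>>>>>         unsigned char tag = 0;   /* the bogus value seen in gdb */
>>>>>>>>>>>>>>>>>>>>>>         printf("cbfunc for tag %u = %p\n",
>>>>>>>>>>>>>>>>>>>>>>                (unsigned) tag, (void *) trigger[tag].cbfunc);
>>>>>>>>>>>>>>>>>>>>>>         /* calling trigger[tag].cbfunc() here would jump to 0x0, which
>>>>>>>>>>>>>>>>>>>>>>          * is exactly the "#0 0x0000000000000000 in ?? ()" frame below */
>>>>>>>>>>>>>>>>>>>>>>         return 0;
>>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>>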
>>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack trace
>>>>>>>>>>>>>>>>>>>>>> looks like you're calling through python, but can you
>>>>>>>>>>>>>>>>>>>>>> run this application through valgrind, or some other
>>>>>>>>>>>>>>>>>>>>>> memory-checking debugger?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the function
>>>>>>>>>>>>>>>>>>>>>>> parameters:
>>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
>>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
>>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
>>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
>>>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
>>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_send_length = 1048576,
>>>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576, btl_min_rdma_pipeline_size = 1060864,
>>>>>>>>>>>>>>>>>>>>>>> btl_exclusivity = 1024, btl_latency = 10, btl_bandwidth = 800,
>>>>>>>>>>>>>>>>>>>>>>> btl_flags = 310, btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
>>>>>>>>>>>>>>>>>>>>>>> btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>, btl_register = 0,
>>>>>>>>>>>>>>>>>>>>>>> btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
>>>>>>>>>>>>>>>>>>>>>>> btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
>>>>>>>>>>>>>>>>>>>>>>> btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
>>>>>>>>>>>>>>>>>>>>>>> btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
>>>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
>>>>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
>>>>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
>>>>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
>>>>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
>>>>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb <mca_btl_base_dump>, btl_mpool = 0xf3f4110,
>>>>>>>>>>>>>>>>>>>>>>> btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
>>>>>>>>>>>>>>>>>>>>>>> btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
>>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
>>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
>>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
>>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
>>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during a
>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault observed during a collective call (using openib):
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>>>> #2 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
>>>>>>>>>>>>>>>>>>>>>>>> #3 0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
>>>>>>>>>>>>>>>>>>>>>>>> #4 0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
>>>>>>>>>>>>>>>>>>>>>>>> #5 0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
>>>>>>>>>>>>>>>>>>>>>>>> #6 0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
>>>>>>>>>>>>>>>>>>>>>>>> #7 0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
>>>>>>>>>>>>>>>>>>>>>>>> #8 0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
>>>>>>>>>>>>>>>>>>>>>>>> #9 0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
>>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
>>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
>>>>>>>>>>>>>>>>>>>>>>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
>>>>>>>>>>>>>>>>>>>>>>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
>>>>>>>>>>>>>>>>>>>>>>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
>>>>>>>>>>>>>>>>>>>>>>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
>>>>>>>>>>>>>>>>>>>>>>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
>>>>>>>>>>>>>>>>>>>>>>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
>>>>>>>>>>>>>>>>>>>>>>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
>>>>>>>>>>>>>>>>>>>>>>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
>>>>>>>>>>>>>>>>>>>>>>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
>>>>>>>>>>>>>>>>>>>>>>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
>>>>>>>>>>>>>>>>>>>>>>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
>>>>>>>>>>>>>>>>>>>>>>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
>>>>>>>>>>>>>>>>>>>>>>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
>>>>>>>>>>>>>>>>>>>>>>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
>>>>>>>>>>>>>>>>>>>>>>>> (gdb) f 1
>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>>>> 2881  reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
>>>>>>>>>>>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>>>> 2881  reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>>>>>>>>>>>>> 2876
>>>>>>>>>>>>>>>>>>>>>>>> 2877  if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
>>>>>>>>>>>>>>>>>>>>>>>> 2878      /* call registered callback */
>>>>>>>>>>>>>>>>>>>>>>>> 2879      mca_btl_active_message_callback_t* reg;
>>>>>>>>>>>>>>>>>>>>>>>> 2880      reg = mca_btl_base_active_message_trigger + hdr->tag;
>>>>>>>>>>>>>>>>>>>>>>>> 2881      reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>>>> 2882      if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
>>>>>>>>>>>>>>>>>>>>>>>> 2883          cqp = (hdr->credits >> 11) & 0x0f;
>>>>>>>>>>>>>>>>>>>>>>>> 2884          hdr->credits &= 0x87ff;
>>>>>>>>>>>>>>>>>>>>>>>> 2885      } else {
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observe was that the
>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later during
>>>>>>>>>>>>>>>>>>>>>>>>> the parallel computation.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could use
>>>>>>>>>>>>>>>>>>>>>>>>> "--mca coll tuned" with "--mca btl self,sm,tcp" so
>>>>>>>>>>>>>>>>>>>>>>>>> that I could check that the issue is not somehow
>>>>>>>>>>>>>>>>>>>>>>>>> limited to the tuned collective routines.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option as
>>>>>>>>>>>>>>>>>>>>>>>>>>> well. the segmentation fault i'm observing always
>>>>>>>>>>>>>>>>>>>>>>>>>>> happened during a collective communication indeed...
>>>>>>>>>>>>>>>>>>>>>>>>>>> it basically switches all collective communication
>>>>>>>>>>>>>>>>>>>>>>>>>>> to basic mode, right?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA ?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand networking card)
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> basic module, e.g.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> observe sometimes a (similar?) problem in the openib
>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective component,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in cases where the ofed libraries were installed but
>>>>>>>>>>>>>>>>>>>>>>>>>>>> no NCA was found on a node. It used to work however
>>>>>>>>>>>>>>>>>>>>>>>>>>>> with the basic component.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that annoying
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault when selecting another bcast
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm. i'm now going to replace MPI_Bcast with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a naive implementation (using MPI_Send and MPI_Recv)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if that helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if the segmentation fault disappears
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when using the basic bcast linear algorithm with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the proper command line you provided.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, you
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to add an extra mca parameter that tells
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the library to use dynamic selection. --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is to use it with ompi_info. Do the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the various collectives. Therefore, you need this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> allowed me to switch to the basic linear algorithm.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway, whatever the algorithm used, the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> diagnose the issue I'm facing?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seems to randomly segfault when using the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> openib btl. I'd like to know if there is any
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way to make OpenMPI switch to a different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm than the default one being selected
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for MPI_Bcast.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation fault during an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> internode parallel computation involving the openib btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be observed with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenMPI-1.3.3).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3D_real_m4_n2.dat'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl (by using --mca btl self,sm,tcp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the command line, for instance), I don't encounter any problem and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the parallel computation runs flawlessly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - understand why this issue is observed only when using the openib btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and not when using self,sm,tcp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the configure scripts of OpenMPI are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some information on the infiniband
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> drivers as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when launching a parallel computation
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> openib,sm,self,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>> users_at_[hidden]
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>>>
>
>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


