Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-09-27 08:36:59


I am thinking of checking the value of *frag->hdr right before the return
in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
It is at line 548 in the trunk:
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
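
Something along these lines (untested and written from memory of the 1.4.x
sources, so the exact placement and the surrounding variable names are
assumptions to adapt) should be enough to stop the sender the moment a zero
tag is about to go out:

    /* Debugging aid only: abort as soon as a fragment with a zero tag is
     * about to be posted, so the send-side stack ends up in a core file.
     * opal_output() comes from opal/util/output.h and abort() from
     * stdlib.h; add the includes if they are not already visible here. */
    if (0 == frag->hdr->tag) {
        opal_output(0, "post_send: hdr->tag == 0 just before posting (frag=%p)",
                    (void *) frag);
        abort();
    }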

--td

Eloi Gaudry wrote:
> Hi Terry,
>
> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (through a terminal) and I cannot use any parallel or sequential debugger (with a call to
> xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but that is all I can think of on my own.
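
[For the handle_wc route you mention, a rough sketch of the kind of check that
could go into the SEND/RDMA_WRITE branch of handle_wc() in
ompi/mca/btl/openib/btl_openib_component.c is below; the to_send_frag()
accessor and the exact spot are assumptions from memory of the 1.4.x sources,
so adapt it to what that branch actually holds. --td]

    case IBV_WC_SEND:
    case IBV_WC_RDMA_WRITE:
        /* untested sketch: log any send/RDMA-write completion whose fragment
         * carries a zero tag, so the faulty send can be correlated with the
         * receive-side crash */
        if (0 == to_send_frag(frag)->hdr->tag) {
            opal_output(0, "handle_wc: completed send fragment has hdr->tag == 0");
        }
        /* ... existing completion handling continues here ... */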
>
> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>
> Regards,
> Eloi
>
>
> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>
>> So it sounds like coalescing is not your issue and that the problem has
>> something to do with the queue sizes. It would be helpful if we could
>> detect the hdr->tag == 0 issue on the sending side and get at least a
>> stack trace. There is something really odd going on here.
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>
>>> Hi Terry,
>>>
>>> I'm sorry to say that I might have missed a point here.
>>>
>>> I've lately been relaunching all previously failing computations with
>>> the message coalescing feature being switched off, and I saw the same
>>> hdr->tag=0 error several times, always during a collective call
>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
>>> soon as I switched to the peer queue option I was previously using
>>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
>>> --mca btl_openib_use_message_coalescing 0), all computations ran
>>> flawlessly.
>>>
>>> As for the reproducer, I've already tried to write something but I
>>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>
>>> Eloi
>>>
>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>
>>>> Eloi Gaudry wrote:
>>>>
>>>>> Terry,
>>>>>
>>>>> You were right, the error indeed seems to come from the message
>>>>> coalescing feature. If I turn it off using the "--mca
>>>>> btl_openib_use_message_coalescing 0", I'm not able to observe the
>>>>> "hdr->tag=0" error.
>>>>>
>>>>> There are some trac requests associated with a very similar error
>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
>>>>> all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which
>>>>> might be related), aren't they? What would you suggest, Terry?
>>>>>
>>>> Interesting, though it looks to me like the segv in ticket 2352 would
>>>> have happened on the send side instead of the receive side like you
>>>> have. As to what to do next, it would be really nice to have some
>>>> sort of reproducer that we can use to debug what is really going
>>>> on. The only other thing to do without a reproducer is to inspect
>>>> the code on the send side to figure out what might make it generate
>>>> a 0 hdr->tag. Or maybe instrument the send side to stop when it is
>>>> about ready to send a 0 hdr->tag and see if we can see how the code
>>>> got there.
>>>>
>>>> I might have some cycles to look at this Monday.
>>>>
>>>> --td
>>>>
>>>>
>>>>> Eloi
>>>>>
>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>
>>>>>> Eloi Gaudry wrote:
>>>>>>
>>>>>>> Terry,
>>>>>>>
>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>>>
>>>>>>> The reason why is quite simple. I've been reading and reading again
>>>>>>> this thread to understand the btl_openib_receive_queues meaning and
>>>>>>> I can't figure out why the default values seem to induce the
>>>>>>> hdr->tag=0 issue
>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>>
>>>>>> Yeah, the size of the fragments and number of them really should not
>>>>>> cause this issue. So I too am a little perplexed about it.
>>>>>>
>>>>>>
>>>>>>> Do you think that the default shared receive queue parameters are
>>>>>>> erroneous for this specific Mellanox card? Any help on finding the
>>>>>>> proper parameters would actually be much appreciated.
>>>>>>>
>>>>>> I don't necessarily think it is the queue size for a specific card but
>>>>>> more so the handling of the queues by the BTL when using certain
>>>>>> sizes. At least that is one gut feel I have.
>>>>>>
>>>>>> In my mind the tag being 0 is either something below OMPI is polluting
>>>>>> the data fragment or OMPI's internal protocol is somehow getting
>>>>>> messed up. I can imagine (no empirical data here) the queue sizes
>>>>>> could change how the OMPI protocol sets things up. Another thing may
>>>>>> be the coalescing feature in the openib BTL which tries to gang
>>>>>> multiple messages into one packet when resources are running low. I
>>>>>> can see where changing the queue sizes might affect the coalescing.
>>>>>> So, it might be interesting to turn off the coalescing. You can do
>>>>>> that by setting "--mca btl_openib_use_message_coalescing 0" in your
>>>>>> mpirun line.
>>>>>>
>>>>>> If that doesn't solve the issue then obviously there must be something
>>>>>> else going on :-).
>>>>>>
>>>>>> Note, the reason I am interested in this is I am seeing a similar
>>>>>> error condition (hdr->tag == 0) on a development system. Though my
>>>>>> failing case fails with np=8 using the connectivity test program
>>>>>> which is mainly point to point, and there is not a significant amount
>>>>>> of data transfers going on either.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>>
>>>>>>> Eloi
>>>>>>>
>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
>>>>>>>
>>>>>>>> That is interesting. So does the number of processes affect your
>>>>>>>> runs at all? The times I've seen hdr->tag be 0 it has usually been
>>>>>>>> due to protocol issues. The tag should never be 0. Have you tried
>>>>>>>> receive_queue settings other than the default and the one
>>>>>>>> you mention?
>>>>>>>>
>>>>>>>> I wonder if a combination of the two receive queues causes a
>>>>>>>> failure or not. Something like
>>>>>>>>
>>>>>>>> P,128,256,192,128:P,65536,256,192,128
>>>>>>>>
>>>>>>>> I am wondering if it is the first queuing definition causing the
>>>>>>>> issue or possibly the SRQ defined in the default.
>>>>>>>>
>>>>>>>> --td
>>>>>>>>
>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>
>>>>>>>>> Hi Terry,
>>>>>>>>>
>>>>>>>>> The messages being sent/received can be of any size, but the error
>>>>>>>>> seems to happen more often with small messages (such as an int being
>>>>>>>>> broadcast or allreduced). The failing communication differs from
>>>>>>>>> one run to another, but some spots are more likely to fail than
>>>>>>>>> others. And as far as I know, they are always located next to a
>>>>>>>>> small-message communication (an int being broadcast, for instance).
>>>>>>>>> Other typical message sizes are >10k but can be very much larger.
>>>>>>>>>
>>>>>>>>> I've been checking the HCA being used; it's from Mellanox (with
>>>>>>>>> vendor_part_id=26428). There is no receive_queues parameter
>>>>>>>>> associated with it.
>>>>>>>>>
>>>>>>>>> I checked the output of $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> # A.k.a. ConnectX
>>>>>>>>> [Mellanox Hermon]
>>>>>>>>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
>>>>>>>>> vendor_part_id =
>>>>>>>>> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
>>>>>>>>> use_eager_rdma = 1
>>>>>>>>> mtu = 2048
>>>>>>>>> max_inline_data = 128
>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
>>>>>>>>>
>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:status:writable
>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
>>>>>>>>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>>>>>>>>>
>>>>>>>>> I was wondering if these parameters (automatically computed at
>>>>>>>>> openib btl init, from what I understood) were not incorrect in some
>>>>>>>>> way, so I plugged in some other values: "P,65536,256,192,128"
>>>>>>>>> (someone on the list used those values when encountering a
>>>>>>>>> different issue). Since then, I haven't been able to observe the
>>>>>>>>> segfault (occurring as hdr->tag = 0 in btl_openib_component.c:2881)
>>>>>>>>> yet.
>>>>>>>>>
>>>>>>>>> Eloi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
>>>>>>>>>
>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
>>>>>>>>>
>>>>>>>>>> Eloi, I am curious about your problem. Can you tell me what size
>>>>>>>>>> of job it is? Does it always fail on the same bcast, or same
>>>>>>>>>> process?
>>>>>>>>>>
>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your suggestions.
>>>>>>>>>>>
>>>>>>>>>>> I'm now able to get the checksum computed and redirected to
>>>>>>>>>>> stdout, thanks (I forgot the "-mca pml_base_verbose 5" option,
>>>>>>>>>>> you were right). I haven't been able to observe the segmentation
>>>>>>>>>>> fault (with hdr->tag=0) so far (when using pml csum) but I'll
>>>>>>>>>>> let you know when I do.
>>>>>>>>>>>
>>>>>>>>>>> I've got two other questions, which may be related to the error
>>>>>>>>>>> observed:
>>>>>>>>>>>
>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled by
>>>>>>>>>>> OpenMPI somehow depend on the btl being used (i.e. if I'm using
>>>>>>>>>>> openib, may I use the same number of MPI_Comm objects as with
>>>>>>>>>>> tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
>>>>>>>>>>>
>>>>>>>>>>> 2/ the segfault only appears during an MPI collective call, with
>>>>>>>>>>> a very small message (one int being broadcast, for instance);
>>>>>>>>>>> I followed the guidelines given at http://icl.cs.utk.edu/open-
>>>>>>>>>>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the
>>>>>>>>>>> debug build of OpenMPI asserts if I use a min-size different than
>>>>>>>>>>> 255. Anyway, if I deactivate eager_rdma, the segfault remains.
>>>>>>>>>>> Does the openib btl handle very small messages differently (even
>>>>>>>>>>> with eager_rdma deactivated) than tcp does?
>>>>>>>>>>>
>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma? If
>>>>>>>>>> so, then that would possibly be one difference between the openib
>>>>>>>>>> btl and tcp, aside from the actual protocol used.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> is there a way to make sure that large messages and small
>>>>>>>>>>> messages are handled the same way?
>>>>>>>>>>>
>>>>>>>>>> Do you mean so they all look like eager messages? How large are the
>>>>>>>>>> messages we are talking about here: 1K, 1M or 10M?
>>>>>>>>>>
>>>>>>>>>> --td
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Eloi
>>>>>>>>>>>
>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug) and while
>>>>>>>>>>>> running with the csum PML add "-mca pml_base_verbose 5" to the
>>>>>>>>>>>> command line. This will print the checksum details for each
>>>>>>>>>>>> fragment sent over the wire. I'm guessing it didn't catch
>>>>>>>>>>>> anything because the BTL failed. The checksum verification is
>>>>>>>>>>>> done in the PML, which the BTL calls via a callback function.
>>>>>>>>>>>> In your case the PML callback is never called because the
>>>>>>>>>>>> hdr->tag is invalid. So enabling checksum tracing also might
>>>>>>>>>>>> not be of much use. Is it the first Bcast that fails or the nth
>>>>>>>>>>>> Bcast and what is the message size? I'm not sure what could be
>>>>>>>>>>>> the problem at this moment. I'm afraid you will have to debug
>>>>>>>>>>>> the BTL to find out more.
>>>>>>>>>>>>
>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks for your response.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been unable so far to write a test case that could
>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
>>>>>>>>>>>>> Actually, I'm only observing this issue when running an
>>>>>>>>>>>>> internode computation involving infiniband hardware from
>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I checked, double-checked, and rechecked again every MPI use
>>>>>>>>>>>>> performed during a parallel computation and I couldn't find any
>>>>>>>>>>>>> error so far. The fact that the very
>>>>>>>>>>>>> same parallel computation runs flawlessly when using tcp (and
>>>>>>>>>>>>> disabling openib support) might seem to indicate that the issue
>>>>>>>>>>>>> is somewhere located inside the
>>>>>>>>>>>>> openib btl or at the hardware/driver level.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't seen
>>>>>>>>>>>>> any related messages (when hdr->tag=0 and the segfault
>>>>>>>>>>>>> occurs). Any suggestions?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Eloi,
>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read the entire
>>>>>>>>>>>>>> email thread, but do you have a test case which can reproduce
>>>>>>>>>>>>>> this error? Without that it will be difficult to nail down
>>>>>>>>>>>>>> the cause. Just to clarify, I do not work for an iwarp
>>>>>>>>>>>>>> vendor. I can certainly try to reproduce it on an IB system.
>>>>>>>>>>>>>> There is also a PML called csum, you can use it via "-mca pml
>>>>>>>>>>>>>> csum", which will checksum the MPI messages and verify it at
>>>>>>>>>>>>>> the receiver side for any data
>>>>>>>>>>>>>> corruption. You can try using it to see if it is able to
>>>>>>>>>>>>>> catch anything.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> --Nysal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Nysal,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
>>>>>>>>>>>>>>> chance to look at this error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Eloi Gaudry
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Free Field Technologies
>>>>>>>>>>>>>>> Company Website: http://www.fft.be
>>>>>>>>>>>>>>> Company Phone: +32 10 487 959
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>> From: Eloi Gaudry <eg_at_[hidden]>
>>>>>>>>>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using openib
>>>>>>>>>>>>>>> btl Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I was wondering if anybody got a chance to have a look at
>>>>>>>>>>>>>>> this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from
>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
>>>>>>>>>>>>>>>> pbn11,pbn10 --mca btl openib,self --display-map --verbose
>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0
>>>>>>>>>>>>>>>> --mca btl_openib_want_fork_support 0 -tag-output
>>>>>>>>>>>>>>>> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmp
>>>>>>>>>>>>>>>> i- valgrind.supp --suppressions=./suppressions.python.supp
>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch of
>>>>>>>>>>>>>>>>>>> "Invalid read" (I'm using 1.4.2 with the suppression
>>>>>>>>>>>>>>>>>>> file), "Use of uninitialized bytes" and "Conditional jump
>>>>>>>>>>>>>>>>>>> depending on uninitialized bytes" in different ompi
>>>>>>>>>>>>>>>>>>> routines. Some of them are located in
>>>>>>>>>>>>>>>>>>> btl_openib_component.c. I'll send you an output of
>>>>>>>>>>>>>>>>>>> valgrind shortly.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be expected --
>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some of its memory,
>>>>>>>>>>>>>>>>>> and therefore valgrind is unaware of them (and therefore
>>>>>>>>>>>>>>>>>> incorrectly marks them as
>>>>>>>>>>>>>>>>>> uninitialized).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> would it help if I used the upcoming 1.5 version of openmpi?
>>>>>>>>>>>>>>>>> I read that a huge effort has been done to clean up the
>>>>>>>>>>>>>>>>> valgrind output, but maybe that doesn't concern this btl (for
>>>>>>>>>>>>>>>>> the reasons you mentioned).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Another question: you said that the callback function
>>>>>>>>>>>>>>>>>>> pointer should never be 0. But can the tag be null
>>>>>>>>>>>>>>>>>>> (hdr->tag)?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an integer.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was worrying that its value could not be null.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to build libpython
>>>>>>>>>>>>>>>>> without pymalloc first).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function pointer should
>>>>>>>>>>>>>>>>>>>> never be 0.
>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory corruption is
>>>>>>>>>>>>>>>>>>>> occurring.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because the stack trace
>>>>>>>>>>>>>>>>>>>> looks like you're calling through python, but can you
>>>>>>>>>>>>>>>>>>>> run this application through valgrind, or some other
>>>>>>>>>>>>>>>>>>>> memory-checking debugger?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values of the function parameters:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
>>>>>>>>>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_send_length = 1048576,
>>>>>>>>>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576,
>>>>>>>>>>>>>>>>>>>>> btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
>>>>>>>>>>>>>>>>>>>>> btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
>>>>>>>>>>>>>>>>>>>>> btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
>>>>>>>>>>>>>>>>>>>>> btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>,
>>>>>>>>>>>>>>>>>>>>> btl_register = 0, btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
>>>>>>>>>>>>>>>>>>>>> btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
>>>>>>>>>>>>>>>>>>>>> btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
>>>>>>>>>>>>>>>>>>>>> btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
>>>>>>>>>>>>>>>>>>>>> btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
>>>>>>>>>>>>>>>>>>>>> btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
>>>>>>>>>>>>>>>>>>>>> btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
>>>>>>>>>>>>>>>>>>>>> btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
>>>>>>>>>>>>>>>>>>>>> btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
>>>>>>>>>>>>>>>>>>>>> btl_dump = 0x2b341acd45cb <mca_btl_base_dump>, btl_mpool = 0xf3f4110,
>>>>>>>>>>>>>>>>>>>>> btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
>>>>>>>>>>>>>>>>>>>>> btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
>>>>>>>>>>>>>>>>>>>>> (gdb) print des
>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file generated during a
>>>>>>>>>>>>>>>>>>>>>> segmentation fault observed during a collective call
>>>>>>>>>>>>>>>>>>>>>> (using openib):
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>> #2  0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0, wc=0x7ffff279ce90) at btl_openib_component.c:3178
>>>>>>>>>>>>>>>>>>>>>> #3  0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2) at btl_openib_component.c:3318
>>>>>>>>>>>>>>>>>>>>>> #4  0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0) at btl_openib_component.c:3426
>>>>>>>>>>>>>>>>>>>>>> #5  0x00002aedbc4e3561 in btl_openib_component_progress () at btl_openib_component.c:3451
>>>>>>>>>>>>>>>>>>>>>> #6  0x00002aedb8b22ab8 in opal_progress () at runtime/opal_progress.c:207
>>>>>>>>>>>>>>>>>>>>>> #7  0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20) at ../opal/threads/condition.h:99
>>>>>>>>>>>>>>>>>>>>>> #8  0x00002aedb859fa31 in ompi_request_default_wait_all (count=2, requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
>>>>>>>>>>>>>>>>>>>>>> #9  0x00002aedbd7559ad in ompi_coll_tuned_allreduce_intra_recursivedoubling (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_allreduce.c:223
>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0, module=0x19d82b20) at coll_tuned_decision_fixed.c:63
>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
>>>>>>>>>>>>>>>>>>>>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1, datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0) at stubs.cpp:626
>>>>>>>>>>>>>>>>>>>>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf={<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px = 0x199942a4, pn = {pi_ = 0x6}}}, <No data fields>}) at interface.cpp:371
>>>>>>>>>>>>>>>>>>>>>> #14 0x00000000040cb858 in FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2, set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}}, check_info=@0x7ffff279d7f0) at check.cpp:63
>>>>>>>>>>>>>>>>>>>>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at check.cpp:159
>>>>>>>>>>>>>>>>>>>>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
>>>>>>>>>>>>>>>>>>>>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*, _object*)>::exec<_object> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050, po2=0x19d2e950) at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
>>>>>>>>>>>>>>>>>>>>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0, args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
>>>>>>>>>>>>>>>>>>>>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90, throwflag=<value optimized out>) at Python/ceval.c:3921
>>>>>>>>>>>>>>>>>>>>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50, globals=<value optimized out>, locals=<value optimized out>, args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2, defs=0x2aaab75e4800, defcount=2, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value optimized out>, locals=<value optimized out>, args=0x7, argcount=1, kws=0x19acc418, kwcount=3, defs=0x2aaab759e958, defcount=6, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value optimized out>, locals=<value optimized out>, args=0x6, argcount=1, kws=0x19abd328, kwcount=5, defs=0x2aaab891b7e8, defcount=3, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value optimized out>, locals=<value optimized out>, args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value optimized out>, locals=<value optimized out>, args=0x1, argcount=0, kws=0x19a89330, kwcount=0, defs=0x2aaab8b66668, defcount=1, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0, throwflag=<value optimized out>) at Python/ceval.c:3802
>>>>>>>>>>>>>>>>>>>>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0, globals=0x0, locals=0x190d9700) at Python/ceval.c:522
>>>>>>>>>>>>>>>>>>>>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8 "DIRECT.Actran.main()", start=<value optimized out>, globals=0x192213d0, locals=0x192213d0, flags=0x0) at Python/pythonrun.c:1335
>>>>>>>>>>>>>>>>>>>>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0) at Python/pythonrun.c:957
>>>>>>>>>>>>>>>>>>>>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application (this=0x7ffff279f650) at fempy.cpp:873
>>>>>>>>>>>>>>>>>>>>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run (this=0x7ffff279f650) at batch.cpp:374
>>>>>>>>>>>>>>>>>>>>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10
>>>>>>>>>>>>>>>>>>>>>> (gdb) f 1
>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>> Current language:  auto; currently c
>>>>>>>>>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700, byte_len=18) at btl_openib_component.c:2881
>>>>>>>>>>>>>>>>>>>>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>> (gdb) l
>>>>>>>>>>>>>>>>>>>>>> 2876
>>>>>>>>>>>>>>>>>>>>>> 2877            if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
>>>>>>>>>>>>>>>>>>>>>> 2878                /* call registered callback */
>>>>>>>>>>>>>>>>>>>>>> 2879                mca_btl_active_message_callback_t* reg;
>>>>>>>>>>>>>>>>>>>>>> 2880                reg = mca_btl_base_active_message_trigger + hdr->tag;
>>>>>>>>>>>>>>>>>>>>>> 2881                reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
>>>>>>>>>>>>>>>>>>>>>> 2882                if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
>>>>>>>>>>>>>>>>>>>>>> 2883                    cqp = (hdr->credits >> 11) & 0x0f;
>>>>>>>>>>>>>>>>>>>>>> 2884                    hdr->credits &= 0x87ff;
>>>>>>>>>>>>>>>>>>>>>> 2885                } else {
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observe was that the
>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes later during
>>>>>>>>>>>>>>>>>>>>>>> the parallel computation.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could use the
>>>>>>>>>>>>>>>>>>>>>>> "--mca coll tuned" with "--mca self,sm,tcp" so that I
>>>>>>>>>>>>>>>>>>>>>>> could check that the issue is not somehow limited to
>>>>>>>>>>>>>>>>>>>>>>> the tuned collective routines.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option as
>>>>>>>>>>>>>>>>>>>>>>>>> well. the segmentation fault i'm observing always
>>>>>>>>>>>>>>>>>>>>>>>>> happened during a collective communication indeed...
>>>>>>>>>>>>>>>>>>>>>>>>> it basically switches all collective communication
>>>>>>>>>>>>>>>>>>>>>>>>> to basic mode, right?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's an NCA?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand networking
>>>>>>>>>>>>>>>>>>>>>>>> card)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in the
>>>>>>>>>>>>>>>>>>>>>>>>>> basic module, e.g.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used to
>>>>>>>>>>>>>>>>>>>>>>>>>> observe sometimes a (similar?) problem in the openib
>>>>>>>>>>>>>>>>>>>>>>>>>> btl triggered from the tuned collective component, in
>>>>>>>>>>>>>>>>>>>>>>>>>> cases where the ofed libraries were installed but
>>>>>>>>>>>>>>>>>>>>>>>>>> no NCA was found on a node. It used to work
>>>>>>>>>>>>>>>>>>>>>>>>>> however with the basic component.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, i couldn't get rid of that
>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. i'm now going to
>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive
>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv) and
>>>>>>>>>>>>>>>>>>>>>>>>>>> see if that helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules option.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if the segmentation fault disappears when
>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the basic bcast linear algorithm with the
>>>>>>>>>>>>>>>>>>>>>>>>>>>> proper command line you provided.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, you
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> need to add an extra mca parameter that tells
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the library to use dynamic selection. --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is to use it with ompi_info. Do the following:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the various collectives. Therefore, you need
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1" allowed to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear algorithm. Anyway
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> whatever the algorithm used, the segmentation
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault remains.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to diagnose
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the issue I'm facing?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> seems to randomly segfault when using the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> openib btl. I'd like to know if there is any
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way to make OpenMPI switch to a different
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> algorithm than the default one being selected
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for MPI_Bcast.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation fault during
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> an internode parallel computation involving the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> openib btl and OpenMPI-1.4.2 (the same issue can
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be observed with OpenMPI-1.3.3).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/help/
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl (by using
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca btl self,sm,tcp on the command line, for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> instance), I don't encounter any problem and the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parallel computation runs flawlessly.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - to understand why this issue is observed only when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl and not when using self,sm,tcp
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the configure scripts of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> OpenMPI are enclosed to this email, and some
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> information on the infiniband drivers as well.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when launching a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> parallel computation using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map --verbose --version --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose --version --mca
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


