
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [openib] segfault when using openib btl
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-09-24 08:27:07


That is interesting. So does the number of processes affect your runs
at all? The times I've seen hdr->tag be 0, it has usually been due to
protocol issues. The tag should never be 0. Have you tried any
receive_queues settings other than the default and the one you mention?
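
For reference, here is a minimal, self-contained sketch of why a zero tag ends up as a jump to address 0x0. The types and table below are stand-ins, not Open MPI's actual declarations; the real dispatch is the reg->cbfunc() call at btl_openib_component.c:2881 quoted further down this thread, and your gdb output shows reg->cbfunc == 0 when hdr->tag == 0.

    #include <stddef.h>

    typedef void (*recv_cb_fn)(void);
    struct callback_entry { recv_cb_fn cbfunc; void *cbdata; };

    /* stand-in for mca_btl_base_active_message_trigger[]: nothing was
     * registered at index 0, so that entry's cbfunc stays NULL */
    static struct callback_entry trigger[256];

    int main(void)
    {
        unsigned char tag = 0;                  /* a corrupted header: hdr->tag == 0 */
        struct callback_entry *reg = trigger + tag;
        reg->cbfunc();                          /* call through a NULL pointer -> SIGSEGV,
                                                   i.e. frame #0 at 0x0000000000000000 */
        return 0;
    }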

I wonder whether a combination of the two receive queue definitions
causes a failure or not. Something like

P,128,256,192,128:P,65536,256,192,128

I am wondering whether it is the first queue definition causing the issue, or possibly the SRQ defined in the default.
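
If you want to give that a try, something like the following command line should do it (the process count, host file, and application name below are placeholders; the --mca syntax is the same one used elsewhere in this thread):

    mpirun -np <nprocs> --hostfile host.list \
        --mca btl openib,sm,self \
        --mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128 \
        <your_application> [...]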

--td

Eloi Gaudry wrote:
> Hi Terry,
>
> The messages being sent/received can be of any size, but the error seems to happen more often with small messages (such as an int being broadcast or allreduced).
> The failing communication differs from one run to another, but some spots are more likely to fail than others. As far as I know, they are always located next to a small-message
> communication (an int being broadcast, for instance). Other typical message sizes are >10k but they can be much larger.
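
(For anyone trying to reproduce this: a minimal test case exercising the pattern Eloi describes above, a single int repeatedly broadcast and allreduced, could look like the sketch below. It is only an illustration of that pattern; as noted elsewhere in the thread, a standalone test case has not managed to reproduce the crash so far.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value, sum, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 100000; i++) {
            value = (rank == 0) ? i : -1;   /* small payload: one int */
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        }
        if (rank == 0)
            printf("done: last value %d, last sum %d\n", value, sum);
        MPI_Finalize();
        return 0;
    }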
>
> I've been checking the HCA being used; it's from Mellanox (with vendor_part_id=26428). There is no receive_queues parameter associated with it in the device parameters file:
> $ cat share/openmpi/mca-btl-openib-device-params.ini
> [...]
> # A.k.a. ConnectX
> [Mellanox Hermon]
> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
> vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
> use_eager_rdma = 1
> mtu = 2048
> max_inline_data = 128
> [..]
>
> $ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> I was wondering if these parameters (automatically computed at openib btl init, as far as I understood) were somehow incorrect, so I plugged in other values: "P,65536,256,192,128" (someone on
> the list used that value when encountering a different issue). Since then, I haven't been able to observe the segfault (occurring as hdr->tag = 0 in btl_openib_component.c:2881) yet.
>
> Eloi
>
>
> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
>
> On Thursday 23 September 2010 23:33:48 Terry Dontje wrote:
>
>> Eloi, I am curious about your problem. Can you tell me what size of job
>> it is? Does it always fail on the same bcast, or same process?
>>
>> Eloi Gaudry wrote:
>>
>>> Hi Nysal,
>>>
>>> Thanks for your suggestions.
>>>
>>> I'm now able to get the checksum computed and redirected to stdout,
>>> thanks (I forgot the "-mca pml_base_verbose 5" option, you were right).
>>> I haven't been able to observe the segmentation fault (with hdr->tag=0)
>>> so far (when using pml csum), but I'll let you know if I do.
>>>
>>> I've got two others question, which may be related to the error observed:
>>>
>>> 1/ does the maximum number of MPI_Comm objects that can be handled by OpenMPI
>>> somehow depend on the btl being used (i.e. if I'm using openib, may I
>>> use the same number of MPI_Comm objects as with tcp)? Is there something
>>> like MPI_COMM_MAX in OpenMPI?
>>>
>>> 2/ the segfault only appears during an MPI collective call, with a very
>>> small message (one int being broadcast, for instance); I followed
>>> the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug build
>>> of OpenMPI asserts if I use a min-size other than 255. Anyway, if I
>>> deactivate eager_rdma, the segfault remains. Does the openib btl handle
>>> very small messages differently (even with eager_rdma deactivated) than
>>> tcp?
>>>
>> Others on the list: does coalescing happen without eager_rdma? If so,
>> then that would possibly be one difference between the openib btl and
>> tcp, aside from the actual protocol used.
>>
>>
>>> is there a way to make sure that large messages and small messages are
>>> handled the same way ?
>>>
>> Do you mean so that they all look like eager messages? How large are the
>> messages we are talking about here: 1K, 1M, or 10M?
>>
>> --td
>>
>>
>>> Regards,
>>> Eloi
>>>
>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
>>>
>>>> Hi Eloi,
>>>> Create a debug build of OpenMPI (--enable-debug) and while running with
>>>> the csum PML add "-mca pml_base_verbose 5" to the command line. This
>>>> will print the checksum details for each fragment sent over the wire.
>>>> I'm guessing it didn't catch anything because the BTL failed. The
>>>> checksum verification is done in the PML, which the BTL calls via a
>>>> callback function. In your case the PML callback is never called
>>>> because the hdr->tag is invalid. So enabling checksum tracing also
>>>> might not be of much use. Is it the first Bcast that fails or the nth
>>>> Bcast and what is the message size? I'm not sure what could be the
>>>> problem at this moment. I'm afraid you will have to debug the BTL to
>>>> find out more.
>>>>
>>>> --Nysal
>>>>
>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>
>>>>> Hi Nysal,
>>>>>
>>>>> thanks for your response.
>>>>>
>>>>> I've been unable so far to write a test case that could illustrate the
>>>>> hdr->tag=0 error.
>>>>> Actually, I'm only observing this issue when running an internode
>>>>> computation involving infiniband hardware from Mellanox (MT25418,
>>>>> ConnectX IB DDR, PCIe 2.0
>>>>> 2.5GT/s, rev a0) with our time-domain software.
>>>>>
>>>>> I checked, double-checked, and rechecked again every MPI use performed
>>>>> during a parallel computation and I couldn't find any error so far. The
>>>>> fact that the very
>>>>> same parallel computation runs flawlessly when using tcp (and disabling
>>>>> openib support) might seem to indicate that the issue is somewhere
>>>>> located inside the
>>>>> openib btl or at the hardware/driver level.
>>>>>
>>>>> I've just used the "-mca pml csum" option and I haven't seen any
>>>>> related messages (when hdr->tag=0 and the segfault occurs).
>>>>> Any suggestions?
>>>>>
>>>>> Regards,
>>>>> Eloi
>>>>>
>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan wrote:
>>>>>
>>>>>> Hi Eloi,
>>>>>> Sorry for the delay in response. I haven't read the entire email
>>>>>> thread, but do you have a test case which can reproduce this error?
>>>>>> Without that it will be difficult to nail down the cause. Just to
>>>>>> clarify, I do not work for an iwarp vendor. I can certainly try to
>>>>>> reproduce it on an IB system. There is also a PML called csum, you can
>>>>>> use it via "-mca pml csum", which will checksum the MPI messages and
>>>>>> verify it at the receiver side for any data corruption. You can try
>>>>>> using it to see if it is able to
>>>>>> catch anything.
>>>>>>
>>>>>> Regards
>>>>>> --Nysal
>>>>>>
>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <eg_at_[hidden]> wrote:
>>>>>>
>>>>>>> Hi Nysal,
>>>>>>>
>>>>>>> I'm sorry to interrupt, but I was wondering if you had a chance to look at
>>>>>>> this error.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eloi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>> Eloi Gaudry
>>>>>>>
>>>>>>> Free Field Technologies
>>>>>>> Company Website: http://www.fft.be
>>>>>>> Company Phone: +32 10 487 959
>>>>>>>
>>>>>>>
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: Eloi Gaudry <eg_at_[hidden]>
>>>>>>> To: Open MPI Users <users_at_[hidden]>
>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using openib btl
>>>>>>> Hi,
>>>>>>>
>>>>>>> I was wondering if anybody got a chance to have a look at this issue.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eloi
>>>>>>>
>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
>>>>>>>
>>>>>>>> Hi Jeff,
>>>>>>>>
>>>>>>>> Please find enclosed the output (valgrind.out.gz) from
>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
>>>>>>>> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
>>>>>>>> btl_openib_want_fork_support 0 -tag-output
>>>>>>>> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-
>>>>>>>> valgrind.supp --suppressions=./suppressions.python.supp
>>>>>>>> /opt/actran/bin/actranpy_mp ...
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Eloi
>>>>>>>>
>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
>>>>>>>>
>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
>>>>>>>>>
>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
>>>>>>>>>>
>>>>>>>>>>> I did run our application through valgrind but it couldn't
>>>>>>>>>>> find any "Invalid write": there is a bunch of "Invalid read"
>>>>>>>>>>> (I'm using 1.4.2
>>>>>>>>>>> with the suppression file), "Use of uninitialized bytes" and
>>>>>>>>>>> "Conditional jump depending on uninitialized bytes" in different ompi
>>>>>>>>>>> routines. Some of them are located in btl_openib_component.c.
>>>>>>>>>>> I'll send you an output of valgrind shortly.
>>>>>>>>>>>
>>>>>>>>>> A lot of them in btl_openib_* are to be expected -- OpenFabrics
>>>>>>>>>> uses OS-bypass methods for some of its memory, and therefore
>>>>>>>>>> valgrind is unaware of them (and therefore incorrectly marks
>>>>>>>>>> them as
>>>>>>>>>> uninitialized).
>>>>>>>>>>
>>>>>>>>> would it help if I use the upcoming 1.5 version of openmpi? I read that
>>>>>>>>> a huge effort has been done to clean up the valgrind output, but
>>>>>>>>> maybe that doesn't concern this btl (for the reasons you
>>>>>>>>> mentioned).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Another question: you said that the callback function pointer should
>>>>>>>>>>> never be 0. But can the tag be null (hdr->tag)?
>>>>>>>>>>>
>>>>>>>>>> The tag is not a pointer -- it's just an integer.
>>>>>>>>>>
>>>>>>>>> I was wondering whether its value could be null.
>>>>>>>>>
>>>>>>>>> I'll send a valgrind output soon (i need to build libpython
>>>>>>>>> without pymalloc first).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Eloi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>> Eloi
>>>>>>>>>>>
>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the delay in replying.
>>>>>>>>>>>>
>>>>>>>>>>>> Odd; the values of the callback function pointer should never be 0.
>>>>>>>>>>>> This seems to suggest some kind of memory corruption is
>>>>>>>>>>>> occurring.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know if it's possible, because the stack trace looks
>>>>>>>>>>>> like you're calling through python, but can you run this
>>>>>>>>>>>> application through valgrind, or some other memory-checking
>>>>>>>>>>>> debugger?
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> sorry, I just forgot to add the values of the function parameters:
>>>>>>>>>>>>> (gdb) print reg->cbdata
>>>>>>>>>>>>> $1 = (void *) 0x0
>>>>>>>>>>>>> (gdb) print openib_btl->super
>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit =
>>>>>>>>>>>>>
>>>>> 12288,
>>>>>
>>>>>
>>>>>>>>>>>>> btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
>>>>>>>>>>>>> btl_rdma_pipeline_send_length = 1048576,
>>>>>>>>>>>>>
>>>>>>>>>>>>> btl_rdma_pipeline_frag_size = 1048576,
>>>>>>>>>>>>>
>>>>>>> btl_min_rdma_pipeline_size
>>>>>>>
>>>>>>>
>>>>>>>>>>>>> = 1060864, btl_exclusivity = 1024, btl_latency = 10,
>>>>>>>>>>>>> btl_bandwidth = 800, btl_flags = 310, btl_add_procs =
>>>>>>>>>>>>> 0x2b341eb8ee47<mca_btl_openib_add_procs>, btl_del_procs =
>>>>>>>>>>>>> 0x2b341eb90156<mca_btl_openib_del_procs>, btl_register =
>>>>>>>>>>>>> 0, btl_finalize =
>>>>>>>>>>>>> 0x2b341eb93186<mca_btl_openib_finalize>,
>>>>>>>>>>>>>
>>>>>>> btl_alloc
>>>>>>>
>>>>>>>
>>>>>>>>>>>>> = 0x2b341eb90a3e<mca_btl_openib_alloc>, btl_free =
>>>>>>>>>>>>> 0x2b341eb91400<mca_btl_openib_free>, btl_prepare_src =
>>>>>>>>>>>>> 0x2b341eb91813<mca_btl_openib_prepare_src>,
>>>>>>>>>>>>> btl_prepare_dst
>>>>>>>>>>>>>
>>>>> =
>>>>>
>>>>>
>>>>>>>>>>>>> 0x2b341eb91f2e<mca_btl_openib_prepare_dst>, btl_send =
>>>>>>>>>>>>> 0x2b341eb94517<mca_btl_openib_send>, btl_sendi =
>>>>>>>>>>>>> 0x2b341eb9340d<mca_btl_openib_sendi>, btl_put =
>>>>>>>>>>>>> 0x2b341eb94660<mca_btl_openib_put>, btl_get =
>>>>>>>>>>>>> 0x2b341eb94c4e<mca_btl_openib_get>, btl_dump =
>>>>>>>>>>>>> 0x2b341acd45cb<mca_btl_base_dump>, btl_mpool = 0xf3f4110,
>>>>>>>>>>>>> btl_register_error =
>>>>>>>>>>>>> 0x2b341eb90565<mca_btl_openib_register_error_cb>,
>>>>>>>>>>>>> btl_ft_event
>>>>>>>>>>>>>
>>>>>>> =
>>>>>>>
>>>>>>>
>>>>>>>>>>>>> 0x2b341eb952e7<mca_btl_openib_ft_event>}
>>>>>>>>>>>>>
>>>>>>>>>>>>> (gdb) print hdr->tag
>>>>>>>>>>>>> $3 = 0 '\0'
>>>>>>>>>>>>> (gdb) print des
>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
>>>>>>>>>>>>> (gdb) print reg->cbfunc
>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the output of a core file generated during a segmentation
>>>>>>>>>>>>>> fault observed during a collective call (using openib):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>> (gdb) where
>>>>>>>>>>>>>> #0 0x0000000000000000 in ?? ()
>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming
>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
>>>>>>>>>>>>>> byte_len=18) at btl_openib_component.c:2881 #2
>>>>>>>>>>>>>> 0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0,
>>>>>>>>>>>>>> wc=0x7ffff279ce90) at
>>>>>>>>>>>>>> btl_openib_component.c:3178 #3 0x00002aedbc4e2e9d in
>>>>>>>>>>>>>>
>>>>>>> poll_device
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at
>>>>>>>>>>>>>> btl_openib_component.c:3318
>>>>>>>>>>>>>>
>>>>> #4
>>>>>
>>>>>
>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device
>>>>>>>>>>>>>>
>>>>> (device=0x19024ac0)
>>>>>
>>>>>
>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5 0x00002aedbc4e3561 in
>>>>>>>>>>>>>> btl_openib_component_progress () at
>>>>>>>>>>>>>> btl_openib_component.c:3451
>>>>>>>>>>>>>>
>>>>>>> #6
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at
>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7 0x00002aedb859f497 in
>>>>>>>>>>>>>> opal_condition_wait (c=0x2aedb888ccc0, m=0x2aedb888cd20)
>>>>>>>>>>>>>> at ../opal/threads/condition.h:99 #8
>>>>>>>>>>>>>> 0x00002aedb859fa31 in ompi_request_default_wait_all
>>>>>>>>>>>>>>
>>>>> (count=2,
>>>>>
>>>>>
>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at
>>>>>>>>>>>>>> request/req_wait.c:262 #9 0x00002aedbd7559ad in
>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_recursivedoubling
>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1,
>>>>>>>>>>>>>> dtype=0x6788220, op=0x6787a20,
>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
>>>>>>>>>>>>>>
>>>>>>> coll_tuned_allreduce.c:223
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in
>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed
>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1,
>>>>>>>>>>>>>> dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0,
>>>>>>>>>>>>>> module=0x19d82b20) at
>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63
>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce
>>>>>>>>>>>>>>
>>>>>>> (sendbuf=0x7ffff279d444,
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1, datatype=0x6788220,
>>>>>>>>>>>>>>
>>>>>>> op=0x6787a20,
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12
>>>>>>>>>>>>>> 0x0000000004387dbf
>>>>>>>>>>>>>>
>>>>> in
>>>>>
>>>>>
>>>>>>>>>>>>>> FEMTown::MPI::Allreduce (sendbuf=0x7ffff279d444,
>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1, datatype=0x6788220,
>>>>>>>>>>>>>>
>>>>>>> op=0x6787a20,
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13 0x0000000004058be8
>>>>>>>>>>>>>> in FEMTown::Domain::align (itf=
>>>>>>>>>>>>>>
>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Int
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> er fa ce>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px =
>>>>>>>>>>>>>> 0x199942a4, pn = {pi_ = 0x6}}},<No data fields>}) at
>>>>>>>>>>>>>> interface.cpp:371 #14 0x00000000040cb858 in
>>>>>>>>>>>>>> FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2,
>>>>>>>>>>>>>>
>>>>>>> set={px
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}},
>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at check.cpp:63 #15
>>>>>>>>>>>>>>
>>>>>>> 0x00000000040cbfa8
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> in FEMTown::Field::align_elements (set={px =
>>>>>>>>>>>>>> 0x7ffff279d950, pn
>>>>>>>>>>>>>>
>>>>>>> =
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> {pi_ = 0x66e08d0}}, check_info=@0x7ffff279d7f0) at
>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in
>>>>>>>>>>>>>> PyField_align_elements (self=0x0, args=0x2aaab0765050,
>>>>>>>>>>>>>> kwds=0x19d2e950) at check.cpp:31 #17 0x0000000001fbf76d in
>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object* (*)(_object*, _object*,
>>>>>>>>>>>>>> _object*)>::exec<_object>
>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050,
>>>>>>>>>>>>>> po2=0x19d2e950) at
>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
>>>>>>>>>>>>>>
>>>>> #18
>>>>>
>>>>>
>>>>>>>>>>>>>> 0x00000000039acc82 in PyField_align_elements_ewrap
>>>>>>>>>>>>>>
>>>>> (self=0x0,
>>>>>
>>>>>
>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39 #19
>>>>>>>>>>>>>> 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90,
>>>>>>>>>>>>>> throwflag=<value optimized out>) at Python/ceval.c:3921
>>>>>>>>>>>>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx
>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value optimized out>,
>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3, argcount=1,
>>>>>>>>>>>>>> kws=0x19ace4a0, kwcount=2,
>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2, closure=0x0) at
>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>> Python/ceval.c:3802 #22 0x000000000440aae9 in
>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120, globals=<value
>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, args=0x7,
>>>>>>>>>>>>>> argcount=1, kws=0x19acc418, kwcount=3,
>>>>>>>>>>>>>> defs=0x2aaab759e958, defcount=6, closure=0x0) at
>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>> Python/ceval.c:3802 #24 0x000000000440aae9 in
>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738, globals=<value
>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, args=0x6,
>>>>>>>>>>>>>> argcount=1, kws=0x19abd328, kwcount=5,
>>>>>>>>>>>>>> defs=0x2aaab891b7e8, defcount=3, closure=0x0) at
>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>> Python/ceval.c:3802 #26 0x000000000440aae9 in
>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198, globals=<value
>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, args=0xb,
>>>>>>>>>>>>>> argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0,
>>>>>>>>>>>>>> defcount=0, closure=0x0) at Python/ceval.c:2968
>>>>>>>>>>>>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value optimized out>) at
>>>>>>>>>>>>>> Python/ceval.c:3802 #28 0x000000000440aae9 in
>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288, globals=<value
>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, args=0x1,
>>>>>>>>>>>>>> argcount=0, kws=0x19a89330, kwcount=0,
>>>>>>>>>>>>>> defs=0x2aaab8b66668, defcount=1, closure=0x0) at
>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx
>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value optimized out>) at
>>>>>>>>>>>>>> Python/ceval.c:3802 #30 0x000000000440aae9 in
>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738, globals=<value
>>>>>>>>>>>>>> optimized out>, locals=<value optimized out>, args=0x0,
>>>>>>>>>>>>>> argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0,
>>>>>>>>>>>>>> closure=0x0) at
>>>>>>>>>>>>>> Python/ceval.c:2968
>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0,
>>>>>>>>>>>>>> globals=0x0, locals=0x190d9700) at Python/ceval.c:522 #32
>>>>>>>>>>>>>> 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8
>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value optimized out>,
>>>>>>>>>>>>>> globals=0x192213d0, locals=0x192213d0, flags=0x0) at
>>>>>>>>>>>>>> Python/pythonrun.c:1335 #33 0x0000000004429690 in
>>>>>>>>>>>>>> PyRun_SimpleStringFlags (command=0x192fd3d8
>>>>>>>>>>>>>> "DIRECT.Actran.main()", flags=0x0) at
>>>>>>>>>>>>>> Python/pythonrun.c:957 #34 0x0000000001fa1cf9 in
>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application
>>>>>>>>>>>>>>
>>>>> (this=0x7ffff279f650)
>>>>>
>>>>>
>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in
>>>>>>>>>>>>>>
>>>>>>> FEMTown::Main::Batch::run
>>>>>>>
>>>>>>>
>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36
>>>>>>>>>>>>>>
>>>>> 0x0000000001f9aa25
>>>>>
>>>>>
>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at main.cpp:10 (gdb)
>>>>>>>>>>>>>> f 1 #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming
>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
>>>>>>>>>>>>>> byte_len=18) at btl_openib_component.c:2881 2881
>>>>>>>>>>>>>> reg->cbfunc( &openib_btl->super, hdr->tag, des,
>>>>>>>>>>>>>> reg->cbdata
>>>>>>>>>>>>>>
>>>>> );
>>>>>
>>>>>
>>>>>>>>>>>>>> Current language: auto; currently c
>>>>>>>>>>>>>> (gdb)
>>>>>>>>>>>>>> #1 0x00002aedbc4e05f4 in btl_openib_handle_incoming
>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
>>>>>>>>>>>>>> byte_len=18) at btl_openib_component.c:2881 2881
>>>>>>>>>>>>>> reg->cbfunc( &openib_btl->super, hdr->tag, des,
>>>>>>>>>>>>>> reg->cbdata
>>>>>>>>>>>>>>
>>>>> );
>>>>>
>>>>>
>>>>>>>>>>>>>> (gdb) l 2876
>>>>>>>>>>>>>> 2877 if(OPAL_LIKELY(!(is_credit_msg =
>>>>>>>>>>>>>> is_credit_message(frag)))) { 2878 /* call
>>>>>>>>>>>>>> registered callback */
>>>>>>>>>>>>>> 2879 mca_btl_active_message_callback_t* reg;
>>>>>>>>>>>>>> 2880 reg = mca_btl_base_active_message_trigger
>>>>>>>>>>>>>> + hdr->tag; 2881
>>>>>>>>>>>>>> reg->cbfunc(&openib_btl->super, hdr->tag, des,
>>>>>>>>>>>>>> reg->cbdata ); 2882
>>>>>>>>>>>>>> if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) { 2883
>>>>>>>>>>>>>> cqp
>>>>>>>>>>>>>>
>>>>> =
>>>>>
>>>>>
>>>>>>>>>>>>>> (hdr->credits>> 11)& 0x0f;
>>>>>>>>>>>>>> 2884 hdr->credits&= 0x87ff;
>>>>>>>>>>>>>> 2885 } else {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Edgar,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only difference I could observe was that the
>>>>>>>>>>>>>>> segmentation fault sometimes appeared later during the
>>>>>>>>>>>>>>> parallel computation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could use "--mca coll
>>>>>>>>>>>>>>> tuned" with "--mca btl self,sm,tcp" so that I could check
>>>>>>>>>>>>>>> that the issue is not somehow limited to the tuned
>>>>>>>>>>>>>>> collective routines.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> hi edgar,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option as well. The
>>>>>>>>>>>>>>>>> segmentation fault I'm observing always happened during
>>>>>>>>>>>>>>>>> a collective communication indeed... does it basically switch all
>>>>>>>>>>>>>>>>> collective communication to basic mode?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a NCA?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand networking card)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in the basic module,
>>>>>>>>>>>>>>>>>> e.g.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I used to sometimes observe
>>>>>>>>>>>>>>>>>> a (similar?) problem in the openib btl
>>>>>>>>>>>>>>>>>> triggered from the tuned collective component, in
>>>>>>>>>>>>>>>>>> cases where the OFED libraries were installed but no
>>>>>>>>>>>>>>>>>> NCA was found on a node. It used to work, however, with
>>>>>>>>>>>>>>>>>> the basic component.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Edgar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> hi Rolf,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> unfortunately, I couldn't get rid of that annoying
>>>>>>>>>>>>>>>>>>> segmentation fault when selecting another bcast
>>>>>>>>>>>>>>>>>>> algorithm. I'm now going to replace MPI_Bcast with a naive
>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv) and see if that
>>>>>>>>>>>>>>>>>>> helps.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>> éloi
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Rolf,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed the
>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules option.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'll check whether the segmentation fault disappears when using
>>>>>>>>>>>>>>>>>>>> the basic bcast linear algorithm with the proper
>>>>>>>>>>>>>>>>>>>> command line you provided.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
>>>>>>>>>>>>>>>>>>>>> To select the different bcast algorithms, you need
>>>>>>>>>>>>>>>>>>>>> to add an extra mca parameter that tells the
>>>>>>>>>>>>>>>>>>>>> library to use dynamic selection. --mca
>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One way to make sure you are typing this in correctly is to
>>>>>>>>>>>>>>>>>>>>> use it with ompi_info. Do the following:
>>>>>>>>>>>>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
>>>>>>>>>>>>>>>>>>>>> You should see lots of output with all the
>>>>>>>>>>>>>>>>>>>>> different algorithms that can be selected for the
>>>>>>>>>>>>>>>>>>>>> various collectives. Therefore, you need this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Rolf
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed
>>>>>>>>>>>>>>>>>>>>>> switching to the basic linear
>>>>>>>>>>>>>>>>>>>>>> algorithm. Anyway, whatever the algorithm used,
>>>>>>>>>>>>>>>>>>>>>> the segmentation fault remains.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Could anyone give some advice on ways to diagnose the
>>>>>>>>>>>>>>>>>>>>>> issue I'm facing?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that seems
>>>>>>>>>>>>>>>>>>>>>>> to randomly segfault when using the openib btl. I'd like to
>>>>>>>>>>>>>>>>>>>>>>> know if there is any way to make OpenMPI switch to a
>>>>>>>>>>>>>>>>>>>>>>> different algorithm than the default one being
>>>>>>>>>>>>>>>>>>>>>>> selected for MPI_Bcast.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation fault during an
>>>>>>>>>>>>>>>>>>>>>>>> internode parallel computation involving the openib btl
>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be
>>>>>>>>>>>>>>>>>>>>>>>> observed with OpenMPI-1.3.3).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> mpirun (Open MPI) 1.4.2
>>>>>>>>>>>>>>>>>>>>>>>> Report bugs to
>>>>>>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/help/
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** Process received signal ***
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] Failing at address: (nil)
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
>>>>>>>>>>>>>>>>>>>>>>>> [pbn08:02624] *** End of error message ***
>>>>>>>>>>>>>>>>>>>>>>>> sh: line 1: 2624 Segmentation fault
>>>>>>>>>>>>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
>>>>>>>>>>>>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
>>>>>>>>>>>>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
>>>>>>>>>>>>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
>>>>>>>>>>>>>>>>>>>>>>>> '--mem=3200' '--threads=1'
>>>>>>>>>>>>>>>>>>>>>>>> '--errorlevel=FATAL' '--t_max=0.1'
>>>>>>>>>>>>>>>>>>>>>>>> '--parallel=domain'
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the openib btl (by using
>>>>>>>>>>>>>>>>>>>>>>>> --mca btl self,sm,tcp on the command line, for
>>>>>>>>>>>>>>>>>>>>>>>> instance), I don't encounter any problem and the
>>>>>>>>>>>>>>>>>>>>>>>> parallel computation runs flawlessly.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help to be able:
>>>>>>>>>>>>>>>>>>>>>>>> - to diagnose the issue I'm facing with the openib btl
>>>>>>>>>>>>>>>>>>>>>>>> - to understand why this issue is observed only when using
>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using self,sm,tcp
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much appreciated.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the configure
>>>>>>>>>>>>>>>>>>>>>>>> scripts of OpenMPI are enclosed with this email,
>>>>>>>>>>>>>>>>>>>>>>>> and some information
>>>>>>>>>>>>>>>>>>>>>>>> on the InfiniBand drivers as well.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used when launching a parallel
>>>>>>>>>>>>>>>>>>>>>>>> computation using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS
>>>>>>>>>>>>>>>>>>>>>>>> --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp --display-map --verbose
>>>>>>>>>>>>>>>>>>>>>>>> --version --mca mpi_warn_on_fork 0 --mca
>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support 0 [...]
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if not using infiniband:
>>>>>>>>>>>>>>>>>>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS
>>>>>>>>>>>>>>>>>>>>>>>> --hostfile host.list --mca
>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp --display-map --verbose
>>>>>>>>>>>>>>>>>>>>>>>> --version --mca
>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support 0
>>>>>>>>>>>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Eloi
>>>>>>>>>>>>>>>>>>>>>>>>

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


