
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-12-02 03:56:52


Ok, so I tried with RHEL5 and I get the same result (even at 6 nodes): when
setting ORTE_RELAY_DELAY to 1, I systematically get the deadlock with the
typical stack.

Without my "reproducer patch", 80 nodes was the minimum needed to reproduce
the bug (and you needed a couple of runs to get it). But since this is a
race condition, your mileage may vary on a different cluster.

With the patch, however, I hit it every time. I'll keep trying different
configurations (e.g. without slurm ...) to see if I can reproduce it on
more common configurations.

Sylvain

On Mon, 30 Nov 2009, Sylvain Jeaugey wrote:

> Ok. Maybe I should try on a RHEL5 then.
>
> About the compilers, I've tried with both gcc and intel and it doesn't seem
> to make a difference.
>
> On Mon, 30 Nov 2009, Ralph Castain wrote:
>
>> Interesting. The only difference I see is the FC11 - I haven't seen anyone
>> running on that OS yet. I wonder if that is the source of the trouble? Do
>> we know that our code works on that one? I know we had problems in the past
>> with FC9, for example, that required fixes.
>>
>> Also, what compiler are you using? I wonder if there is some optimization
>> issue here, or some weird interaction between FC11 and the compiler.
>>
>> On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote:
>>
>>> Hi Ralph,
>>>
>>> I'm also puzzled :-)
>>>
>>> Here is what I did today:
>>> * download the latest nightly build (openmpi-1.7a1r22241)
>>> * untar it
>>> * patch it with my "ORTE_RELAY_DELAY" patch
>>> * build it directly on the cluster (running FC11) with:
>>> ./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas
>>> --prefix=<some path in my home>
>>> make && make install
>>>
>>> * deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is
>>> broken on my machine) and run with:
>>> salloc -N 10 mpirun ./helloworld
>>>
>>> And ... still the same behaviour: OK by default, deadlock with the typical
>>> stack when setting ORTE_RELAY_DELAY to 1.
>>>
>>> About my previous e-mail, I was wrong about all components having a 0
>>> priority: it was based on default parameters reported by "ompi_info -a |
>>> grep routed". It seems that the truth is not always in ompi_info ...
>>>
>>> Sylvain
>>>
>>> On Fri, 27 Nov 2009, Ralph Castain wrote:
>>>
>>>>
>>>> On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:
>>>>
>>>>> Hi Ralph,
>>>>>
>>>>> I tried with the trunk and it makes no difference for me.
>>>>
>>>> Strange
>>>>
>>>>>
>>>>> Looking at potential differences, I found something strange. The bug
>>>>> may have something to do with the "routed" framework. I can reproduce
>>>>> the bug with binomial and direct, but not with cm and linear (you
>>>>> disabled the build of the latter in your configure options -- why?).
>>>>
>>>> You won't with cm because there is no relay. Likewise, direct doesn't
>>>> have a relay - so I'm really puzzled how you can see this behavior when
>>>> using the direct component???
>>>>
>>>> I disable components in my build to save memory. Every component we open
>>>> costs us memory that may or may not be recoverable during the course of
>>>> execution.
>>>>
>>>>>
>>>>> Btw, all components have a 0 priority and none is defined to be the
>>>>> default component. Which one is the default, then? binomial (as the
>>>>> first in alphabetical order)?
>>>>
>>>> I believe you must have a severely corrupted version of the code. The
>>>> binomial component has priority 70 so it will be selected as the default.
>>>>
>>>> Linear has priority 40, though it will only be selected if you say
>>>> ^binomial.
>>>>
>>>> CM and radix have special selection code in them so they will only be
>>>> selected when specified.
>>>>
>>>> Direct and slave have priority 0 to ensure they will only be selected
>>>> when specified
>>>>
>>>>>
>>>>> Can you check which one you are using and try with binomial explicitly
>>>>> chosen?
>>>>
>>>> I am using binomial for all my tests
>>>>
>>>>> From what you are describing, I think you either have a corrupted copy
>>>>> of the code, are picking up mis-matched versions, or something strange,
>>>>> as your experiences don't match what anyone else is seeing.
>>>>
>>>> Remember, the phase you are discussing here has nothing to do with the
>>>> native launch environment. This is dealing with the relative timing of
>>>> the application launch versus relaying the launch message itself - i.e.,
>>>> the daemons are already up and running before any of this starts. Thus,
>>>> this "problem" has nothing to do with how we launch the daemons. So, if
>>>> it truly were a problem in the code, we would see it on every environment
>>>> - torque, slurm, ssh, etc.
>>>>
>>>> We routinely launch jobs spanning hundreds to thousands of nodes without
>>>> problem. If this timing problem was as you have identified, then we would
>>>> see this constantly. Yet nobody is seeing it, and I cannot reproduce it
>>>> even with your reproducer.
>>>>
>>>> I honestly don't know what to suggest at this point. Any chance you are
>>>> picking up mis-matched OMPI versions on your backend nodes or something?
>>>> Tried fresh checkouts of the code? Is this a code base you have modified,
>>>> or are you seeing this with the "stock" code from the repo?
>>>>
>>>> Just fishing at this point - can't find anything wrong! :-/
>>>> Ralph
>>>>
>>>>
>>>>>
>>>>> Thanks for your time,
>>>>> Sylvain
>>>>>
>>>>> On Thu, 26 Nov 2009, Ralph Castain wrote:
>>>>>
>>>>>> Hi Sylvain
>>>>>>
>>>>>> Well, I hate to tell you this, but I cannot reproduce the "bug" even
>>>>>> with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use.
>>>>>> The system runs really slow as I increase the delay, but it completes
>>>>>> the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4
>>>>>> ppn, a "hello world" app that calls MPI_Init immediately upon
>>>>>> execution.
>>>>>>
>>>>>> So I have to conclude this is a problem in your setup/config. Are you
>>>>>> sure you didn't --enable-progress-threads?? That is the only way I can
>>>>>> recreate this behavior.
>>>>>>
>>>>>> I plan to modify the relay/message processing method anyway to clean it
>>>>>> up. But there doesn't appear to be anything wrong with the current
>>>>>> code.
>>>>>> Ralph
>>>>>>
>>>>>> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>>>>>>
>>>>>>> Hi Ralph,
>>>>>>>
>>>>>>> Thanks for your efforts. I will look at our configuration and see how
>>>>>>> it may differ from yours.
>>>>>>>
>>>>>>> Here is a patch which helps reproduce the bug even with a small
>>>>>>> number of nodes.
>>>>>>>
>>>>>>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>>>>>>> --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
>>>>>>> +++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
>>>>>>> @@ -126,6 +126,13 @@
>>>>>>>             ORTE_ERROR_LOG(ret);
>>>>>>>             goto CLEANUP;
>>>>>>>         }
>>>>>>> +        { /* Add delay to reproduce bug */
>>>>>>> +            char * str = getenv("ORTE_RELAY_DELAY");
>>>>>>> +            int sec = str ? atoi(str) : 0;
>>>>>>> +            if (sec) {
>>>>>>> +                sleep(sec);
>>>>>>> +            }
>>>>>>> +        }
>>>>>>>     }
>>>>>>>
>>>>>>> CLEANUP:
>>>>>>>
>>>>>>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the
>>>>>>> bug.
>>>>>>>
>>>>>>> During our experiments, the bug disappeared when we added a delay
>>>>>>> before calling MPI_Init. So, configurations where processes are
>>>>>>> launched slowly or take some time before MPI_Init should be immune to
>>>>>>> this bug.
>>>>>>>
>>>>>>> We usually reproduce the bug with one ppn (faster to spawn).
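>>>>>>>
>>>>>>> For reference, the helloworld we run is essentially the following (a minimal
>>>>>>> sketch, not our exact test code; the optional pre-MPI_Init delay, driven by a
>>>>>>> made-up MPI_INIT_DELAY variable, is only there to experiment with the timing):
>>>>>>>
>>>>>>> /* helloworld.c -- minimal sketch, not our exact test code */
>>>>>>> #include <mpi.h>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>> #include <unistd.h>
>>>>>>>
>>>>>>> int main(int argc, char **argv)
>>>>>>> {
>>>>>>>     /* optional delay before MPI_Init (hypothetical MPI_INIT_DELAY env var),
>>>>>>>      * to mimic processes that take some time before entering MPI_Init */
>>>>>>>     char *str = getenv("MPI_INIT_DELAY");
>>>>>>>     int sec = str ? atoi(str) : 0;
>>>>>>>     if (sec) {
>>>>>>>         sleep(sec);
>>>>>>>     }
>>>>>>>
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     int rank, size;
>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>     printf("Hello from %d/%d\n", rank, size);
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }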
>>>>>>>
>>>>>>> Sylvain
>>>>>>>
>>>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>>>>
>>>>>>>> Hi Sylvain
>>>>>>>>
>>>>>>>> I've spent several hours trying to replicate the behavior you
>>>>>>>> described on clusters up to a couple of hundred nodes (all running
>>>>>>>> slurm), without success. I'm becoming increasingly convinced that
>>>>>>>> this is a configuration issue as opposed to a code issue.
>>>>>>>>
>>>>>>>> I have enclosed the platform file I use below. Could you compare it
>>>>>>>> to your configuration? I'm wondering if there is something critical
>>>>>>>> about the config that may be causing the problem (perhaps we have a
>>>>>>>> problem in our default configuration).
>>>>>>>>
>>>>>>>> Also, is there anything else you can tell us about your
>>>>>>>> configuration? How many ppn triggers it, or do you always get the
>>>>>>>> behavior every time you launch over a certain number of nodes?
>>>>>>>>
>>>>>>>> Meantime, I will look into this further. I am going to introduce a
>>>>>>>> "slow down" param that will force the situation you encountered -
>>>>>>>> i.e., will ensure that the relay is still being sent when the daemon
>>>>>>>> receives the first collective input. We can then use that to try and
>>>>>>>> force replication of the behavior you are encountering.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> enable_dlopen=no
>>>>>>>> enable_pty_support=no
>>>>>>>> with_blcr=no
>>>>>>>> with_openib=yes
>>>>>>>> with_memory_manager=no
>>>>>>>> enable_mem_debug=yes
>>>>>>>> enable_mem_profile=no
>>>>>>>> enable_debug_symbols=yes
>>>>>>>> enable_binaries=yes
>>>>>>>> with_devel_headers=yes
>>>>>>>> enable_heterogeneous=no
>>>>>>>> enable_picky=yes
>>>>>>>> enable_debug=yes
>>>>>>>> enable_shared=yes
>>>>>>>> enable_static=yes
>>>>>>>> with_slurm=yes
>>>>>>>> enable_contrib_no_build=libnbc,vt
>>>>>>>> enable_visibility=yes
>>>>>>>> enable_memchecker=no
>>>>>>>> enable_ipv6=no
>>>>>>>> enable_mpi_f77=no
>>>>>>>> enable_mpi_f90=no
>>>>>>>> enable_mpi_cxx=no
>>>>>>>> enable_mpi_cxx_seek=no
>>>>>>>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>>>>>>>> enable_io_romio=no
>>>>>>>>
>>>>>>>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Ralph for this precious help.
>>>>>>>>>>
>>>>>>>>>> I set up a quick-and-dirty patch which basically postpones process_msg
>>>>>>>>>> (hence daemon_collective) until the launch is done. In process_msg,
>>>>>>>>>> I therefore requeue a process_msg handler and return.
>>>>>>>>>
>>>>>>>>> That is basically the idea I proposed, just done in a slightly
>>>>>>>>> different place
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In this "all-must-be-non-blocking-and-done-through-opal_progress"
>>>>>>>>>> algorithm, I don't think that blocking calls like the one in
>>>>>>>>>> daemon_collective should be allowed. This also applies to the
>>>>>>>>>> blocking one in send_relay. [Well, actually, one is okay; two may
>>>>>>>>>> lead to interlocking.]
>>>>>>>>>
>>>>>>>>> Well, that would be problematic - you will find "progressed_wait"
>>>>>>>>> used repeatedly in the code. Removing them all would take a -lot- of
>>>>>>>>> effort and a major rewrite. I'm not yet convinced it is required.
>>>>>>>>> There may be something strange in how you are set up, or in your cluster
>>>>>>>>> - like I said, this is the first report of a problem we have had,
>>>>>>>>> and people with much bigger slurm clusters have been running this
>>>>>>>>> code every day for over a year.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If you have time to do a nicer patch, that would be great and I would
>>>>>>>>>> be happy to test it. Otherwise, I will try to implement your idea
>>>>>>>>>> properly next week (with my limited knowledge of orted).
>>>>>>>>>
>>>>>>>>> Either way is fine - I'll see if I can get to it.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For the record, here is the patch I'm currently testing at large
>>>>>>>>>> scale:
>>>>>>>>>>
>>>>>>>>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>>>>>>>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
>>>>>>>>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
>>>>>>>>>> @@ -687,14 +687,6 @@
>>>>>>>>>>         opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> -    /* it may be possible to get here prior to having actually finished processing our
>>>>>>>>>> -     * local launch msg due to the race condition between different nodes and when
>>>>>>>>>> -     * they start their individual procs. Hence, we have to first ensure that we
>>>>>>>>>> -     * -have- finished processing the launch msg, or else we won't know whether
>>>>>>>>>> -     * or not to wait before sending this on
>>>>>>>>>> -     */
>>>>>>>>>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>>>>>>>>> -
>>>>>>>>>>     /* unpack the collective type */
>>>>>>>>>>     n = 1;
>>>>>>>>>>     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>>>>>>>>>> @@ -894,6 +886,28 @@
>>>>>>>>>>
>>>>>>>>>>     proc = &mev->sender;
>>>>>>>>>>     buf = mev->buffer;
>>>>>>>>>> +
>>>>>>>>>> +    jobdat = NULL;
>>>>>>>>>> +    for (item = opal_list_get_first(&orte_local_jobdata);
>>>>>>>>>> +         item != opal_list_get_end(&orte_local_jobdata);
>>>>>>>>>> +         item = opal_list_get_next(item)) {
>>>>>>>>>> +        jobdat = (orte_odls_job_t*)item;
>>>>>>>>>> +
>>>>>>>>>> +        /* is this the specified job? */
>>>>>>>>>> +        if (jobdat->jobid == proc->jobid) {
>>>>>>>>>> +            break;
>>>>>>>>>> +        }
>>>>>>>>>> +    }
>>>>>>>>>> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>>>>>>>>>> +        /* it may be possible to get here prior to having actually finished processing our
>>>>>>>>>> +         * local launch msg due to the race condition between different nodes and when
>>>>>>>>>> +         * they start their individual procs. Hence, we have to first ensure that we
>>>>>>>>>> +         * -have- finished processing the launch msg. Requeue this event until it is done.
>>>>>>>>>> +         */
>>>>>>>>>> +        int tag = mev->tag;
>>>>>>>>>> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
>>>>>>>>>> +        return;
>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>>     /* is the sender a local proc, or a daemon relaying the collective? */
>>>>>>>>>>     if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
>>>>>>>>>>
>>>>>>>>>> Sylvain
>>>>>>>>>>
>>>>>>>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> Very strange. As I said, we routinely launch jobs spanning several
>>>>>>>>>>> hundred nodes without problem. You can see the platform files for
>>>>>>>>>>> that setup in contrib/platform/lanl/tlcc
>>>>>>>>>>>
>>>>>>>>>>> That said, it is always possible you are hitting some kind of race
>>>>>>>>>>> condition we don't hit. In looking at the code, one possibility
>>>>>>>>>>> would be to make all the communications flow through the daemon
>>>>>>>>>>> cmd processor in orte/orted_comm.c. This is the way it used to
>>>>>>>>>>> work until I reorganized the code a year ago for other reasons
>>>>>>>>>>> that never materialized.
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, the daemon collective has to wait until the local
>>>>>>>>>>> launch cmd has been completely processed so it can know whether or
>>>>>>>>>>> not to wait for contributions from local procs before sending
>>>>>>>>>>> along the collective message, so this kinda limits our options.
>>>>>>>>>>>
>>>>>>>>>>> About the only other thing you could do would be to not send the
>>>>>>>>>>> relay at all until -after- processing the local launch cmd. You
>>>>>>>>>>> can then remove the "wait" in the daemon collective as you will
>>>>>>>>>>> know how many local procs are involved, if any.
>>>>>>>>>>>
>>>>>>>>>>> I used to do it that way and it guarantees it will work. The
>>>>>>>>>>> negative is that we lose some launch speed as the next nodes in
>>>>>>>>>>> the tree don't get the launch message until this node finishes
>>>>>>>>>>> launching all its procs.
>>>>>>>>>>>
>>>>>>>>>>> The way around that, of course, would be to:
>>>>>>>>>>>
>>>>>>>>>>> 1. process the launch message, thus extracting the number of any
>>>>>>>>>>> local procs and setting up all data structures...but do -not-
>>>>>>>>>>> launch the procs at this time (as this is what takes all the time)
>>>>>>>>>>>
>>>>>>>>>>> 2. send the relay - the daemon collective can now proceed without
>>>>>>>>>>> a "wait" in it
>>>>>>>>>>>
>>>>>>>>>>> 3. now launch the local procs
>>>>>>>>>>>
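>>>>>>>>>>> In rough pseudocode, that reordering would look something like this (just a
>>>>>>>>>>> sketch -- the names are made up, not the real odls/grpcomm entry points):
>>>>>>>>>>>
>>>>>>>>>>> /* sketch only: illustrative stand-ins, not the real ORTE code */
>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>
>>>>>>>>>>> static int num_local_procs = 0;
>>>>>>>>>>>
>>>>>>>>>>> /* step 1 helper: unpack the launch msg, record nprocs, no fork yet */
>>>>>>>>>>> static void setup_local_job_data(void) { num_local_procs = 2; }
>>>>>>>>>>> /* step 2 helper: pass the launch msg down the routing tree */
>>>>>>>>>>> static void send_relay(void) { printf("relay sent\n"); }
>>>>>>>>>>> /* step 3 helper: the slow part - actually spawning the procs */
>>>>>>>>>>> static void launch_local_procs(void) { printf("launching %d procs\n", num_local_procs); }
>>>>>>>>>>>
>>>>>>>>>>> static void process_launch_msg(void)
>>>>>>>>>>> {
>>>>>>>>>>>     /* 1. extract the local proc count and set up all data structures,
>>>>>>>>>>>      *    but do -not- launch anything yet */
>>>>>>>>>>>     setup_local_job_data();
>>>>>>>>>>>
>>>>>>>>>>>     /* 2. send the relay; a daemon collective arriving after this point
>>>>>>>>>>>      *    already knows how many local procs to wait for, so no "wait" */
>>>>>>>>>>>     send_relay();
>>>>>>>>>>>
>>>>>>>>>>>     /* 3. only now launch the local procs */
>>>>>>>>>>>     launch_local_procs();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> int main(void) { process_launch_msg(); return 0; }
>>>>>>>>>>>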
>>>>>>>>>>> It would be a fairly simple reorganization of the code in the
>>>>>>>>>>> orte/mca/odls area. I can do it this weekend if you like, or you
>>>>>>>>>>> can do it - either way is fine, but if you do it, please
>>>>>>>>>>> contribute it back to the trunk.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I would say I use the default settings, i.e. I don't set anything
>>>>>>>>>>>> "special" at configure.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>>>>>>>>>>
>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> How did you configure OMPI?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What launch mechanism are you using - ssh?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't think so, and I'm not doing it explicitly, at least.
>>>>>>>>>>>>>> How do I know?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We routinely launch across thousands of nodes without a
>>>>>>>>>>>>>>> problem...I have never seen it stick in this fashion.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Did you build and/or are you using ORTE threaded, by any chance? If
>>>>>>>>>>>>>>> so, that definitely won't work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are currently experiencing problems at launch on the 1.5
>>>>>>>>>>>>>>>> branch on a relatively large number of nodes (at least 80).
>>>>>>>>>>>>>>>> Some processes are not spawned and the orted processes are
>>>>>>>>>>>>>>>> deadlocked.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When MPI processes call MPI_Init before send_relay is complete, the
>>>>>>>>>>>>>>>> send_relay function and the daemon_collective function end up in a
>>>>>>>>>>>>>>>> nice interlock:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is the scenario:
>>>>>>>>>>>>>>>>> send_relay
>>>>>>>>>>>>>>>> performs the send tree:
>>>>>>>>>>>>>>>>> orte_rml_oob_send_buffer
>>>>>>>>>>>>>>>>> orte_rml_oob_send
>>>>>>>>>>>>>>>>> opal_wait_condition
>>>>>>>>>>>>>>>> Waiting on completion of the send, thus calling opal_progress()
>>>>>>>>>>>>>>>>> opal_progress()
>>>>>>>>>>>>>>>> But since a collective request had arrived from the network, it
>>>>>>>>>>>>>>>> entered:
>>>>>>>>>>>>>>>>> daemon_collective
>>>>>>>>>>>>>>>> However, daemon_collective is waiting for the job to be
>>>>>>>>>>>>>>>> initialized (wait on jobdat->launch_msg_processed) before
>>>>>>>>>>>>>>>> continuing, thus calling:
>>>>>>>>>>>>>>>>> opal_progress()
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At this time, the send may complete, but since we will never
>>>>>>>>>>>>>>>> go back to orte_rml_oob_send, we will never perform the
>>>>>>>>>>>>>>>> launch (setting jobdat->launch_msg_processed to 1).
>>>>>>>>>>>>>>>>
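>>>>>>>>>>>>>>>> To make the interlock easier to see outside of the real code, here is a
>>>>>>>>>>>>>>>> stripped-down sketch of the pattern (illustrative only -- none of these
>>>>>>>>>>>>>>>> are the real ORTE/OPAL functions, and the spin counter is only there so
>>>>>>>>>>>>>>>> the demo terminates instead of hanging):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /* interlock_sketch.c -- illustrative only */
>>>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> static int send_complete = 0;        /* completion flag of the relay send */
>>>>>>>>>>>>>>>> static int launch_msg_processed = 0; /* only set after send_relay returns */
>>>>>>>>>>>>>>>> static int pending_collective = 1;   /* a collective arrived from the net */
>>>>>>>>>>>>>>>> static int spins = 0;                /* bounds the demo so it terminates  */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> static void progress_sketch(void);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /* waits (by progressing) until the launch msg has been processed */
>>>>>>>>>>>>>>>> static void daemon_collective_sketch(void)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>     while (!launch_msg_processed && spins < 10) {  /* progressed wait #2 */
>>>>>>>>>>>>>>>>         spins++;
>>>>>>>>>>>>>>>>         progress_sketch();
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     printf("daemon_collective spun %d times, launch_msg_processed still %d\n",
>>>>>>>>>>>>>>>>            spins, launch_msg_processed);
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /* delivers pending events: the send completion AND the incoming collective */
>>>>>>>>>>>>>>>> static void progress_sketch(void)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>     send_complete = 1;          /* the send does complete...                */
>>>>>>>>>>>>>>>>     if (pending_collective) {   /* ...but the collective handler runs here, */
>>>>>>>>>>>>>>>>         pending_collective = 0; /* on top of us in the stack, and blocks    */
>>>>>>>>>>>>>>>>         daemon_collective_sketch();
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /* waits (by progressing) until its send completes, then finishes the launch */
>>>>>>>>>>>>>>>> static void send_relay_sketch(void)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>     while (!send_complete) {    /* progressed wait #1 */
>>>>>>>>>>>>>>>>         progress_sketch();
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     launch_msg_processed = 1;   /* unreachable until the inner wait gives up */
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>     send_relay_sketch();
>>>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>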
>>>>>>>>>>>>>>>> I may try to solve the bug (this is quite a top priority
>>>>>>>>>>>>>>>> problem for me), but maybe people who are more familiar with
>>>>>>>>>>>>>>>> orted than I am may propose a nice and clean solution ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For those who like real (and complete) gdb stacks, here they
>>>>>>>>>>>>>>>> are:
>>>>>>>>>>>>>>>> #0 0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>>>>>>>>>>> #1 0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>>>>>>>>>>> #2 0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>>>>>>>>>>>> #3 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>>>> #4 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>>>> #5 0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>>>> #6 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>>>> #7 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>>>> #8 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>>>> #9 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>>>>>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>>>>>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>>>>>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>>>>>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>>>>>>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>>>>>>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>>>>>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>>>>>>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>>>>>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>>>>>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>>>>>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>>>>>> Sylvain