
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-11-19 09:52:45


Thank you, Ralph, for this valuable help.

I put together a quick-and-dirty patch that basically postpones process_msg
(and hence daemon_collective) until the launch is done: in process_msg, I
requeue the message event (to be handled again by process_msg) and return.

In this "all-must-be-non-blocking-and-done-through-opal_progress"
algorithm, I don't think that blocking calls like the one in
daemon_collective should be allowed. This also applies to the blocking one
in send_relay. [Well, actually, one is okay, 2 may lead to interlocking.]
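
To make the interlock concrete, here is a minimal, self-contained model of the
failure mode (the names only mirror the ORTE ones; this is a simplified sketch,
not the actual code):

#include <stdbool.h>
#include <stdio.h>

static bool send_complete = false;        /* set when the relayed send finishes */
static bool launch_msg_processed = false; /* set after the local launch msg is processed */
static bool collective_pending = true;    /* a daemon collective arrived early */

static void daemon_collective(void);

/* stands in for opal_progress(): dispatches pending events, which may
 * recurse into a handler that itself blocks */
static void progress(void)
{
    if (collective_pending) {
        collective_pending = false;
        daemon_collective();
    }
    send_complete = true;                 /* the send does complete here... */
}

static void daemon_collective(void)
{
    /* ...but this wait can only be satisfied by the send_relay frame
     * sitting below us on the stack, so we spin here forever */
    while (!launch_msg_processed) {
        progress();
    }
}

static void send_relay(void)
{
    while (!send_complete) {              /* blocking send driven by progress */
        progress();
    }
    launch_msg_processed = true;          /* never reached */
}

int main(void)
{
    send_relay();                         /* hangs, as in the gdb stack below */
    printf("unreachable\n");
    return 0;
}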

If you have time to write a nicer patch, that would be great and I would be
happy to test it. Otherwise, I will try to implement your idea properly next
week (with my limited knowledge of orted).
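
For reference, here is a rough sketch of the reordering you describe below,
with stub functions standing in for the real odls/orted routines
(setup_job_from_launch_msg, relay_launch_msg and launch_local_procs are
placeholder names, not actual Open MPI symbols):

#include <stdio.h>

/* placeholders for the real routines in orte/mca/odls and orted_comm.c */
static int setup_job_from_launch_msg(void)
{
    printf("1. parse the launch msg, record the number of local procs\n");
    return 0;
}
static int relay_launch_msg(void)
{
    printf("2. relay the launch msg down the tree\n");
    return 0;
}
static int launch_local_procs(void)
{
    printf("3. fork/exec the local procs (the slow part)\n");
    return 0;
}

int main(void)
{
    /* the daemon collective can consult the proc count recorded in step 1,
     * so it no longer has to block waiting for the launch to finish */
    if (0 != setup_job_from_launch_msg()) return 1;
    if (0 != relay_launch_msg())          return 1;
    return launch_local_procs();
}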

For the record, here is the patch I'm currently testing at large scale:

diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
          opal_list_append(&orte_local_jobdata, &jobdat->super);
      }

-    /* it may be possible to get here prior to having actually finished processing our
-     * local launch msg due to the race condition between different nodes and when
-     * they start their individual procs. Hence, we have to first ensure that we
-     * -have- finished processing the launch msg, or else we won't know whether
-     * or not to wait before sending this on
-     */
-    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
      /* unpack the collective type */
      n = 1;
      if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
@@ -894,6 +886,28 @@

      proc = &mev->sender;
      buf = mev->buffer;
+
+    jobdat = NULL;
+    for (item = opal_list_get_first(&orte_local_jobdata);
+         item != opal_list_get_end(&orte_local_jobdata);
+         item = opal_list_get_next(item)) {
+        jobdat = (orte_odls_job_t*)item;
+
+        /* is this the specified job? */
+        if (jobdat->jobid == proc->jobid) {
+            break;
+        }
+    }
+    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
+        /* it may be possible to get here prior to having actually finished processing our
+         * local launch msg due to the race condition between different nodes and when
+         * they start their individual procs. Hence, we have to first ensure that we
+         * -have- finished processing the launch msg. Requeue this event until it is done.
+         */
+        int tag = mev->tag;
+        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
+        return;
+    }

      /* is the sender a local proc, or a daemon relaying the collective? */
      if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

> Very strange. As I said, we routinely launch jobs spanning several
> hundred nodes without problem. You can see the platform files for that
> setup in contrib/platform/lanl/tlcc
>
> That said, it is always possible you are hitting some kind of race
> condition we don't hit. In looking at the code, one possibility would be
> to make all the communications flow through the daemon cmd processor in
> orte/orted_comm.c. This is the way it used to work until I reorganized
> the code a year ago for other reasons that never materialized.
>
> Unfortunately, the daemon collective has to wait until the local launch
> cmd has been completely processed so it can know whether or not to wait
> for contributions from local procs before sending along the collective
> message, so this kinda limits our options.
>
> About the only other thing you could do would be to not send the relay
> at all until -after- processing the local launch cmd. You can then
> remove the "wait" in the daemon collective as you will know how many
> local procs are involved, if any.
>
> I used to do it that way and it guarantees it will work. The negative is
> that we lose some launch speed as the next nodes in the tree don't get
> the launch message until this node finishes launching all its procs.
>
> The way around that, of course, would be to:
>
> 1. process the launch message, thus extracting the number of any local
> procs and setting up all data structures...but do -not- launch the procs
> at this time (as this is what takes all the time)
>
> 2. send the relay - the daemon collective can now proceed without a
> "wait" in it
>
> 3. now launch the local procs
>
> It would be a fairly simple reorganization of the code in the
> orte/mca/odls area. I can do it this weekend if you like, or you can do
> it - either way is fine, but if you do it, please contribute it back to
> the trunk.
>
> Ralph
>
>
> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>
>> I would say I use the default settings, i.e. I don't set anything "special" at configure time.
>>
>> I'm launching my processes with SLURM (salloc + mpirun).
>>
>> Sylvain
>>
>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>
>>> How did you configure OMPI?
>>>
>>> What launch mechanism are you using - ssh?
>>>
>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>
>>>> I don't think so, and at least I'm not doing it explicitly. How do I know?
>>>>
>>>> Sylvain
>>>>
>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>
>>>>> We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion.
>>>>>
>>>>> Did you build and/or are you using ORTE threaded, by any chance? If so, that definitely won't work.
>>>>>
>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We are currently experiencing problems at launch time on the 1.5 branch with a relatively large number of nodes (at least 80). Some processes are not spawned and the orted processes are deadlocked.
>>>>>>
>>>>>> When MPI processes call MPI_Init before send_relay is complete, the send_relay function and the daemon_collective function end up in a nice interlock:
>>>>>>
>>>>>> Here is the scenario:
>>>>>>   send_relay
>>>>>>     performs the tree send:
>>>>>>   orte_rml_oob_send_buffer
>>>>>>   orte_rml_oob_send
>>>>>>   opal_condition_wait
>>>>>>     waits for the send to complete, thus calling opal_progress()
>>>>>>   opal_progress()
>>>>>>     since a collective request arrived from the network, we enter:
>>>>>>   daemon_collective
>>>>>>     daemon_collective waits for the job to be initialized (wait on
>>>>>>     jobdat->launch_msg_processed) before continuing, thus calling:
>>>>>>   opal_progress()
>>>>>>
>>>>>> At this point the send may complete, but since we never return to orte_rml_oob_send, we never perform the launch (which is what sets jobdat->launch_msg_processed to 1).
>>>>>>
>>>>>> I may try to solve the bug myself (this is quite a high-priority problem for me), but maybe people who are more familiar with orted than I am can propose a nice and clean solution...
>>>>>>
>>>>>> For those who like real (and complete) gdb stacks, here is one:
>>>>>> #0 0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>> #1 0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>> #2 0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>> #3 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>> #4 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>> #5 0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>> #6 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>> #7 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>> #8 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>> #9 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Sylvain