Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Routed 'unity' broken on trunk
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-03-31 13:43:28


Okay - fixed with r18040

Thanks
Ralph

On 3/31/08 11:01 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:

>
> On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote:
>
>>
>> On 3/31/08 9:28 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>
>>> At the moment I only use unity with C/R, mostly because I have not
>>> verified that the other components work properly under the C/R
>>> conditions. I can verify the others, but that doesn't solve the problem
>>> with the unity component. :/
>>>
>>> For the moment it is not critical that these jobs launch quickly, only
>>> that they launch correctly. When you say 'slow the launch', are you
>>> talking severely, as in seconds/minutes for small nps?
>>
>> I didn't say "severely" - I said "measurably". ;-)
>>
>> It will require an additional communication to the daemons to let them
>> know how to talk to the procs. In the current unity component, the
>> daemons never talk to the procs themselves, and so they don't know the
>> contact info for rank=0.
>
> Ah, I see.
>
>>
>>
>>> I guess a follow-up question is: why did this component break in the
>>> first place? Or, worded differently, what changed in ORTE such that the
>>> unity component suddenly deadlocks when it didn't before?
>>
>> We are trying to improve scalability. The biggest issue is the modex,
>> which we improved considerably by having the procs pass their modex info
>> to the daemons, letting the daemons collect all modex info from the
>> procs on their node, and then having the daemons send that info along to
>> the rank=0 proc for collection and xcast.
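>>
>> In other words, the overall data flow is a gather to the rank=0 proc
>> followed by a broadcast. Here is a stripped-down MPI sketch of that flow
>> (purely illustrative - hypothetical names, not the actual grpcomm code;
>> the per-node daemon aggregation only appears in the comments):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>
>>     int rank, size;
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     /* Stage 1: every proc produces its local "modex" contribution
>>        (in ORTE this would be BTL contact info and friends). */
>>     char blob[64];
>>     snprintf(blob, sizeof(blob), "contact-info-for-rank-%d", rank);
>>
>>     /* Stage 2: contributions are funneled to rank 0.  In the real code
>>        each proc hands its blob to the local daemon, the daemon
>>        aggregates per node, and the daemons forward to rank 0. */
>>     char *all = malloc((size_t)size * sizeof(blob));
>>     MPI_Gather(blob, (int)sizeof(blob), MPI_CHAR,
>>                all, (int)sizeof(blob), MPI_CHAR, 0, MPI_COMM_WORLD);
>>
>>     /* Stage 3: rank 0 xcasts (broadcasts) the collected table back. */
>>     MPI_Bcast(all, size * (int)sizeof(blob), MPI_CHAR, 0, MPI_COMM_WORLD);
>>
>>     free(all);
>>     MPI_Finalize();
>>     return 0;
>> }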
>>
>> The problem is that in the unity component, the local daemons don't know
>> how to send the modex to the rank=0 proc. So what I will now have to do
>> is tell all the daemons how to talk to the procs, and then we will have
>> every daemon opening a socket to rank=0. That's where the time will be
>> lost.
>>
>> Our original expectation was to get everyone off of unity as quickly as
>> possible - in fact, Brian and I had planned to completely remove that
>> component as soon as possible, as it (a) scales poorly and (b) gets in
>> the way of things. It is very hard to keep alive.
>>
>> So for now, I'll just do the simple thing and hopefully that will be
>> adequate - let me know if/when you are able to get C/R working on other
>> routed components.
>
> Sounds good. I'll look into supporting the tree routed component, but
> that will probably take a couple of weeks.
>
> Thanks for the clarification.
>
> Cheers,
> Josh
>
>>
>>
>> Thanks!
>> Ralph
>>
>>>
>>> Thanks for looking into this,
>>> Josh
>>>
>>> On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote:
>>>
>>>> I figured out the issue - there is a simple way and a hard way to fix
>>>> this, so before I do either, let me see what makes sense.
>>>>
>>>> The simple solution involves updating the daemons with contact info for
>>>> the procs so that they can send their collected modex info to the
>>>> rank=0 proc. This will measurably slow the launch when using unity.
>>>>
>>>> The hard solution is to do a hybrid routed approach whereby the daemons
>>>> would route any daemon-to-proc communication while the procs continue
>>>> to do direct proc-to-proc messaging.
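>>>>
>>>> The routing decision in such a hybrid component would look roughly
>>>> like the standalone sketch below (hypothetical types and names, not
>>>> the actual routed framework API):
>>>>
>>>> #include <stdio.h>
>>>> #include <stdbool.h>
>>>>
>>>> typedef struct { int vpid; bool is_daemon; } name_t;
>>>>
>>>> /* Pretend proc with vpid N lives on the node served by daemon N/2. */
>>>> static name_t daemon_of(name_t proc)
>>>> {
>>>>     name_t d = { proc.vpid / 2, true };
>>>>     return d;
>>>> }
>>>>
>>>> /* Pick the next hop for a message from "me" to "target". */
>>>> static name_t get_route(name_t me, name_t target)
>>>> {
>>>>     if (me.is_daemon && !target.is_daemon)
>>>>         return daemon_of(target); /* daemon -> proc: relay via daemon */
>>>>     return target;                /* everything else: direct (unity) */
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     name_t daemon1 = { 1, true  };
>>>>     name_t proc3   = { 3, false };
>>>>     name_t proc0   = { 0, false };
>>>>
>>>>     name_t hop = get_route(daemon1, proc0);
>>>>     printf("daemon 1 -> proc 0: next hop is %s %d\n",
>>>>            hop.is_daemon ? "daemon" : "proc", hop.vpid);
>>>>
>>>>     hop = get_route(proc3, proc0);
>>>>     printf("proc 3 -> proc 0: next hop is %s %d\n",
>>>>            hop.is_daemon ? "daemon" : "proc", hop.vpid);
>>>>     return 0;
>>>> }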
>>>>
>>>> Is there some reason to be using the "unity" component? Do you care if
>>>> jobs using unity launch slower?
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On 3/31/08 7:57 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> I've just noticed that the 'unity' routed component seems to be broken
>>>>> when using more than one machine. I'm using Odin and r18028 of the
>>>>> trunk, and have confirmed that this problem occurs with both SLURM and
>>>>> rsh. I think this break came in on Friday, as that is when some of my
>>>>> MTT tests started to hang and fail, but I cannot point to a specific
>>>>> revision at this point. The backtraces (enclosed) of the processes
>>>>> point to the grpcomm allgather routine.
>>>>>
>>>>> The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
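>>>>>
>>>>> Roughly like this (the real noop.c may differ; argument handling such
>>>>> as the -v option is omitted here):
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Init(&argc, &argv);  /* the hang shows up here: the modex
>>>>>                                 inside MPI_Init never completes */
>>>>>     int rank;
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     printf("rank %d up, sleeping\n", rank);
>>>>>
>>>>>     sleep(30);               /* long enough to attach a debugger */
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }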
>>>>>
>>>>> RSH example from odin023 - so no SLURM variables:
>>>>> These work:
>>>>> shell$ mpirun -np 2 -host odin023 noop -v 1
>>>>> shell$ mpirun -np 2 -host odin023,odin024 noop -v 1
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023 noop -v 1
>>>>>
>>>>> This hangs:
>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023,odin024 noop -v 1
>>>>>
>>>>>
>>>>> If I attach to the 'noop' process on odin023 I get the following
>>>>> backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910,
>>>>>     tv=0x7fbfffe840) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0,
>>>>>     rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at
>>>>>     base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58,
>>>>>     requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c,
>>>>>     argv=0x7fbfffec70) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
>>>>> ------------------------------------------------
>>>>>
>>>>> The 'noop' process on odin024 has a similar backtrace:
>>>>> ------------------------------------------------
>>>>> (gdb) bt
>>>>> #0  0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>> #1  0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390,
>>>>>     maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>> #2  0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20,
>>>>>     tv=0x7fbfffe9d0) at epoll.c:210
>>>>> #3  0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0,
>>>>>     flags=5) at event.c:779
>>>>> #4  0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>> #5  0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>> #6  0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70,
>>>>>     rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
>>>>> #7  0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at
>>>>>     base/grpcomm_base_modex.c:413
>>>>> #8  0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8,
>>>>>     requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
>>>>> #9  0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c,
>>>>>     argv=0x7fbfffee00) at pinit.c:88
>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
>>>>> ------------------------------------------------
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Josh
>>>>
>>>
>>
>