
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Routed 'unity' broken on trunk
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-03-31 14:41:16


Looks good. Thanks for the fix.

Cheers,
Josh

On Mar 31, 2008, at 1:43 PM, Ralph H Castain wrote:
> Okay - fixed with r18040
>
> Thanks
> Ralph
>
>
> On 3/31/08 11:01 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>>
>> On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote:
>>
>>>
>>>
>>>
>>> On 3/31/08 9:28 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>
>>>> At the moment I only use unity with C/R. Mostly because I have not
>>>> verified that the other components work properly under the C/R
>>>> conditions. I can verify others, but that doesn't solve the problem
>>>> with the unity component. :/
>>>>
>>>> It is not critical that these jobs launch quickly, but that they
>>>> launch correctly for the moment. When you say 'slow the launch' are
>>>> you talking severely as in seconds/minutes for small nps?
>>>
>>> I didn't say "severely" - I said "measurably". ;-)
>>>
>>> It will require an additional communication to the daemons to let them
>>> know how to talk to the procs. In the current unity component, the
>>> daemons never talk to the procs themselves, and so they don't know
>>> contact info for rank=0.
>>
>> ah I see.
>>
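
For illustration only, here is a minimal C sketch of the extra step Ralph describes above: a daemon keeps a rank-to-URI contact table, mpirun pushes the missing entries down in an additional message, and only then can the daemon open a connection to rank 0 and forward the modex. Every name, type, and URI below is invented for this example; it is a sketch of the idea, not ORTE code.

------------------------------------------------
#include <stdio.h>

#define MAX_PROCS 16

/* One entry of a hypothetical daemon-side contact table. */
struct contact_entry {
    int  rank;
    char uri[64];   /* e.g. "tcp://odin023:42817" (made-up value) */
};

static struct contact_entry table[MAX_PROCS];
static int table_len = 0;

/* Return the stored URI for a rank, or NULL if this daemon never learned it
 * (the situation under unity, where daemons never talk to the procs). */
static const char *lookup_uri(int rank)
{
    for (int i = 0; i < table_len; i++) {
        if (table[i].rank == rank) {
            return table[i].uri;
        }
    }
    return NULL;
}

/* The "additional communication": mpirun pushes a rank/URI pair to the daemon. */
static void update_contact_info(int rank, const char *uri)
{
    if (table_len < MAX_PROCS) {
        table[table_len].rank = rank;
        snprintf(table[table_len].uri, sizeof table[table_len].uri, "%s", uri);
        table_len++;
    }
}

int main(void)
{
    const char *uri = lookup_uri(0);
    printf("rank 0 URI before update: %s\n",
           uri ? uri : "(unknown - the relay to rank 0 would block)");

    update_contact_info(0, "tcp://odin023:42817");
    printf("rank 0 URI after update:  %s\n", lookup_uri(0));
    return 0;
}
------------------------------------------------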
>>>
>>>
>>>> I guess a followup question is why did this component break in the
>>>> first place? Or worded differently, what changed in ORTE such that
>>>> the unity component will suddenly deadlock when it didn't before?
>>>
>>> We are trying to improve scalability. Biggest issue is the modex,
>>> which we improved considerably by having the procs pass the modex info
>>> to the daemons, letting the daemons collect all modex info from procs
>>> on their node, and then having the daemons send that info along to the
>>> rank=0 proc for collection and xcast.
>>>
>>> Problem is that in the unity component, the local daemons don't know
>>> how to send the modex to the rank=0 proc. So what I will now have to
>>> do is tell all the daemons how to talk to the procs, and then we will
>>> have every daemon opening a socket to rank=0. That's where the time
>>> will be lost.
>>>
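
The communication shape Ralph describes (procs feed their modex data toward the rank=0 proc for collection, and the combined set is then xcast back out) can be mimicked with plain MPI collectives. The sketch below is only an analogy of that gather-then-broadcast pattern: it runs against any MPI implementation, but it skips the per-node daemon hop and is not the ORTE grpcomm code.

------------------------------------------------
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CARD_LEN 64

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each proc's "business card" stands in for its modex contribution. */
    char card[CARD_LEN];
    snprintf(card, sizeof card, "contact info for rank %d", rank);

    /* Collection at rank 0 (stand-in for the daemons relaying their
     * node-level modex data to the rank=0 proc). */
    char *all = malloc((size_t)size * CARD_LEN);
    MPI_Gather(card, CARD_LEN, MPI_CHAR, all, CARD_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* The xcast step: every proc ends up with the full set. */
    MPI_Bcast(all, size * CARD_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    printf("rank %d sees entry 0: \"%s\"\n", rank, all);
    free(all);
    MPI_Finalize();
    return 0;
}
------------------------------------------------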
>>> Our original expectation was to get everyone off of unity as quickly
>>> as possible - in fact, Brian and I had planned to completely remove
>>> that component as quickly as possible as it (a) scales ugly and (b)
>>> gets in the way of things. Very hard to keep it alive.
>>>
>>> So for now, I'll just do the simple thing and hopefully that will be
>>> adequate - let me know if/when you are able to get C/R working on
>>> other routed components.
>>
>> Sounds good. I'll look into supporting the tree routed component, but
>> that will probably take a couple weeks.
>>
>> Thanks for the clarification.
>>
>> Cheers,
>> Josh
>>
>>>
>>>
>>> Thanks!
>>> Ralph
>>>
>>>>
>>>> Thanks for looking into this,
>>>> Josh
>>>>
>>>> On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote:
>>>>
>>>>> I figured out the issue - there is a simple and a hard way to fix
>>>>> this. So before I do, let me see what makes sense.
>>>>>
>>>>> The simple solution involves updating the daemons with contact info
>>>>> for the procs so that they can send their collected modex info to
>>>>> the rank=0 proc. This will measurably slow the launch when using
>>>>> unity.
>>>>>
>>>>> The hard solution is to do a hybrid routed approach whereby the
>>>>> daemons would route any daemon-to-proc communication while the
>>>>> procs continue to do direct proc-to-proc messaging.
>>>>>
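
As a rough illustration of the hybrid approach, the routing decision could look like the sketch below: daemon-to-proc traffic is relayed through the destination proc's local daemon, while proc-to-proc traffic stays direct, as it does under unity today. All identifiers are invented for the example; this is a sketch of the routing rule, not a proposed implementation.

------------------------------------------------
#include <stdio.h>
#include <stdbool.h>

struct peer {
    int  id;            /* made-up identifier */
    bool is_daemon;
    int  local_daemon;  /* id of the daemon on this peer's node */
};

/* Pick the next hop a sender uses to reach dest under the hybrid scheme. */
static int next_hop(struct peer sender, struct peer dest)
{
    if (sender.is_daemon && !dest.is_daemon && sender.id != dest.local_daemon) {
        return dest.local_daemon;   /* daemon -> remote proc: relay via its daemon */
    }
    return dest.id;                 /* everything else stays direct */
}

int main(void)
{
    struct peer daemon_a = { 100, true,  100 };  /* daemon on node A */
    struct peer rank0    = {   0, false, 101 };  /* rank 0 lives on node B */
    struct peer rank1    = {   1, false, 100 };  /* rank 1 lives on node A */

    printf("daemon 100 -> rank 0: next hop %d\n", next_hop(daemon_a, rank0)); /* 101 */
    printf("rank 1     -> rank 0: next hop %d\n", next_hop(rank1, rank0));    /* 0 */
    return 0;
}
------------------------------------------------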
>>>>> Is there some reason to be using the "unity" component? Do you care
>>>>> if jobs using unity launch slower?
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> On 3/31/08 7:57 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> I've just noticed that the 'unity' routed component seems to be
>>>>>> broken when using more than one machine. I'm using Odin and r18028
>>>>>> of the trunk, and have confirmed that this problem occurs with both
>>>>>> SLURM and rsh. I think this break came in on Friday, as that is
>>>>>> when some of my MTT tests started to hang and fail, but I cannot
>>>>>> point to a specific revision at this point. The backtraces
>>>>>> (enclosed) of the processes point to the grpcomm allgather routine.
>>>>>>
>>>>>> The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
>>>>>>
>>>>>> RSH example from odin023 - so no SLURM variables:
>>>>>> These work:
>>>>>> shell$ mpirun -np 2 -host odin023 noop -v 1
>>>>>> shell$ mpirun -np 2 -host odin023,odin024 noop -v 1
>>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023 noop -v 1
>>>>>>
>>>>>> This hangs:
>>>>>> shell$ mpirun -np 2 -mca routed unity -host odin023,odin024 noop -v 1
>>>>>>
>>>>>>
>>>>>> If I attach to the 'noop' process on odin023 I get the following
>>>>>> backtrace:
>>>>>> ------------------------------------------------
>>>>>> (gdb) bt
>>>>>> #0 0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>>> #1 0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330, maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>>> #2 0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910, tv=0x7fbfffe840) at epoll.c:210
>>>>>> #3 0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30, flags=5) at event.c:779
>>>>>> #4 0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>>> #5 0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>>> #6 0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0, rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
>>>>>> #7 0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
>>>>>> #8 0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58, requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
>>>>>> #9 0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c, argv=0x7fbfffec70) at pinit.c:88
>>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> The 'noop' process on odin024 has a similar backtrace:
>>>>>> ------------------------------------------------
>>>>>> (gdb) bt
>>>>>> #0 0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>>>> #1 0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390, maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>>>> #2 0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20, tv=0x7fbfffe9d0) at epoll.c:210
>>>>>> #3 0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0, flags=5) at event.c:779
>>>>>> #4 0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>>>> #5 0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>>>> #6 0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70, rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
>>>>>> #7 0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
>>>>>> #8 0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8, requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
>>>>>> #9 0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c, argv=0x7fbfffee00) at pinit.c:88
>>>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
>>>>>> ------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Josh
>>>>>
>>>>
>>>
>>
>