
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Routed 'unity' broken on trunk
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-03-31 13:01:18


On Mar 31, 2008, at 12:57 PM, Ralph H Castain wrote:

>
>
>
> On 3/31/08 9:28 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>> At the moment I only use unity with C/R. Mostly because I have not
>> verified that the other components work properly under the C/R
>> conditions. I can verify others, but that doesn't solve the problem
>> with the unity component. :/
>>
>> It is not critical that these jobs launch quickly, but it is critical
>> that they launch correctly for the moment. When you say 'slow the
>> launch', are you talking severely, as in seconds/minutes for small nps?
>
> I didn't say "severely" - I said "measurably". ;-)
>
> It will require an additional communication to the daemons to let them
> know how to talk to the procs. In the current unity component, the
> daemons never talk to the procs themselves, and so they don't know
> contact info for rank=0.

Ah, I see.
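
Just to make sure I follow: the procs end up spinning in the grpcomm
allgather waiting for modex data that the daemon can never deliver,
because it has no route to rank=0. Something like this rough sketch
(hypothetical names, not the actual grpcomm code) is the pattern that
shows up as frames #5/#6 in the backtraces below:

------------------------------------------------
/* Rough sketch with hypothetical names, not the actual grpcomm source.
 * Each proc contributes its modex data and then polls the event loop
 * until the gathered buffer arrives.  Under "unity", the daemon that
 * collected the contributions has no contact info for the rank=0 proc,
 * so this flag is never set and the loop spins forever. */
#include <stdbool.h>
#include "opal/runtime/opal_progress.h"

static volatile bool allgather_complete = false;   /* hypothetical flag */

static void wait_for_modex_allgather(void)
{
    while (!allgather_complete) {
        opal_progress();    /* drives the epoll-based event loop */
    }
}
------------------------------------------------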

>
>
>> I guess a
>> followup question is why did this component break in the first place?
>> or worded differently, what changed in ORTE such that the unity
>> component will suddenly deadlock when it didn't before?
>
> We are trying to improve scalability. The biggest issue is the modex,
> which we improved considerably by having the procs pass the modex info
> to the daemons, letting the daemons collect all modex info from the
> procs on their node, and then having the daemons send that info along
> to the rank=0 proc for collection and xcast.
>
> The problem is that in the unity component, the local daemons don't
> know how to send the modex to the rank=0 proc. So what I will now have
> to do is tell all the daemons how to talk to the procs, and then we
> will have every daemon opening a socket to rank=0. That's where the
> time will be lost.
>
> Our original expectation was to get everyone off of unity as quickly
> as possible - in fact, Brian and I had planned to completely remove
> that component as quickly as possible, as it (a) scales poorly and
> (b) gets in the way of things. It is very hard to keep it alive.
>
> So for now, I'll just do the simple thing and hopefully that will be
> adequate - let me know if/when you are able to get C/R working on
> other routed components.

Sounds good. I'll look into supporting the tree routed component, but
that will probably take a couple weeks.
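
While I work on that, the way I am picturing the difference between the
two components, as a rough sketch with made-up names rather than the
real routed framework interface, is that unity treats every peer as
directly reachable, while a daemon-routed component relays off-node
traffic through the local daemon (which is why those daemons already
know how to reach the procs):

------------------------------------------------
/* Rough sketch with hypothetical types and names, not the real routed API.
 * "unity": the next hop toward any peer is the peer itself, so daemons
 * never carry proc traffic and never learn proc contact info.
 * "tree"-style routing: the next hop for off-node traffic is the local
 * daemon, so the daemons already hold routes to the procs they host. */
typedef struct { int jobid; int vpid; } proc_name_t;   /* hypothetical */

static proc_name_t next_hop_unity(proc_name_t target)
{
    return target;              /* direct proc-to-proc connection */
}

static proc_name_t next_hop_tree(proc_name_t target, proc_name_t my_daemon)
{
    (void)target;               /* the destination does not change the relay */
    return my_daemon;           /* always relay through the local daemon */
}
------------------------------------------------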

Thanks for the clarification.

Cheers,
Josh

>
>
> Thanks!
> Ralph
>
>>
>> Thanks for looking into this,
>> Josh
>>
>> On Mar 31, 2008, at 11:10 AM, Ralph H Castain wrote:
>>
>>> I figured out the issue - there is a simple and a hard way to fix
>>> this. So before I do either, let me see what makes sense.
>>>
>>> The simple solution involves updating the daemons with contact info
>>> for the procs so that they can send their collected modex info to
>>> the rank=0 proc. This will measurably slow the launch when using
>>> unity.
>>>
>>> The hard solution is to do a hybrid routed approach whereby the
>>> daemons would route any daemon-to-proc communication while the procs
>>> continue to do direct proc-to-proc messaging.
>>>
>>> Is there some reason to be using the "unity" component? Do you care
>>> if jobs using unity launch slower?
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>>
>>> On 3/31/08 7:57 AM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>>>
>>>> Ralph,
>>>>
>>>> I've just noticed that the 'unity' routed component seems to be
>>>> broken when using more than one machine. I'm using Odin and r18028
>>>> of the trunk, and have confirmed that this problem occurs with both
>>>> SLURM and rsh. I think this break came in on Friday, as that is
>>>> when some of my MTT tests started to hang and fail, but I cannot
>>>> point to a specific revision at this point. The backtraces
>>>> (enclosed) of the processes point to the grpcomm allgather routine.
>>>>
>>>> The 'noop' program calls MPI_Init, sleeps, then calls MPI_Finalize.
>>>>
>>>> RSH example from odin023 - so no SLURM variables:
>>>> These work:
>>>> shell$ mpirun -np 2 -host odin023 noop -v 1
>>>> shell$ mpirun -np 2 -host odin023,odin024 noop -v 1
>>>> shell$ mpirun -np 2 -mca routed unity -host odin023 noop -v 1
>>>>
>>>> This hangs:
>>>> shell$ mpirun -np 2 -mca routed unity -host odin023,odin024 noop -v 1
>>>>
>>>>
>>>> If I attach to the 'noop' process on odin023 I get the following
>>>> backtrace:
>>>> ------------------------------------------------
>>>> (gdb) bt
>>>> #0 0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>> #1 0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b330, maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>> #2 0x0000002a95a1e7f7 in epoll_dispatch (base=0x506c30, arg=0x506910, tv=0x7fbfffe840) at epoll.c:210
>>>> #3 0x0000002a95a1c057 in opal_event_base_loop (base=0x506c30, flags=5) at event.c:779
>>>> #4 0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>> #5 0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>> #6 0x0000002a958b9e48 in orte_grpcomm_base_allgather (sbuf=0x7fbfffeae0, rbuf=0x7fbfffea80) at base/grpcomm_base_allgather.c:238
>>>> #7 0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
>>>> #8 0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffed58, requested=0, provided=0x7fbfffec38) at runtime/ompi_mpi_init.c:510
>>>> #9 0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffec7c, argv=0x7fbfffec70) at pinit.c:88
>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffed58) at noop.c:39
>>>> ------------------------------------------------
>>>>
>>>> The 'noop' process on odin024 has a similar backtrace:
>>>> ------------------------------------------------
>>>> (gdb) bt
>>>> #0 0x0000002a96226b39 in syscall () from /lib64/tls/libc.so.6
>>>> #1 0x0000002a95a1e485 in epoll_wait (epfd=3, events=0x50b390, maxevents=1023, timeout=1000) at epoll_sub.c:61
>>>> #2 0x0000002a95a1e7f7 in epoll_dispatch (base=0x506cc0, arg=0x506c20, tv=0x7fbfffe9d0) at epoll.c:210
>>>> #3 0x0000002a95a1c057 in opal_event_base_loop (base=0x506cc0, flags=5) at event.c:779
>>>> #4 0x0000002a95a1be8f in opal_event_loop (flags=5) at event.c:702
>>>> #5 0x0000002a95a0bef8 in opal_progress () at runtime/opal_progress.c:169
>>>> #6 0x0000002a958b97c5 in orte_grpcomm_base_allgather (sbuf=0x7fbfffec70, rbuf=0x7fbfffec10) at base/grpcomm_base_allgather.c:163
>>>> #7 0x0000002a958bd37c in orte_grpcomm_base_modex (procs=0x0) at base/grpcomm_base_modex.c:413
>>>> #8 0x0000002a956b8416 in ompi_mpi_init (argc=3, argv=0x7fbfffeee8, requested=0, provided=0x7fbfffedc8) at runtime/ompi_mpi_init.c:510
>>>> #9 0x0000002a956f2109 in PMPI_Init (argc=0x7fbfffee0c, argv=0x7fbfffee00) at pinit.c:88
>>>> #10 0x0000000000400bf4 in main (argc=3, argv=0x7fbfffeee8) at noop.c:39
>>>> ------------------------------------------------
>>>>
>>>>
>>>> Cheers,
>>>> Josh
>>>
>>
>