Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] callback debugging
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-21 15:27:55


That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no !). The problem is that orte-checkpoint is a tool, and so it isn't a daemon - but it is also not an app.
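
The difference between the two expressions can be illustrated with a minimal sketch. It does not use the real ORTE macros: the enum and helper names are hypothetical stand-ins, and the model assumes only three process types. It shows only that "not a daemon" is true for a tool, whereas "is an app" is false for it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for ORTE's process types; these are NOT the
 * real ORTE_PROC_IS_* macros, just a simplified model for illustration. */
typedef enum {
    PROC_DAEMON,  /* a daemon such as orted */
    PROC_APP,     /* an application process */
    PROC_TOOL     /* a tool such as orte-checkpoint */
} proc_type_t;

static bool proc_is_daemon(proc_type_t t) { return t == PROC_DAEMON; }
static bool proc_is_app(proc_type_t t)    { return t == PROC_APP; }

/* Models the "app" argument as "!ORTE_PROC_IS_DAEMON":
 * everything that is not a daemon is treated as an application. */
static bool app_flag_not_daemon(proc_type_t t) { return !proc_is_daemon(t); }

/* Models the "app" argument as "ORTE_PROC_IS_APP":
 * only real application processes qualify. */
static bool app_flag_is_app(proc_type_t t) { return proc_is_app(t); }
```

In this model, `app_flag_not_daemon(PROC_TOOL)` is true, which matches the `app=true` seen in the orte-checkpoint backtrace quoted below, while `app_flag_is_app(PROC_TOOL)` is false.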

On Jan 21, 2014, at 11:56 AM, Adrian Reber <adrian_at_[hidden]> wrote:

> Good to know that it does not make any sense. So it is not just me.
>
> Looking at the call chain I can see
>
> orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);
>
> and the second parameter is used to decide if it is an app or not:
>
> int orte_snapc_base_select(bool seed, bool app) in orte/mca/snapc/base/snapc_base_select.c
>
> and if it is true the code with the barrier is used.
>
> In orte/mca/snapc/base/snapc_base_select.c there is also the following
> comment:
>
> /* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once the framework system is in place */
>
> Is this something which needs to be changed and which might be the cause
> for this problem?
>
>
> On Tue, Jan 21, 2014 at 07:27:32AM -0800, Ralph Castain wrote:
>> That doesn't make any sense - I can't imagine a reason for orte-checkpoint itself to be running a barrier. I wonder if it is selecting the wrong component in snapc?
>>
>> As for the patch, that isn't going to work. The collective id has to be *globally* unique, which means that only orterun can issue a new one. So you have to get thru orte_init before you can request one as it requires a communication.
>>
>> However, like I said, it makes no sense for orte-checkpoint to do a barrier as it is a singleton - there is nothing for it to "barrier" with.
>>
>> On Jan 21, 2014, at 7:24 AM, Adrian Reber <adrian_at_[hidden]> wrote:
>>
>>> I think I still do not really understand how it works.
>>>
>>> The barrier on which orte-checkpoint is currently hanging is in
>>> app_coord_init(). You are also saying that orte-checkpoint
>>> should not be calling a barrier. The backtrace of the point where it
>>> is hanging now looks like:
>>>
>>> #0 0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
>>> #1 0x00007ffff7b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
>>> #2 0x00007ffff7b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
>>> #3 0x00007ffff7b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
>>> #4 0x00007ffff7a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
>>> #5 0x00007ffff7a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
>>> #6 0x00007ffff7a4647f in orte_init (pargc=0x7fffffffd94c, pargv=0x7fffffffd940, flags=8) at ../../orte/runtime/orte_init.c:158
>>> #7 0x0000000000402859 in ckpt_init (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
>>> #8 0x0000000000401d7a in main (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
>>>
>>> Maybe I am doing something completely wrong. I am currently
>>> running 'orterun -np 2 test-programm'.
>>>
>>> In another terminal I am starting orte-checkpoint with the PID of
>>> orterun and the barrier in app_coord_init() is just before it tries
>>> to communicate with orterun. Is this the correct setup?
>>>
>>> Adrian
>>>
>>> On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
>>>> If it is the application, then there is probably a barrier in the
>>>> app_coord_init() to make sure all the applications are up and running.
>>>> After this point then the global coordinator knows that the application can
>>>> be checkpointed.
>>>>
>>>> I don't think orte-checkpoint should be calling a barrier - from what I
>>>> recall.
>>>>
>>>>
>>>> On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> Is it orte-checkpoint that is hanging, or the app you are trying to
>>>>> checkpoint?
>>>>>
>>>>>
>>>>> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adrian_at_[hidden]> wrote:
>>>>>
>>>>> Thanks for your help. I tried initializing the barrier correctly (see
>>>>> attached patch) but now, instead of crashing, it just hangs on the
>>>>> barrier while running orte-checkpoint
>>>>>
>>>>> [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
>>>>> [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
>>>>>
>>>>> #0 0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
>>>>> #1 0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
>>>>> #2 0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
>>>>>
>>>>> it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
>>>>>
>>>>> I do not understand what the barrier here is actually waiting for. Where
>>>>> do I need to look to find out what it is waiting on?
>>>>>
>>>>> I also tried initializing the collective id's in
>>>>> orte/mca/plm/base/plm_base_launch_support.c but that code is never
>>>>> used when running the orte-checkpoint tool
>>>>>
>>>>> Adrian
>>>>>
>>>>> On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
>>>>>
>>>>> I took a look at this, and I'm afraid you have some work to do in the
>>>>> orte/mca/snapc code base:
>>>>>
>>>>> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See
>>>>> r30261 for an example of the changes that need to be made - I did some, but
>>>>> can't swear to catching them all. It was enough to at least get a proc past
>>>>> the initial snapc registration
>>>>>
>>>>> 2. you are reusing collective id's to execute several orte_grpcomm.barrier
>>>>> calls - those ids are used elsewhere during MPI_Init. This is not allowed -
>>>>> a collective id can only be used *once*. What you need to do is go into
>>>>> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add
>>>>> cr-specific collective id's for this purpose. I don't know how many places
>>>>> in the cr code create their own barriers, but they each need a collective
>>>>> id.
>>>>>
>>>>> If you prefer and have the time, you are welcome to extend the collective
>>>>> code to allow id reuse. This would require that each daemon and app "reset"
>>>>> the collective fields when a collective is declared complete. It isn't that
>>>>> hard to do - just never had a reason to do it. I can take a shot at it when
>>>>> time permits (may have some time this weekend)
>>>>>
>>>>> 3. when you post the non-blocking recv in the snapc/full code, it looks to
>>>>> me like you need to block until you get the answer. I don't know where in
>>>>> the code flow this is occurring - if you are not in an event, then it is
>>>>> okay to block using ORTE_WAIT_FOR_COMPLETION. Look in
>>>>> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
>>>>>
>>>>> HTH
>>>>> Ralph
>>>>>
>>>>> On Jan 10, 2014, at 12:55 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>> On Jan 10, 2014, at 12:45 PM, Adrian Reber <adrian_at_[hidden]> wrote:
>>>>>
>>>>> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
>>>>>
>>>>>
>>>>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adrian_at_[hidden]> wrote:
>>>>>
>>>>> I am currently trying to understand how callbacks work. Right now
>>>>> I am looking at orte_rml_base_comm_start() in
>>>>> orte/mca/rml/base/rml_base_receive.c, which does
>>>>>
>>>>> orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
>>>>>                         ORTE_RML_TAG_RML_INFO_UPDATE,
>>>>>                         ORTE_RML_PERSISTENT,
>>>>>                         orte_rml_base_recv,
>>>>>                         NULL);
>>>>>
>>>>> As far as I understand it orte_rml_base_recv() is the callback function.
>>>>> At which point should this function run? When the data is actually
>>>>> received?
>>>>>
>>>>>
>>>>> Not precisely. When data is received by the OOB, it pushes the data into
>>>>> an event. When that event gets serviced, it calls the orte_rml_base_receive
>>>>> function which processes the data to find the matching tag, and then uses
>>>>> that to execute the callback to the user code.
>>>>>
>>>>>
>>>>> The same for send_buffer_nb() functions. I do not see the callback
>>>>> functions actually running. How can I verify that the callback functions
>>>>> are running? Especially for the send case it sounds pretty obvious how
>>>>> it should work but I never see the callback function running. At least
>>>>> in my setup.
>>>>>
>>>>>
>>>>> The data is not immediately sent. It gets pushed into an event. When that
>>>>> event gets serviced, it calls the orte_oob_base_send function which then
>>>>> passes the data to each active OOB component until one of them says it can
>>>>> send it. The data is then pushed into another event to get it into the
>>>>> event base for that component's active module - when that event gets
>>>>> serviced, the data is sent. Once the data is sent, an event is created
>>>>> that, when serviced, executes the callback to the user code.
>>>>>
>>>>> If you aren't seeing callbacks, the most likely cause is that the orte
>>>>> progress thread isn't running. Without it, none of this will work.
>>>>>
>>>>>
>>>>> Thanks. Running configure without '--with-ft=cr' I can run a program and
>>>>> use orte-top. In orterun I can see that the callback is running and
>>>>> orte-top displays the retrieved information. I can also see in orte-top
>>>>> that the callbacks are working.
>>>>>
>>>>>
>>>>> Actually, I'm rather impressed - I hadn't tested orte-top and didn't
>>>>> honestly know if it would work any more! Glad to hear it does :-)
>>>>>
>>>>> Doing the same with '--with-ft=cr' enabled, orte-top crashes, as does
>>>>> orte-checkpoint. Both (-top and -checkpoint) seem to no longer have
>>>>> working callbacks, which is probably why they are crashing. So some code
>>>>> that is enabled by '--with-ft=cr' seems to break callbacks in orte-top
>>>>> as well as in orte-checkpoint. orterun handles callbacks whether or not
>>>>> it is configured with '--with-ft=cr'.
>>>>>
>>>>>
>>>>> I can take a look this weekend - probably something silly
>>>>>
>>>>>
>>>>> Adrian
>>>>>
>>>>> <grpcomm.txt>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Joshua Hursey
>>>> Assistant Professor of Computer Science
>>>> University of Wisconsin-La Crosse
>>>> http://cs.uwlax.edu/~jjhursey
>>>
>>>
>>>
>>> Adrian
>>>
>>> --
>>> Adrian Reber <adrian_at_[hidden]> http://lisas.de/~adrian/
>>> QOTD:
>>> "I tried buying a goat instead of a lawn tractor; had to return
>>> it though. Couldn't figure out a way to connect the snow blower."
>>
>
> Adrian
>
> --
> Adrian Reber <adrian_at_[hidden]> http://lisas.de/~adrian/
> Hempstone's Question:
> If you have to travel on the Titanic, why not go first class?