Thanks for your help. I tried initializing the barrier correctly (see
attached patch), but now, instead of crashing, it just hangs on the
barrier while running orte-checkpoint:
[dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
[dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
#0 0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
#2 0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
It hangs, looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
I do not understand what the barrier here is actually waiting for. Where
do I need to look to find out what it is waiting on?
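To make the question concrete, here is a minimal stand-alone model of how I understand the wait. This is only a sketch, not ORTE code: model_coll_t, model_coll_contribute() and model_wait() are hypothetical names, and I am assuming (from the nanosleep frame in the backtrace) that the macro simply polls the flag with a short sleep until some callback clears it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a collective tracker; NOT the ORTE struct. */
typedef struct {
    volatile bool active;  /* stays true until the collective completes */
    int expected;          /* contributions needed to finish */
    int arrived;           /* contributions seen so far */
} model_coll_t;

void model_coll_init(model_coll_t *c, int expected)
{
    assert(expected > 0);
    c->active = true;
    c->expected = expected;
    c->arrived = 0;
}

/* Stands in for the receive callback that would run when a peer's
 * contribution arrives. Only this function ever clears the flag. */
void model_coll_contribute(model_coll_t *c)
{
    c->arrived++;
    if (c->arrived >= c->expected) {
        c->active = false;
    }
}

/* Bounded version of the polling wait (assumed shape of the macro).
 * Returns true if the flag was cleared, false if we gave up. */
bool model_wait(model_coll_t *c, int max_polls)
{
    struct timespec tp = {0, 100000};  /* short quiet period per poll */
    for (int i = 0; i < max_polls && c->active; i++) {
        nanosleep(&tp, NULL);
    }
    return !c->active;
}
```

In this model the wait can only finish if every expected contribution is delivered to the callback. If one participant never sends, or the callback is never run, the flag stays set and an unbounded loop would spin forever, which looks like what I am seeing.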
I also tried initializing the collective ids in
orte/mca/plm/base/plm_base_launch_support.c, but that code is never
executed when running the orte-checkpoint tool.
On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> I took a look at this, and I'm afraid you have some work to do in the orte/mca/snapc code base:
> 1. you must use dynamically allocated buffers for rml.send_buffer_nb. See r30261 for an example of the changes that need to be made - I did some, but can't swear to catching them all. It was enough to at least get a proc past the initial snapc registration
> 2. you are reusing collective id's to execute several orte_grpcomm.barrier calls - those ids are used elsewhere during MPI_Init. This is not allowed - a collective id can only be used *once*. What you need to do is go into orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add cr-specific collective id's for this purpose. I don't know how many places in the cr code create their own barriers, but they each need a collective id.
> If you prefer and have the time, you are welcome to extend the collective code to allow id reuse. This would require that each daemon and app "reset" the collective fields when a collective is declared complete. It isn't that hard to do - just never had a reason to do it. I can take a shot at it when time permits (may have some time this weekend)
> 3. when you post the non-blocking recv in the snapc/full code, it looks to me like you need to block until you get the answer. I don't know where in the code flow this is occurring - if you are not in an event, then it is okay to block using ORTE_WAIT_FOR_COMPLETION. Look in orte/mca/routed/base/routed_base_fns.c starting at line 252 for an example.
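To check my understanding of point 2, here is a small stand-alone model of single-use collective ids plus the "reset on completion" extension suggested above. It is a sketch only: model_id_table_t, model_id_claim() and model_id_reset() are hypothetical names and do not exist in ORTE, and the fixed table size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_IDS 8  /* arbitrary size for the sketch */

/* Hypothetical bookkeeping: one "already used" flag per collective id. */
typedef struct {
    bool used[MODEL_MAX_IDS];
} model_id_table_t;

void model_ids_init(model_id_table_t *t)
{
    assert(t != NULL);
    for (int i = 0; i < MODEL_MAX_IDS; i++) {
        t->used[i] = false;
    }
}

/* Claim an id for one collective. Returns false if the id is out of
 * range or has already been used, which models the single-use rule. */
bool model_id_claim(model_id_table_t *t, int id)
{
    if (id < 0 || id >= MODEL_MAX_IDS || t->used[id]) {
        return false;
    }
    t->used[id] = true;
    return true;
}

/* Models the proposed extension: once a collective is declared
 * complete, reset its fields so the id may be claimed again. */
void model_id_reset(model_id_table_t *t, int id)
{
    if (id >= 0 && id < MODEL_MAX_IDS) {
        t->used[id] = false;
    }
}
```

Without the reset step, a second barrier on the same id is rejected in this model, so each barrier in the cr code would need its own id. With the reset applied by every participant after completion, the same id could be claimed again.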
> On Jan 10, 2014, at 12:55 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> > On Jan 10, 2014, at 12:45 PM, Adrian Reber <adrian_at_[hidden]> wrote:
> >> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adrian_at_[hidden]> wrote:
> >>>> I am currently trying to understand how callbacks are working. Right now
> >>>> I am looking at orte/mca/rml/base/rml_base_receive.c
> >>>> orte_rml_base_comm_start() which does
> >>>> orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>> ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>> ORTE_RML_PERSISTENT,
> >>>> orte_rml_base_recv,
> >>>> NULL);
> >>>> As far as I understand it orte_rml_base_recv() is the callback function.
> >>>> At which point should this function run? When the data is actually
> >>>> received?
> >>> Not precisely. When data is received by the OOB, it pushes the data into an event. When that event gets serviced, it calls the orte_rml_base_receive function which processes the data to find the matching tag, and then uses that to execute the callback to the user code.
> >>>> The same goes for the send_buffer_nb() functions. I do not see the
> >>>> callback functions actually running. How can I verify that the
> >>>> callback functions are running? Especially for the send case it
> >>>> sounds pretty obvious how it should work, but I never see the
> >>>> callback function running, at least in my setup.
> >>> The data is not immediately sent. It gets pushed into an event. When that event gets serviced, it calls the orte_oob_base_send function which then passes the data to each active OOB component until one of them says it can send it. The data is then pushed into another event to get it into the event base for that component's active module - when that event gets serviced, the data is sent. Once the data is sent, an event is created that, when serviced, executes the callback to the user code.
> >>> If you aren't seeing callbacks, the most likely cause is that the orte progress thread isn't running. Without it, none of this will work.
> >> Thanks. Running configure without '--with-ft=cr' I can run a program and
> >> use orte-top. In orterun I can see that the callback is running and
> >> orte-top displays the retrieved information. I can also see in orte-top
> >> that the callbacks are working.
> > Actually, I'm rather impressed - I hadn't tested orte-top and didn't honestly know if it would work any more! Glad to hear it does :-)
> >> Doing the same with '--with-ft=cr'
> >> enabled, orte-top crashes, as does orte-checkpoint, and both (-top and
> >> -checkpoint) seem to no longer have working callbacks, which is
> >> probably why they are crashing. So some code which is enabled by
> >> '--with-ft=cr' seems to break callbacks in orte-top as well as in orte-checkpoint.
> >> orterun handles callbacks no matter if configured with or without
> >> '--with-ft=cr'.
> > I can take a look this weekend - probably something silly
> >> Adrian