On Jan 10, 2014, at 12:45 PM, Adrian Reber <adrian@lisas.de> wrote:

On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:

On Jan 10, 2014, at 8:02 AM, Adrian Reber <adrian@lisas.de> wrote:

I am currently trying to understand how callbacks work. Right now
I am looking at orte_rml_base_comm_start() in
orte/mca/rml/base/rml_base_receive.c, which does:

  orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
                          ORTE_RML_TAG_RML_INFO_UPDATE,
                          ORTE_RML_PERSISTENT,
                          orte_rml_base_recv,
                          NULL);

As far as I understand it orte_rml_base_recv() is the callback function.
At which point should this function run? When the data is actually
received?

Not precisely. When data is received by the OOB, it pushes the data into an event. When that event gets serviced, it calls the orte_rml_base_receive function, which processes the data to find the matching tag and then uses that to execute the callback to the user code.
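To make the receive path above concrete, here is a minimal sketch in plain C. It is a toy model, not ORTE code: every name in it (toy_post_recv, toy_data_arrived, toy_progress and so on) is hypothetical, and the fixed-size arrays stand in for the real event base. It only illustrates the ordering described above: arriving data is queued as an event, and the user callback runs later, when the event is serviced and matched by tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names are ORTE APIs. */
#define TOY_MAX_EVENTS 8
#define TOY_MAX_RECVS  4

typedef void (*toy_recv_cb)(int tag, const char *data, void *cbdata);

struct toy_posted_recv {
    int tag;
    int persistent;   /* stays posted after a match, like a persistent recv */
    int active;
    toy_recv_cb cb;
    void *cbdata;
};

struct toy_event { int tag; const char *data; };

static struct toy_posted_recv toy_recvs[TOY_MAX_RECVS];
static struct toy_event toy_queue[TOY_MAX_EVENTS];
static int toy_head, toy_tail;   /* simple linear queue, reset by the demo */

/* Register a callback for a tag (stands in for posting a non-blocking recv). */
static int toy_post_recv(int tag, int persistent, toy_recv_cb cb, void *cbdata)
{
    for (int i = 0; i < TOY_MAX_RECVS; i++) {
        if (!toy_recvs[i].active) {
            toy_recvs[i] = (struct toy_posted_recv){ tag, persistent, 1, cb, cbdata };
            return 0;
        }
    }
    return -1;
}

/* Data arrives: it is only queued as an event; no callback runs here. */
static int toy_data_arrived(int tag, const char *data)
{
    if (toy_tail == TOY_MAX_EVENTS) return -1;
    toy_queue[toy_tail++] = (struct toy_event){ tag, data };
    return 0;
}

/* Service pending events: match each one by tag, then run the callback. */
static int toy_progress(void)
{
    int delivered = 0;
    while (toy_head < toy_tail) {
        struct toy_event ev = toy_queue[toy_head++];
        for (int i = 0; i < TOY_MAX_RECVS; i++) {
            struct toy_posted_recv *r = &toy_recvs[i];
            if (r->active && r->tag == ev.tag) {
                if (!r->persistent) r->active = 0;
                r->cb(ev.tag, ev.data, r->cbdata);
                delivered++;
                break;
            }
        }
        /* Events with no matching tag are silently dropped in this toy. */
    }
    return delivered;
}

static void toy_count_cb(int tag, const char *data, void *cbdata)
{
    (void)tag; (void)data;
    ++*(int *)cbdata;
}

/* Returns how many times the callback ran once the events were serviced. */
int toy_recv_demo(void)
{
    int calls = 0;
    memset(toy_recvs, 0, sizeof toy_recvs);
    toy_head = toy_tail = 0;

    int rc = toy_post_recv(7, 1, toy_count_cb, &calls);
    assert(rc == 0);
    (void)rc;

    toy_data_arrived(7, "first");
    toy_data_arrived(9, "no receive posted for this tag");
    toy_data_arrived(7, "second");
    assert(calls == 0);          /* queued, but nothing serviced yet */

    toy_progress();
    return calls;
}
```

Calling toy_recv_demo() from a main() shows the point: the counter is still zero after the data "arrives" and only changes once toy_progress() services the queue. A counter or a printf in the callback is also a cheap way to check whether a callback ever runs.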


The same goes for the send_buffer_nb() functions. I do not see the callback
functions actually running. How can I verify that the callback functions
are running? For the send case especially, it sounds pretty obvious how
it should work, but I never see the callback function running, at least
in my setup.

The data is not immediately sent. It gets pushed into an event. When that event gets serviced, it calls the orte_oob_base_send function, which then passes the data to each active OOB component until one of them says it can send it. The data is then pushed into another event to get it into the event base for that component's active module; when that event gets serviced, the data is sent. Once the data is sent, an event is created that, when serviced, executes the callback to the user code.

If you aren't seeing callbacks, the most likely cause is that the orte progress thread isn't running. Without it, none of this will work.
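The send path and the progress requirement can be sketched the same way. Again this is a toy model in plain C with hypothetical names (sk_send_nb, sk_progress_one, the two "components"), not the actual OOB code. It assumes a single queue with three stages: pick a component, transmit, then complete. The point it illustrates is that the user callback only runs after three separate events have been serviced, so if nothing drives progress, the callback never fires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names are ORTE/OOB APIs. */
enum sk_stage { SK_ROUTE, SK_TRANSMIT, SK_COMPLETE };

typedef void (*sk_send_cb)(int status, void *cbdata);
typedef int (*sk_can_send)(int peer);

struct sk_event {
    enum sk_stage stage;
    int peer;
    const char *data;
    int component;      /* index of the component that accepted the message */
    sk_send_cb cb;
    void *cbdata;
};

#define SK_MAX_EVENTS 16
static struct sk_event sk_queue[SK_MAX_EVENTS];
static int sk_head, sk_tail;
static size_t sk_bytes_sent;

static int sk_push(struct sk_event ev)
{
    if (sk_tail == SK_MAX_EVENTS) return -1;
    sk_queue[sk_tail++] = ev;
    return 0;
}

/* Two hypothetical components; the first one only reaches peers below 10. */
static int sk_comp_a_can_send(int peer) { return peer < 10; }
static int sk_comp_b_can_send(int peer) { (void)peer; return 1; }
static const sk_can_send sk_components[] = { sk_comp_a_can_send, sk_comp_b_can_send };
#define SK_NUM_COMPONENTS 2

/* Non-blocking send: only queues an event and returns immediately. */
static int sk_send_nb(int peer, const char *data, sk_send_cb cb, void *cbdata)
{
    struct sk_event ev = { SK_ROUTE, peer, data, -1, cb, cbdata };
    return sk_push(ev);
}

/* Service exactly one event; returns 1 if one was serviced, 0 if idle. */
static int sk_progress_one(void)
{
    if (sk_head == sk_tail) return 0;
    struct sk_event ev = sk_queue[sk_head++];
    switch (ev.stage) {
    case SK_ROUTE:      /* offer the message to each component in turn */
        for (int i = 0; i < SK_NUM_COMPONENTS; i++) {
            if (sk_components[i](ev.peer)) {
                ev.component = i;
                ev.stage = SK_TRANSMIT;
                sk_push(ev);
                break;
            }
        }
        break;
    case SK_TRANSMIT:   /* "send" the data, then schedule the completion */
        sk_bytes_sent += strlen(ev.data);
        ev.stage = SK_COMPLETE;
        sk_push(ev);
        break;
    case SK_COMPLETE:   /* only now does the user callback run */
        ev.cb(0, ev.cbdata);
        break;
    }
    return 1;
}

static void sk_done_cb(int status, void *cbdata)
{
    (void)status;
    *(int *)cbdata = 1;
}

/* Returns 1 if the send callback ran after all three stages were serviced. */
int sk_send_demo(void)
{
    int done = 0;
    sk_head = sk_tail = 0;
    sk_bytes_sent = 0;

    int rc = sk_send_nb(42, "hello", sk_done_cb, &done);
    assert(rc == 0);
    (void)rc;
    assert(done == 0);                        /* nothing serviced yet */

    sk_progress_one();                        /* route: second component accepts */
    sk_progress_one();                        /* transmit */
    assert(done == 0 && sk_bytes_sent == 5);  /* sent, callback still pending */

    sk_progress_one();                        /* completion event */
    return done;
}
```

If the three sk_progress_one() calls are removed, sk_send_nb() still returns success but the callback never runs, which is the same symptom as a progress thread that is not running.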

Thanks. Running configure without '--with-ft=cr' I can run a program and
use orte-top. In orterun I can see that the callback is running and
orte-top displays the retrieved information. I can also see in orte-top
that the callbacks are working.

Actually, I'm rather impressed - I hadn't tested orte-top and didn't honestly know if it would work any more! Glad to hear it does :-)

Doing the same with '--with-ft=cr' enabled, orte-top crashes, and so does
orte-checkpoint. Both (-top and -checkpoint) seem to no longer have
working callbacks, which is probably why they are crashing. So some code
that is enabled by '--with-ft=cr' seems to break callbacks in orte-top as
well as in orte-checkpoint. orterun handles callbacks whether or not it is
configured with '--with-ft=cr'.

I can take a look this weekend - it's probably something silly.


Adrian
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel