Open MPI Development Mailing List Archives

Subject: [OMPI devel] Communication Failure with orted_comm.c
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-08 08:15:44


Hello all.

I've got a problem with the communication between
*v_protocol_receiver_component.c* and *orted_comm.c*.

In *mca_vprotocol_receiver_component_init* I've added a request that is
received correctly by *orte_daemon_process_commands*, but when I try to
reply to the sender I get the following error:

[clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
[clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad760db]
[clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad75aa4]
[clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2aaaae2d2fdd]
[clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2aaaaad42cb0]
[clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2aaaaad19ca6]
[clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2aaaaad18a55]
[clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad9710e]
[clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad974bb]
[clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2aaaaad972ad]
[clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2aaaaad97166]
[clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2aaaaad17556]
[clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
[clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
[clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
[clus1:15593] *** End of error message ***

The code that I've added to *v_protocol_receiver_component.c* is (in
bold, the recv command that fails):

int mca_vprotocol_receiver_request_protector(void) {
    orte_daemon_cmd_flag_t command;
    opal_buffer_t *buffer = NULL;
    int n = 1;

    command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;

    buffer = OBJ_NEW(opal_buffer_t);
    opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);

    orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);

    *orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);*
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);

    orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
    orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;

    OBJ_RELEASE(buffer);

    return OMPI_SUCCESS;
}

The code that I've added to *orted_comm.c* is (in bold, the send command
that fails):

case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
        if (orte_debug_daemons_flag) {
            opal_output(0, "%s orted_recv: received request protector from local proc %s",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(sender));
        }
        /* Define the protector */
        protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
        if (protector >= (uint32_t)orte_process_info.num_procs) {
            protector = 0;
        }

        /* Pack the protector data */
        answer = OBJ_NEW(opal_buffer_t);

        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (orte_debug_daemons_flag) {
            opal_output(0, "EL PROTECTOR ASIGNADO para %s ES: %d\n",
                        ORTE_NAME_PRINT(sender), protector);
        }

        /* Send the protector data */
        *if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {*
            *ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);*
            *ret = ORTE_ERR_COMM_FAILURE;*
            *OBJ_RELEASE(answer);*
            *goto CLEANUP;*
        }

        OBJ_RELEASE(answer);
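
As an aside, the "Define the protector" step above amounts to picking the next vpid in a ring, wrapping back to 0 after the last one. A minimal standalone sketch of that rule, assuming a hypothetical helper name next_protector (not part of ORTE) and plain integers in place of the ORTE name fields:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; it is not an ORTE function.
 * Returns the vpid that follows my_vpid in a ring of num_procs entries,
 * wrapping around to 0 once vpid + 1 reaches num_procs. */
static uint32_t next_protector(uint32_t my_vpid, uint32_t num_procs)
{
    assert(num_procs > 0);  /* a ring needs at least one member */

    uint32_t candidate = my_vpid + 1;
    return (candidate >= num_procs) ? 0 : candidate;
}
```

With num_procs equal to 4, vpid 3 maps to protector 0 and every other vpid maps to vpid + 1, which matches the if-check in the case above.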

From my testing, I assume the error is in the bolded section, maybe because
I'm missing some statement when I try to communicate, or maybe this
communication cannot be done at all. Any help will be appreciated.

Thanks a lot.

Hugo Meyer