Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: [OMPI devel] Communication Failure with orted_comm.c
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-08 08:15:44


Hello all.

I have a problem with the communication between
*v_protocol_receiver_component.c* and *orted_comm.c*.

In *mca_vprotocol_receiver_component_init* I added a request that is
received correctly by *orte_daemon_process_commands*, but when I try to
reply to the sender I get the following error:

[clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
[clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad760db]
[clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad75aa4]
[clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2aaaae2d2fdd]
[clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2aaaaad42cb0]
[clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2aaaaad19ca6]
[clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2aaaaad18a55]
[clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad9710e]
[clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad974bb]
[clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2aaaaad972ad]
[clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2aaaaad97166]
[clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2aaaaad17556]
[clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
[clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
[clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
[clus1:15593] *** End of error message ***

The code that I added to *v_protocol_receiver_component.c* is below (the
recv call that fails is in bold):

int mca_vprotocol_receiver_request_protector(void) {
    orte_daemon_cmd_flag_t command;
    opal_buffer_t *buffer = NULL;
    int n = 1;

    command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;

    buffer = OBJ_NEW(opal_buffer_t);
    opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);

    orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);

    *orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);*
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
    opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);

    orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
    orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;

    OBJ_RELEASE(buffer);

    return OMPI_SUCCESS;
}
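
The two unpack calls above assume that the reply carries two 32-bit values, the jobid first and then the vpid, and that they are read back in the same order they were packed. The following is only a minimal, self-contained sketch of that ordering assumption. `toy_buffer_t`, `toy_init`, `toy_pack_u32` and `toy_unpack_u32` are hypothetical names invented for illustration; they are not `opal_buffer_t` or any OPAL/ORTE API.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pack/unpack buffer; NOT opal_buffer_t. */
typedef struct {
    uint8_t data[64];
    size_t  wr;   /* next write offset */
    size_t  rd;   /* next read offset  */
} toy_buffer_t;

static void toy_init(toy_buffer_t *b)
{
    memset(b, 0, sizeof(*b));
}

/* Append one uint32_t; returns 0 on success, -1 if the buffer is full. */
static int toy_pack_u32(toy_buffer_t *b, uint32_t v)
{
    if (b->wr + sizeof(v) > sizeof(b->data)) {
        return -1;
    }
    memcpy(b->data + b->wr, &v, sizeof(v));
    b->wr += sizeof(v);
    return 0;
}

/* Read the next uint32_t in packing order; returns 0 on success,
 * -1 if no packed value is left to read. */
static int toy_unpack_u32(toy_buffer_t *b, uint32_t *out)
{
    if (b->rd + sizeof(*out) > b->wr) {
        return -1;
    }
    memcpy(out, b->data + b->rd, sizeof(*out));
    b->rd += sizeof(*out);
    return 0;
}
```

In this toy model, packing a jobid and then a vpid and unpacking twice returns them in that order, and a third unpack reports that nothing is left.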

The code that I added to *orted_comm.c* is below (the send call that fails
is in bold):

case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
        if (orte_debug_daemons_flag) {
            opal_output(0, "%s orted_recv: received request protector from local proc %s",
                        ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(sender));
        }
        /* Define the protector */
        protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
        if (protector >= (uint32_t)orte_process_info.num_procs) {
            protector = 0;
        }

        /* Pack the protector data */
        answer = OBJ_NEW(opal_buffer_t);

        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
            ORTE_ERROR_LOG(ret);
            OBJ_RELEASE(answer);
            goto CLEANUP;
        }
        if (orte_debug_daemons_flag) {
            /* Message is in Spanish: "The protector assigned to %s is: %d" */
            opal_output(0, "EL PROTECTOR ASIGNADO para %s ES: %d\n",
                        ORTE_NAME_PRINT(sender), protector);
        }

        /* Send the protector data */
        *if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {*
        *    ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);*
        *    ret = ORTE_ERR_COMM_FAILURE;*
        *    OBJ_RELEASE(answer);*
        *    goto CLEANUP;*
        }

        OBJ_RELEASE(answer);
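
The protector selection in the case above (take the next vpid, and wrap around to 0 after the last one) can be isolated as a small standalone function. This is only a sketch under that reading of the code; `next_protector` is a hypothetical helper name, not part of ORTE, and `num_procs` stands in for `orte_process_info.num_procs`.

```c
#include <stdint.h>

/* Hypothetical helper, NOT part of ORTE: return the vpid that follows
 * `my_vpid` in a ring of `num_procs` entries, wrapping back to 0 when
 * the successor would fall outside the ring. */
static uint32_t next_protector(uint32_t my_vpid, uint32_t num_procs)
{
    uint32_t candidate = my_vpid + 1;
    if (candidate >= num_procs) {
        candidate = 0;
    }
    return candidate;
}
```

With 4 entries, vpid 3 gets protector 0 and every other vpid gets its successor; with a single entry, vpid 0 is assigned itself.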

From my testing, I assume that the error is in the bolded section, perhaps
because I am missing some statement when I try to communicate, or perhaps
because this communication cannot be done at all. Any help will be appreciated.

Thanks a lot.

Hugo Meyer