Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Communication Failure with orted_comm.c
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-08 10:19:51


Yes, I set the value to 31, and it is not a duplicate.
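
The duplicate check discussed in this thread can be made mechanical rather than done by eye. The following is only a minimal sketch under stated assumptions: `cmd_flag_t`, `example_cmds`, and `has_duplicate` are hypothetical names, and the values in the table are illustrative, not the real contents of orte/mca/odls/odls_types.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a daemon command flag type (assumption). */
typedef uint8_t cmd_flag_t;

/* Illustrative values only; NOT copied from odls_types.h. */
static const cmd_flag_t example_cmds[] = { 1, 2, 3, 31 };

/* Return true if any value appears more than once in the table. */
static bool has_duplicate(const cmd_flag_t *values, size_t count)
{
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (values[i] == values[j]) {
                return true;
            }
        }
    }
    return false;
}
```

Running such a check over the full list of command values, with the new value appended, would confirm that 31 does not collide with an existing command.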

2011/3/8 Ralph Castain <rhc_at_[hidden]>

> What value did you set for this new command? Did you look at the cmds in
> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
>
>
> On Mar 8, 2011, at 6:15 AM, Hugo Meyer wrote:
>
> Hello all.
>
> I have a problem with communication between
> *v_protocol_receiver_component.c* and *orted_comm.c*.
>
> In *mca_vprotocol_receiver_component_init* I added a request that is
> received correctly by *orte_daemon_process_commands*, but when I try to
> reply to the sender I get the following error:
>
> [clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
> [clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad760db]
> [clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad75aa4]
> [clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2aaaae2d2fdd]
> [clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2aaaaad42cb0]
> [clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2aaaaad19ca6]
> [clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2aaaaad18a55]
> [clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad9710e]
> [clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad974bb]
> [clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2aaaaad972ad]
> [clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2aaaaad97166]
> [clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2aaaaad17556]
> [clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
> [clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
> [clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
> [clus1:15593] *** End of error message ***
>
>
> The code I added to *v_protocol_receiver_component.c* is below (the recv
> call that fails is highlighted between asterisks):
>
> int mca_vprotocol_receiver_request_protector(void) {
>     orte_daemon_cmd_flag_t command;
>     opal_buffer_t *buffer = NULL;
>     int n = 1;
>
>     command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
>
>     buffer = OBJ_NEW(opal_buffer_t);
>     opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
>
>     orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);
>
>     *orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);*
>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);
>
>     orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
>     orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;
>
>     OBJ_RELEASE(buffer);
>
>     return OMPI_SUCCESS;
> }
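
The quoted exchange relies on the receiver unpacking values in the same order and with the same type that the daemon packed them (jobid first, then the protector vpid, both as 32-bit unsigned integers). The sketch below illustrates that ordering rule with a toy byte buffer. It is not the opal_dss API: `toy_buffer_t`, `toy_pack_u32`, and `toy_unpack_u32` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy FIFO byte buffer; a hypothetical stand-in, not opal_buffer_t. */
typedef struct {
    uint8_t data[64];
    size_t  len;  /* bytes written so far */
    size_t  pos;  /* bytes consumed so far */
} toy_buffer_t;

/* Append one 32-bit value; returns 0 on success, -1 if the buffer is full. */
static int toy_pack_u32(toy_buffer_t *b, uint32_t value)
{
    assert(b != NULL);
    if (b->len + sizeof value > sizeof b->data) {
        return -1;
    }
    memcpy(b->data + b->len, &value, sizeof value);
    b->len += sizeof value;
    return 0;
}

/* Read the next 32-bit value in packing order; -1 if nothing is left. */
static int toy_unpack_u32(toy_buffer_t *b, uint32_t *out)
{
    assert(b != NULL && out != NULL);
    if (b->pos + sizeof *out > b->len) {
        return -1;
    }
    memcpy(out, b->data + b->pos, sizeof *out);
    b->pos += sizeof *out;
    return 0;
}
```

Packing a jobid and then a vpid, and unpacking them in that same order, returns the original values; a third unpack reports that the buffer is exhausted.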
>
> The code I added to *orted_comm.c* is below (the send call that fails is
> highlighted between asterisks):
>
> case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
>     if (orte_debug_daemons_flag) {
>         opal_output(0, "%s orted_recv: received request protector from local proc %s",
>                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>                     ORTE_NAME_PRINT(sender));
>     }
>     /* Define the protector */
>     protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
>     if (protector >= (uint32_t)orte_process_info.num_procs) {
>         protector = 0;
>     }
>
>     /* Pack the protector data */
>     answer = OBJ_NEW(opal_buffer_t);
>
>     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
>         ORTE_ERROR_LOG(ret);
>         OBJ_RELEASE(answer);
>         goto CLEANUP;
>     }
>     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
>         ORTE_ERROR_LOG(ret);
>         OBJ_RELEASE(answer);
>         goto CLEANUP;
>     }
>     if (orte_debug_daemons_flag) {
>         opal_output(0, "EL PROTECTOR ASIGNADO para %s ES: %d\n",
>                     ORTE_NAME_PRINT(sender), protector);
>     }
>
>     /* Send the protector data */
>     *if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {*
>     *    ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);*
>     *    ret = ORTE_ERR_COMM_FAILURE;*
>     *    OBJ_RELEASE(answer);*
>     *    goto CLEANUP;*
>     }
>
>     OBJ_RELEASE(answer);
>
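
The "Define the protector" step in the quoted snippet is a ring-successor rule: the protector is the next vpid, wrapping back to 0 past the last one. A minimal standalone sketch of just that arithmetic follows; `next_protector` is a hypothetical helper name, and plain integers stand in for the ORTE name and process-info fields.

```c
#include <assert.h>
#include <stdint.h>

/* Ring successor: vpid + 1, wrapping to 0 once it reaches num_procs.
 * Hypothetical helper mirroring the arithmetic in the quoted snippet. */
static uint32_t next_protector(uint32_t my_vpid, uint32_t num_procs)
{
    assert(num_procs > 0);
    uint32_t protector = my_vpid + 1;
    if (protector >= num_procs) {
        protector = 0;
    }
    return protector;
}
```

Note that with a single process the rule assigns vpid 0 as its own protector, which may or may not be the intended behavior.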
>
> From my testing, I believe the error is in the highlighted section, either
> because I am missing a step when setting up the communication or because
> this kind of communication cannot be done. Any help would be appreciated.
>
> Thanks a lot.
>
> Hugo Meyer
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel