
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Communication Failure with orted_comm.c
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-08 10:45:55


Yes, after the release there is a break. I'm sending all my output now; maybe
that helps more. But the code is basically the one I sent. The normal execution
reaches the send/receive between orted_comm and the receiver.

Best regards.

Hugo

2011/3/8 Ralph Castain <rhc_at_[hidden]>

> The comm can most certainly be done - there are other sections of that code
> that also send messages. I can't see the end of your new code section, but I
> assume you ended it properly with a "break"? Otherwise, you'll execute
> whatever lies below it as well.
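As a minimal, self-contained illustration of the fall-through Ralph is warning
about (plain C, not Open MPI code; the command values here are made up):

    #include <stdio.h>

    int main(void)
    {
        int command = 1;

        switch (command) {
        case 1:
            printf("handling command 1\n");
            /* no break here, so execution falls through ... */
        case 2:
            printf("... and command 2's handler runs as well\n");
            break;
        }
        return 0;
    }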
>
>
> On Mar 8, 2011, at 8:19 AM, Hugo Meyer wrote:
>
> Yes, I set the value 31 and it is not duplicated.
>
>
> 2011/3/8 Ralph Castain <rhc_at_[hidden]>
>
>> What value did you set for this new command? Did you look at the cmds in
>> orte/mca/odls/odls_types.h to ensure you weren't using a duplicate value?
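The daemon commands in orte/mca/odls/odls_types.h are plain #define'd values of
a small integer type, so a new command only needs a value that no existing entry
uses. A hypothetical sketch of what the addition would look like (the existing
names and numeric values below are illustrative; check the actual header in your
tree, and note the value 31 Hugo mentions above):

    #include <stdint.h>

    /* illustrative excerpt in the style of orte/mca/odls/odls_types.h;
     * real entries and their values differ between Open MPI versions */
    typedef uint8_t orte_daemon_cmd_flag_t;

    #define ORTE_DAEMON_KILL_LOCAL_PROCS      (orte_daemon_cmd_flag_t) 1
    #define ORTE_DAEMON_EXIT_CMD              (orte_daemon_cmd_flag_t) 3
    /* ... existing commands ... */

    /* new command -- must not collide with any value above */
    #define ORTE_DAEMON_REQUEST_PROTECTOR_CMD (orte_daemon_cmd_flag_t) 31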
>>
>>
>> On Mar 8, 2011, at 6:15 AM, Hugo Meyer wrote:
>>
>> Hello @ll.
>>
>> I've got a problem with a communication between *v_protocol_receiver_component.c*
>> and *orted_comm.c*.
>>
>> In *mca_vprotocol_receiver_component_init* I've added a request that
>> is received correctly by *orte_daemon_process_commands*, but when I
>> try to reply to the sender I get the following error:
>>
>> [clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
>> [clus1:15593] [ 1]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0
>> [0x2aaaaad760db]
>> [clus1:15593] [ 2]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0
>> [0x2aaaaad75aa4]
>> [clus1:15593] [ 3]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so
>> [0x2aaaae2d2fdd]
>> [clus1:15593] [ 4]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da)
>> [0x2aaaaad42cb0]
>> [clus1:15593] [ 5]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068)
>> [0x2aaaaad19ca6]
>> [clus1:15593] [ 6]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b)
>> [0x2aaaaad18a55]
>> [clus1:15593] [ 7]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0
>> [0x2aaaaad9710e]
>> [clus1:15593] [ 8]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0
>> [0x2aaaaad974bb]
>> [clus1:15593] [ 9]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a)
>> [0x2aaaaad972ad]
>> [clus1:15593] [10]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe)
>> [0x2aaaaad97166]
>> [clus1:15593] [11]
>> /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322)
>> [0x2aaaaad17556]
>> [clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted
>> [0x4008a3]
>> [clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x2aaaabd2d8a4]
>> [clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted
>> [0x400799]
>> [clus1:15593] *** End of error message ***
>>
>>
>> The code I've added in *v_protocol_receiver_component.c* is (the recv
>> call that fails is marked in bold):
>>
>> int mca_vprotocol_receiver_request_protector(void) {
>>     orte_daemon_cmd_flag_t command;
>>     opal_buffer_t *buffer = NULL;
>>     int n = 1;
>>
>>     command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
>>
>>     /* pack the command and send the request to our daemon */
>>     buffer = OBJ_NEW(opal_buffer_t);
>>     opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
>>
>>     orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);
>>
>>     /* wait for the daemon's reply with the protector name */
>>     *orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer,
>>                           ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);*
>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n,
>>                     OPAL_UINT32);
>>     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n,
>>                     OPAL_UINT32);
>>
>>     orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
>>     orte_process_info.protector.vpid  = mca_vprotocol_receiver.protector.vpid;
>>
>>     OBJ_RELEASE(buffer);
>>
>>     return OMPI_SUCCESS;
>> }
>>
>>
>> The code I've added in *orted_comm.c* is (the send call that fails is
>> marked in bold):
>>
>>     case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
>>         if (orte_debug_daemons_flag) {
>>             opal_output(0, "%s orted_recv: received request protector from local proc %s",
>>                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
>>                         ORTE_NAME_PRINT(sender));
>>         }
>>         /* Define the protector: the next vpid, wrapping around to 0 */
>>         protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
>>         if (protector >= (uint32_t)orte_process_info.num_procs) {
>>             protector = 0;
>>         }
>>
>>         /* Pack the protector data */
>>         answer = OBJ_NEW(opal_buffer_t);
>>
>>         if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid,
>>                                                  1, OPAL_UINT32))) {
>>             ORTE_ERROR_LOG(ret);
>>             OBJ_RELEASE(answer);
>>             goto CLEANUP;
>>         }
>>         if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1,
>>                                                  OPAL_UINT32))) {
>>             ORTE_ERROR_LOG(ret);
>>             OBJ_RELEASE(answer);
>>             goto CLEANUP;
>>         }
>>         if (orte_debug_daemons_flag) {
>>             opal_output(0, "THE ASSIGNED PROTECTOR for %s IS: %d\n",
>>                         ORTE_NAME_PRINT(sender), protector);
>>         }
>>
>>         /* Send the protector data back to the requester */
>>         *if (0 > orte_rml.send_buffer(sender, answer,
>>                                       ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {*
>>         *    ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);*
>>         *    ret = ORTE_ERR_COMM_FAILURE;*
>>         *    OBJ_RELEASE(answer);*
>>         *    goto CLEANUP;*
>>         }
>>
>>         OBJ_RELEASE(answer);
>>
>>
>> From my tests I assume the error is in the bolded section, maybe because
>> I'm missing some statement when I try to communicate, or maybe this
>> communication cannot be done. Any help will be appreciated.
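One way to narrow down where the exchange breaks is to check the return codes on
the component side too, mirroring the check the daemon-side snippet already does
on its send. A sketch, under the assumption that orte_rml.send_buffer and
orte_rml.recv_buffer return a negative ORTE error code on failure:

    int rc;

    /* did the request actually leave the component? */
    if (0 > (rc = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer,
                                       ORTE_RML_TAG_DAEMON, 0))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        return rc;
    }

    /* did the daemon's reply ever arrive? */
    if (0 > (rc = orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer,
                                       ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(buffer);
        return rc;
    }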
>>
>> Thanks a lot.
>>
>> Hugo Meyer
>>




  • application/octet-stream attachment: output1