
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Communication Failure with orted_comm.c
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-03-09 08:08:33


Your suggestion worked, Ralph.

I only added:

OBJ_RELEASE(buffer);
buffer = OBJ_NEW(opal_buffer_t);
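
For readers of the archive: in the quoted code below, these two lines would sit between the orte_rml.send_buffer call and the orte_rml.recv_buffer call, so that the reply is received into an empty buffer. The sketch that follows is a toy stand-in and not Open MPI code. The names toy_buffer_t, toy_pack, toy_unpack, toy_recv_reply and first_value_after_recv are hypothetical, and the assumption that a receive appends to whatever the buffer already holds is made only for illustration. It shows the general hazard of reusing a request buffer that still contains packed data; the thread does not establish that this is exactly what went wrong inside the orted.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a pack/unpack buffer. NOT Open MPI's opal_buffer_t. */
typedef struct {
    uint32_t data[8];
    size_t   count;   /* number of values packed so far */
    size_t   cursor;  /* index of the next value to unpack */
} toy_buffer_t;

/* Analogue of releasing the buffer and allocating a new, empty one. */
static void toy_reset(toy_buffer_t *b) { memset(b, 0, sizeof(*b)); }

static int toy_pack(toy_buffer_t *b, uint32_t v) {
    if (b->count >= 8) return -1;
    b->data[b->count++] = v;
    return 0;
}

static int toy_unpack(toy_buffer_t *b, uint32_t *out) {
    if (b->cursor >= b->count) return -1;
    *out = b->data[b->cursor++];
    return 0;
}

/* Assumption for illustration: the receive appends the reply (jobid, vpid)
 * to whatever the buffer already contains. */
static void toy_recv_reply(toy_buffer_t *b, uint32_t jobid, uint32_t vpid) {
    toy_pack(b, jobid);
    toy_pack(b, vpid);
}

/* Returns the first value unpacked after the simulated receive. */
uint32_t first_value_after_recv(int use_fresh_buffer) {
    toy_buffer_t buf;
    uint32_t first = 0;

    toy_reset(&buf);
    toy_pack(&buf, 99u);            /* the request command (arbitrary value) */
    /* ... the request would be sent here ... */
    if (use_fresh_buffer) {
        toy_reset(&buf);            /* analogue of OBJ_RELEASE + OBJ_NEW */
    }
    toy_recv_reply(&buf, 1234u, 5u);
    toy_unpack(&buf, &first);
    return first;
}
```

In this toy model, first_value_after_recv(0) returns the stale command value 99, while first_value_after_recv(1) returns the reply's first field, 1234.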

Thank you both for your help.

Hugo

2011/3/8 George Bosilca <bosilca_at_[hidden]>

> The stack trace indicates that your orted segfaulted in
> orte_odls_base_notify_iof_complete, which means it received a message that
> was interpreted as an ORTE_DAEMON_IOF_COMPLETE (21). Unfortunately, there is
> nothing more to be learned from your output.
>
> george.
>
> On Mar 8, 2011, at 08:15 , Hugo Meyer wrote:
>
> > Hello all.
> >
> > I've got a problem with communication between
> > v_protocol_receiver_component.c and orted_comm.c.
> >
> > In mca_vprotocol_receiver_component_init I've added a request that
> > is received correctly by orte_daemon_process_commands, but when I try to
> > reply to the sender I get the following error:
> >
> > [clus1:15593] [ 0] /lib64/libpthread.so.0 [0x2aaaabb03d40]
> > [clus1:15593] [ 1] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad760db]
> > [clus1:15593] [ 2] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad75aa4]
> > [clus1:15593] [ 3] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/openmpi/mca_errmgr_orted.so [0x2aaaae2d2fdd]
> > [clus1:15593] [ 4] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_odls_base_notify_iof_complete+0x1da) [0x2aaaaad42cb0]
> > [clus1:15593] [ 5] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_process_commands+0x1068) [0x2aaaaad19ca6]
> > [clus1:15593] [ 6] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon_cmd_processor+0x81b) [0x2aaaaad18a55]
> > [clus1:15593] [ 7] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad9710e]
> > [clus1:15593] [ 8] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0 [0x2aaaaad974bb]
> > [clus1:15593] [ 9] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_loop+0x1a) [0x2aaaaad972ad]
> > [clus1:15593] [10] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(opal_event_dispatch+0xe) [0x2aaaaad97166]
> > [clus1:15593] [11] /home/hmeyer/desarrollo/radic-ompi/binarios/lib/libopen-rte.so.0(orte_daemon+0x2322) [0x2aaaaad17556]
> > [clus1:15593] [12] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x4008a3]
> > [clus1:15593] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaabd2d8a4]
> > [clus1:15593] [14] /home/hmeyer/desarrollo/radic-ompi/binarios/bin/orted [0x400799]
> > [clus1:15593] *** End of error message ***
> >
> > The code that I've added to v_protocol_receiver_component.c is (the
> > orte_rml.recv_buffer call is the one that fails):
> >
> > int mca_vprotocol_receiver_request_protector(void) {
> >     orte_daemon_cmd_flag_t command;
> >     opal_buffer_t *buffer = NULL;
> >     int n = 1;
> >
> >     command = ORTE_DAEMON_REQUEST_PROTECTOR_CMD;
> >
> >     /* Pack the request command and send it to the local daemon */
> >     buffer = OBJ_NEW(opal_buffer_t);
> >     opal_dss.pack(buffer, &command, 1, ORTE_DAEMON_CMD);
> >
> >     orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_RML_TAG_DAEMON, 0);
> >
> >     /* This receive is the call that fails */
> >     orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, buffer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0);
> >     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.jobid, &n, OPAL_UINT32);
> >     opal_dss.unpack(buffer, &mca_vprotocol_receiver.protector.vpid, &n, OPAL_UINT32);
> >
> >     orte_process_info.protector.jobid = mca_vprotocol_receiver.protector.jobid;
> >     orte_process_info.protector.vpid = mca_vprotocol_receiver.protector.vpid;
> >
> >     OBJ_RELEASE(buffer);
> >
> >     return OMPI_SUCCESS;
> > }
> >
> > The code that I've added to orted_comm.c is (the orte_rml.send_buffer
> > call is the one that fails):
> >
> > case ORTE_DAEMON_REQUEST_PROTECTOR_CMD:
> >     if (orte_debug_daemons_flag) {
> >         opal_output(0, "%s orted_recv: received request protector from local proc %s",
> >                     ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(sender));
> >     }
> >     /* Define the protector */
> >     protector = (uint32_t)ORTE_PROC_MY_NAME->vpid + 1;
> >     if (protector >= (uint32_t)orte_process_info.num_procs) {
> >         protector = 0;
> >     }
> >
> >     /* Pack the protector data */
> >     answer = OBJ_NEW(opal_buffer_t);
> >
> >     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &ORTE_PROC_MY_NAME->jobid, 1, OPAL_UINT32))) {
> >         ORTE_ERROR_LOG(ret);
> >         OBJ_RELEASE(answer);
> >         goto CLEANUP;
> >     }
> >     if (ORTE_SUCCESS != (ret = opal_dss.pack(answer, &protector, 1, OPAL_UINT32))) {
> >         ORTE_ERROR_LOG(ret);
> >         OBJ_RELEASE(answer);
> >         goto CLEANUP;
> >     }
> >     if (orte_debug_daemons_flag) {
> >         opal_output(0, "EL PROTECTOR ASIGNADO para %s ES: %d\n",
> >                     ORTE_NAME_PRINT(sender), protector);
> >     }
> >
> >     /* Send the protector data */
> >     if (0 > orte_rml.send_buffer(sender, answer, ORTE_DAEMON_REQUEST_PROTECTOR_CMD, 0)) {
> >         ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
> >         ret = ORTE_ERR_COMM_FAILURE;
> >         OBJ_RELEASE(answer);
> >         goto CLEANUP;
> >     }
> >     OBJ_RELEASE(answer);
> >
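
The "Define the protector" step in the snippet above picks the next vpid in a ring and wraps back to 0 after the last one. A minimal standalone sketch of that rule follows. The function name next_protector is hypothetical and not part of Open MPI; it takes the vpid and process count as plain integers instead of reading ORTE_PROC_MY_NAME and orte_process_info.

```c
#include <stdint.h>

/* Hypothetical helper: successor of `vpid` in a ring of `num_procs`
 * entries, mirroring the wrap-around rule in the quoted daemon code. */
uint32_t next_protector(uint32_t vpid, uint32_t num_procs) {
    uint32_t protector = vpid + 1;
    if (protector >= num_procs) {
        protector = 0;   /* the last vpid is protected by vpid 0 */
    }
    return protector;
}
```

With four entries, vpid 0 maps to 1 and vpid 3 wraps to 0; with a single entry, vpid 0 maps to itself.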
> > From testing, I believe the error is in the send and receive calls
> > identified above, perhaps because I am missing some statement when I try
> > to communicate, or perhaps this communication cannot be done at all. Any
> > help will be appreciated.
> >
> > Thanks a lot.
> >
> > Hugo Meyer
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> "I disapprove of what you say, but I will defend to the death your right to
> say it"
> -- Evelyn Beatrice Hall