
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Signals
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-03-17 10:15:22


On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
> Wow... orte_plm.signal_job points to zero. Is that correct from the PML's
> point of view?
It might be because PLMs are really only used at launch time, not in MPI
processes. Note plm != pml.

--td
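
For illustration, a minimal sketch of guarding the call whose pointer turns out
to be zero here. This is not code from the thread: the wrapper name is made up,
and the headers assume the standard OPAL/ORTE layout of that era.

#include <signal.h>
#include "opal/util/output.h"
#include "orte/constants.h"
#include "orte/mca/plm/plm.h"

/* In an MPI application process the PLM framework is normally not opened,
 * so the orte_plm module's function pointers can legitimately be NULL --
 * which matches "orte_plm.signal_job points to zero" and the jump to
 * address 0x0 in the backtraces quoted below. */
static int signal_event_logger_job(orte_jobid_t jobid)
{
    if (NULL == orte_plm.signal_job) {
        opal_output(0, "plm signal_job is not available in this process");
        return ORTE_ERROR;
    }
    return orte_plm.signal_job(jobid, SIGUSR1);
}
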
>
> Leonardo
>
> On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:
>
>> To clarify a little bit more: I'm calling orte_plm.signal_job from a
>> PML component. I know that ORTE is below OMPI, but I suspect this
>> function may not be available, or something like that. I can't
>> figure out where this snprintf comes from either; in my code there is only
>>
>> opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
>>             SIGUSR1, ORTE_NAME_PRINT(&el_proc));
>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>
>> And the first output/printf works fine. Well... I used gdb to run the
>> program, and I can see this:
>>
>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>> Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
>> 0x0000000000000000 in ?? ()
>> (gdb) backtrace
>> #0 0x0000000000000000 in ?? ()
>> #1 0x000000010065c319 in vprotocol_receiver_eventlog_connect
>> (el_comm=0x10065d178) at
>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
>> #2 0x000000010065ba9a in mca_vprotocol_receiver_send
>> (buf=0x100500000, count=262144, datatype=0x100263d60, dst=1, tag=1,
>> sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at
>> ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
>> #3 0x0000000100077d44 in MPI_Send ()
>> #4 0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45
>>
>> Line 67 of vprotocol_receiver_eventlog.c is the
>> orte_plm.signal_job call. After that, zeros and question marks... is the
>> signal_job function even available at this point? I really don't understand
>> what all those zeros mean.
>>
>> Leonardo
>>
>> On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:
>>
>>> Thanks for clarifying - guess I won't chew just yet. :-)
>>>
>>> I still don't see in your trace where it is failing in signal_job. I
>>> didn't see the message indicating it was sending the signal cmd out
>>> in your prior debug output, and there isn't a printf in that code
>>> loop other than the debug output. Can you attach to the process and
>>> get more info?
>>>
>>> On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:
>>>
>>>> Ralph, don't swallow your message yet... The two jobs are not running
>>>> under the same mpirun. There are two instances of mpirun: one runs
>>>> with "-report-uri ../contact.txt" and the other receives
>>>> its contact info via "-ompi-server file:../contact.txt". And yes,
>>>> both processes are running with plm_base_verbose activated. When I
>>>> deactivate plm_base_verbose the error is practically the same:
>>>>
>>>> [aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger
>>>> <[[47640,1],0]>
>>>> [aopclf:54106] *** Process received signal ***
>>>> [aopclf:54106] Signal: Segmentation fault (11)
>>>> [aopclf:54106] Signal code: Address not mapped (1)
>>>> [aopclf:54106] Failing at address: 0x0
>>>> [aopclf:54106] [ 0] 2 libSystem.B.dylib
>>>> 0x00007fff83a6eeaa _sigtramp + 26
>>>> [aopclf:54106] [ 1] 3 libSystem.B.dylib
>>>> 0x00007fff83a210b7 snprintf + 496
>>>> [aopclf:54106] [ 2] 4 mca_vprotocol_receiver.so
>>>> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>> [aopclf:54106] [ 3] 5 libmpi.0.dylib
>>>> 0x0000000100077d44 MPI_Send + 734
>>>> [aopclf:54106] [ 4] 6 ping
>>>> 0x0000000100000a97 main + 431
>>>> [aopclf:54106] [ 5] 7 ping
>>>> 0x00000001000008e0 start + 52
>>>> [aopclf:54106] *** End of error message ***
>>>>
>>>> Leonardo
>>>>
>>>> On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:
>>>>
>>>>> I'm going to have to eat my last message. It slipped past me that
>>>>> your other job was started via comm_spawn. Since both "jobs" are
>>>>> running under the same mpirun, there shouldn't be a problem
>>>>> sending a signal between them.
>>>>>
>>>>> I don't know why this would be crashing. Are you sure it is
>>>>> crashing in signal_job? Your trace indicates it is crashing in a
>>>>> print statement, yet there is no print statement in signal_job. Or
>>>>> did you run this with plm_base_verbose set so that the verbose
>>>>> prints are trying to run (could be we have a bug in one of them)?
>>>>>
>>>>> On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:
>>>>>
>>>>>> Well, thank you anyway :)
>>>>>>
>>>>>> On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Yeah, that probably won't work. The current code isn't intended
>>>>>>> to cross jobs like that - I'm sure nobody ever tested it for
>>>>>>> that idea, and I'm pretty sure it won't support it.
>>>>>>>
>>>>>>> I don't currently know any way to do what you are trying to do.
>>>>>>> We could extend the signal code to handle it, I would
>>>>>>> think...but I'm not sure how soon that might happen.
>>>>>>>
>>>>>>>
>>>>>>> On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:
>>>>>>>
>>>>>>>> Yes... but something wrong is going on... maybe the problem is
>>>>>>>> that the target jobid is different from my own process's jobid, I don't know.
>>>>>>>>
>>>>>>>> I'm trying to send a signal to another process running under
>>>>>>>> another job. The other process jumps into an accept/connect on
>>>>>>>> the MPI comm. So I wrote code like this (I removed the verification
>>>>>>>> code and comments; this is just a summary of the happy path):
>>>>>>>>
>>>>>>>> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
>>>>>>>> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
>>>>>>>> ompi_dpm.route_to_port(hnp_uri, &el_proc);
>>>>>>>> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
>>>>>>>> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>>>>>>>>
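
The same five calls again, with hedged comments based only on the function
names and the debug output quoted below; the comments are not from the
original mail.

/* split the published port string into the HNP URI, the RML URI and a tag */
ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
/* resolve the RML URI into the event logger's orte_process_name_t */
orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
/* make the remote HNP reachable so messages can be routed to el_proc */
ompi_dpm.route_to_port(hnp_uri, &el_proc);
/* ask the PLM to deliver SIGUSR1 to the event logger's job -- the call
 * that jumps through the NULL orte_plm.signal_job pointer here */
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
/* finally, complete the MPI-level connect/accept on that port */
ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
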
>>>>>>>> el_proc is defined as an orte_process_name_t, not a pointer to
>>>>>>>> one. And signal.h has been included for SIGUSR1's sake. But
>>>>>>>> when the code enters the signal_job function it crashes. I'm
>>>>>>>> trying to debug it right now... the crash is the following:
>>>>>>>>
>>>>>>>> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
>>>>>>>> [Fialho-2.local:51377] receiver: found port
>>>>>>>> <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
>>>>>>>> [Fialho-2.local:51377] receiver: HNP URI
>>>>>>>> <784793600.0;tcp://192.168.1.200:54071>, RML URI
>>>>>>>> <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
>>>>>>>> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC
>>>>>>>> Event Logger <[[11975,1],0]>
>>>>>>>> [Fialho-2:51377] *** Process received signal ***
>>>>>>>> [Fialho-2:51377] Signal: Segmentation fault (11)
>>>>>>>> [Fialho-2:51377] Signal code: Address not mapped (1)
>>>>>>>> [Fialho-2:51377] Failing at address: 0x0
>>>>>>>> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib
>>>>>>>> 0x00007fff83a6eeaa _sigtramp + 26
>>>>>>>> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib
>>>>>>>> 0x00007fff83a210b7 snprintf + 496
>>>>>>>> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so
>>>>>>>> 0x000000010065ba0a mca_vprotocol_receiver_send + 177
>>>>>>>> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib
>>>>>>>> 0x0000000100077d44 MPI_Send + 734
>>>>>>>> [Fialho-2:51377] [ 4] 6 ping
>>>>>>>> 0x0000000100000a97 main + 431
>>>>>>>> [Fialho-2:51377] [ 5] 7 ping
>>>>>>>> 0x00000001000008e0 start + 52
>>>>>>>> [Fialho-2:51377] [ 6] 8 ???
>>>>>>>> 0x0000000000000003 0x0 + 3
>>>>>>>> [Fialho-2:51377] *** End of error message ***
>>>>>>>>
>>>>>>>> Except for the signal_job call the code works; I have tested
>>>>>>>> it by forcing an accept on the other process and skipping the
>>>>>>>> signal_job. But I want to send the signal to wake up the other
>>>>>>>> side and to be able to manage multiple connect/accepts.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Leonardo
>>>>>>>>
>>>>>>>> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Sure! So long as you add the include, you are okay as the ORTE
>>>>>>>>> layer is "below" the OMPI one.
>>>>>>>>>
>>>>>>>>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Ralph, one last question... is orte_plm.signal_job
>>>>>>>>>> exposed/available to be called from a PML component? Yes, I
>>>>>>>>>> have the orte/mca/plm/plm.h include line.
>>>>>>>>>>
>>>>>>>>>> Leonardo
>>>>>>>>>>
>>>>>>>>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>>>>>>>>
>>>>>>>>>>> It's just the orte_process_name_t jobid field. So if you
>>>>>>>>>>> have an orte_process_name_t *pname, then it would just be
>>>>>>>>>>>
>>>>>>>>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>>>>>>>>
>>>>>>>>>>>
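
A minimal sketch of what that looks like in context; the struct layout in the
comment is spelled out only for clarity and is the usual ORTE definition, not
something quoted from this thread.

/* orte_process_name_t is essentially a pair of identifiers, roughly:
 *
 *     typedef struct {
 *         orte_jobid_t jobid;   -- the job the process belongs to
 *         orte_vpid_t  vpid;    -- the process' rank within that job
 *     } orte_process_name_t;
 *
 * so the jobid is read straight from the field: */
orte_process_name_t el_proc;
/* ... el_proc filled in, e.g. by orte_rml_base_parse_uris() ... */
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
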
>>>>>>>>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmm... and to signal a job the function is probably
>>>>>>>>>>>> orte_plm.signal_job(jobid, signal); right?
>>>>>>>>>>>>
>>>>>>>>>>>> Now my dumb question is: how do I obtain the jobid part from
>>>>>>>>>>>> an orte_process_name_t variable? Is there some magic function
>>>>>>>>>>>> in name_fns.h?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Leonardo
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Afraid not - you can signal a job, but not a specific
>>>>>>>>>>>>> process. We used to have such an API, but nobody ever used
>>>>>>>>>>>>> it. Easy to restore if someone has a need.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any function in Open MPI's frameworks to send a
>>>>>>>>>>>>>> signal to another ORTE proc?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, the ORTE process [[1234,1],1] wants to send
>>>>>>>>>>>>>> a signal to process [[1234,1],4] located on another node. I'm
>>>>>>>>>>>>>> looking for this kind of function but I have only found
>>>>>>>>>>>>>> functions to send a signal to all procs on a node.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Leonardo
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel