Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Signals
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-03-16 20:54:44


Yeah, that probably won't work. The current code isn't intended to cross jobs like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure it won't support it.

I don't currently know any way to do what you are trying to do. We could extend the signal code to handle it, I would think...but I'm not sure how soon that might happen.

On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:

> Yes... but something is going wrong... maybe the problem is that the jobid is different from the process's own jobid, I don't know.
>
> I'm trying to send a signal to another process running under a different job. The other process then jumps into a connect_accept on the MPI comm. So I wrote code like this (I removed the verification code and comments; this is just a summary of the happy path):
>
> ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
> orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
> ompi_dpm.route_to_port(hnp_uri, &el_proc);
> orte_plm.signal_job(el_proc.jobid, SIGUSR1);
> ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);
>
> el_proc is declared as an orte_process_name_t, not a pointer to one, and signal.h is included for SIGUSR1's sake. But when the code enters the signal_job function, it crashes. I'm trying to debug it right now... the crash is the following:
>
> [Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
> [Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
> [Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
> [Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
> [Fialho-2:51377] *** Process received signal ***
> [Fialho-2:51377] Signal: Segmentation fault (11)
> [Fialho-2:51377] Signal code: Address not mapped (1)
> [Fialho-2:51377] Failing at address: 0x0
> [Fialho-2:51377] [ 0] 2 libSystem.B.dylib 0x00007fff83a6eeaa _sigtramp + 26
> [Fialho-2:51377] [ 1] 3 libSystem.B.dylib 0x00007fff83a210b7 snprintf + 496
> [Fialho-2:51377] [ 2] 4 mca_vprotocol_receiver.so 0x000000010065ba0a mca_vprotocol_receiver_send + 177
> [Fialho-2:51377] [ 3] 5 libmpi.0.dylib 0x0000000100077d44 MPI_Send + 734
> [Fialho-2:51377] [ 4] 6 ping 0x0000000100000a97 main + 431
> [Fialho-2:51377] [ 5] 7 ping 0x00000001000008e0 start + 52
> [Fialho-2:51377] [ 6] 8 ??? 0x0000000000000003 0x0 + 3
> [Fialho-2:51377] *** End of error message ***
>
> Apart from the signal_job call, the code works: I have tested it by forcing an accept on the other process and skipping signal_job. But I want to send the signal to wake up the other side and to be able to manage multiple connect/accept operations.
>
> Thanks,
> Leonardo
>
> On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:
>
>> Sure! So long as you add the include, you are okay as the ORTE layer is "below" the OMPI one.
>>
>> On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:
>>
>>> Thanks Ralph, one last question... is orte_plm.signal_job exposed/available to be called by a PML component? Yes, I have the orte/mca/plm/plm.h include line.
>>>
>>> Leonardo
>>>
>>> On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:
>>>
>>>> It's just the orte_process_name_t jobid field. So if you have an orte_process_name_t *pname, then it would just be
>>>>
>>>> orte_plm.signal_job(pname->jobid, sig)
>>>>
>>>>
>>>> On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:
>>>>
>>>>> Hmm... and to signal a job, the function is probably orte_plm.signal_job(jobid, signal), right?
>>>>>
>>>>> Now my naive question is how to obtain the jobid part from an orte_process_name_t variable. Is there any magic function in name_fns.h?
>>>>>
>>>>> Thanks,
>>>>> Leonardo
>>>>>
>>>>> On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:
>>>>>
>>>>>> Afraid not - you can signal a job, but not a specific process. We used to have such an API, but nobody ever used it. Easy to restore if someone has a need.
>>>>>>
>>>>>> On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there any function in Open MPI's frameworks to send a signal to another ORTE proc?
>>>>>>>
>>>>>>> For example, the ORTE process [[1234,1],1] wants to send a signal to process [[1234,1],4], located on another node. I'm looking for this kind of function, but I have only found functions that send a signal to all procs on a node.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Leonardo
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel