On 03/17/2010 10:10 AM, Leonardo Fialho wrote:
Wow... orte_plm.signal_job points to zero. Is it correct from the PML point of view?
It might be because PLMs are really only used at launch time, not in MPI processes.  Note plm != pml.

--td

Leonardo

On Mar 17, 2010, at 2:52 PM, Leonardo Fialho wrote:

To clarify a little bit more: I'm calling orte_plm.signal_job from a PML component. I know that ORTE is below OMPI, but I think this function might not be available, or something like that. I can't figure out where this snprintf is either; in my code there is only

    opal_output(0, "receiver: sending SIGUSR1 <%d> to RADIC Event Logger <%s>",
                SIGUSR1, ORTE_NAME_PRINT(&el_proc));
    orte_plm.signal_job(el_proc.jobid, SIGUSR1);

And the first output/printf works fine. Well... I used gdb to run the program, and I can see this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000000
0x0000000000000000 in ?? ()
(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x000000010065c319 in vprotocol_receiver_eventlog_connect (el_comm=0x10065d178) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_eventlog.c:67
#2  0x000000010065ba9a in mca_vprotocol_receiver_send (buf=0x100500000, count=262144, datatype=0x100263d60, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x1002760c0) at ../../../../../../../../ompi/mca/pml/v/mca/vprotocol/receiver/vprotocol_receiver_send.c:46
#3  0x0000000100077d44 in MPI_Send ()
#4  0x0000000100000a97 in main (argc=3, argv=0x7fff5fbff0c8) at ping.c:45

Line 67 of vprotocol_receiver_eventlog.c is the orte_plm.signal_job call. After that, zeros and question marks... is the signal_job function actually available? I really don't understand what all those zeros mean.

Leonardo

On Mar 17, 2010, at 2:06 PM, Ralph Castain wrote:

Thanks for clarifying - guess I won't chew just yet. :-)

I still don't see in your trace where it is failing in signal_job. I didn't see the message indicating it was sending the signal cmd out in your prior debug output, and there isn't a printf in that code loop other than the debug output. Can you attach to the process and get more info?

On Mar 17, 2010, at 6:50 AM, Leonardo Fialho wrote:

Ralph, don't swallow your message yet... The two jobs are not running under the same mpirun. There are two instances of mpirun: one runs with "-report-uri ../contact.txt" and the other receives its contact info using "-ompi-server file:../contact.txt". And yes, both processes are running with plm_base_verbose activated. When I deactivate plm_base_verbose the error is practically the same:

[aopclf:54106] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[47640,1],0]>
[aopclf:54106] *** Process received signal ***
[aopclf:54106] Signal: Segmentation fault (11)
[aopclf:54106] Signal code: Address not mapped (1)
[aopclf:54106] Failing at address: 0x0
[aopclf:54106] [ 0] 2   libSystem.B.dylib                   0x00007fff83a6eeaa _sigtramp + 26
[aopclf:54106] [ 1] 3   libSystem.B.dylib                   0x00007fff83a210b7 snprintf + 496
[aopclf:54106] [ 2] 4   mca_vprotocol_receiver.so           0x000000010065ba0a mca_vprotocol_receiver_send + 177
[aopclf:54106] [ 3] 5   libmpi.0.dylib                      0x0000000100077d44 MPI_Send + 734
[aopclf:54106] [ 4] 6   ping                                0x0000000100000a97 main + 431
[aopclf:54106] [ 5] 7   ping                                0x00000001000008e0 start + 52
[aopclf:54106] *** End of error message ***

Leonardo

On Mar 17, 2010, at 5:43 AM, Ralph Castain wrote:

I'm going to have to eat my last message. It slipped past me that your other job was started via comm_spawn. Since both "jobs" are running under the same mpirun, there shouldn't be a problem sending a signal between them.

I don't know why this would be crashing. Are you sure it is crashing in signal_job? Your trace indicates it is crashing in a print statement, yet there is no print statement in signal_job. Or did you run this with plm_base_verbose set so that the verbose prints are trying to run (could be we have a bug in one of them)?

On Mar 16, 2010, at 6:59 PM, Leonardo Fialho wrote:

Well, thank you anyway :)

On Mar 17, 2010, at 1:54 AM, Ralph Castain wrote:

Yeah, that probably won't work. The current code isn't intended to cross jobs like that - I'm sure nobody ever tested it for that idea, and I'm pretty sure it won't support it.

I don't currently know any way to do what you are trying to do. We could extend the signal code to handle it, I would think...but I'm not sure how soon that might happen.


On Mar 16, 2010, at 6:47 PM, Leonardo Fialho wrote:

Yes... but something is going wrong... maybe the problem is that the jobid is different from the process's jobid, I don't know.

I'm trying to send a signal to another process running under another job. The other process jumps into an accept_connect on the MPI comm. So I wrote code like this (I removed verification code and comments; this is just a summary of a happy execution):

ompi_dpm.parse_port(port, &hnp_uri, &rml_uri, &el_tag);
orte_rml_base_parse_uris(rml_uri, &el_proc, NULL);
ompi_dpm.route_to_port(hnp_uri, &el_proc);
orte_plm.signal_job(el_proc.jobid, SIGUSR1);
ompi_dpm.connect_accept(MPI_COMM_SELF, 0, port, true, el_comm);

el_proc is defined as an orte_process_name_t, not a pointer to it. And signal.h has been included for SIGUSR1's sake. But when the code enters the signal_job function it crashes. I'm trying to debug it right now... the crash is the following:

[Fialho-2.local:51377] receiver: looking for: radic_eventlog[0]
[Fialho-2.local:51377] receiver: found port <784793600.0;tcp://192.168.1.200:54071+784793601.0;tcp://192.168.1.200:54072:300>
[Fialho-2.local:51377] receiver: HNP URI <784793600.0;tcp://192.168.1.200:54071>, RML URI <784793601.0;tcp://192.168.1.200:54072>, TAG <300>
[Fialho-2.local:51377] receiver: sending SIGUSR1 <30> to RADIC Event Logger <[[11975,1],0]>
[Fialho-2:51377] *** Process received signal ***
[Fialho-2:51377] Signal: Segmentation fault (11)
[Fialho-2:51377] Signal code: Address not mapped (1)
[Fialho-2:51377] Failing at address: 0x0
[Fialho-2:51377] [ 0] 2   libSystem.B.dylib                   0x00007fff83a6eeaa _sigtramp + 26
[Fialho-2:51377] [ 1] 3   libSystem.B.dylib                   0x00007fff83a210b7 snprintf + 496
[Fialho-2:51377] [ 2] 4   mca_vprotocol_receiver.so           0x000000010065ba0a mca_vprotocol_receiver_send + 177
[Fialho-2:51377] [ 3] 5   libmpi.0.dylib                      0x0000000100077d44 MPI_Send + 734
[Fialho-2:51377] [ 4] 6   ping                                0x0000000100000a97 main + 431
[Fialho-2:51377] [ 5] 7   ping                                0x00000001000008e0 start + 52
[Fialho-2:51377] [ 6] 8   ???                                 0x0000000000000003 0x0 + 3
[Fialho-2:51377] *** End of error message ***

Except for the signal_job call, the code works; I have tested it by forcing an accept on the other process and avoiding the signal_job. But I want to send the signal to wake up the other side and to be able to manage multiple connect/accepts.

Thanks,
Leonardo

On Mar 17, 2010, at 1:33 AM, Ralph Castain wrote:

Sure! So long as you add the include, you are okay as the ORTE layer is "below" the OMPI one.

On Mar 16, 2010, at 6:29 PM, Leonardo Fialho wrote:

Thanks Ralph, one last question... is orte_plm.signal_job exposed/available to be called from a PML component? Yes, I have the orte/mca/plm/plm.h include line.

Leonardo

On Mar 16, 2010, at 11:59 PM, Ralph Castain wrote:

It's just the orte_process_name_t jobid field. So if you have an orte_process_name_t *pname, then it would just be

orte_plm.signal_job(pname->jobid, sig)


On Mar 16, 2010, at 3:23 PM, Leonardo Fialho wrote:

Hum... and to signal a job the function is probably orte_plm.signal_job(jobid, signal), right?

Now my dumb question is how to obtain the jobid part from an orte_process_name_t variable. Is there any magical function in name_fns.h?

Thanks,
Leonardo

On Mar 16, 2010, at 10:12 PM, Ralph Castain wrote:

Afraid not - you can signal a job, but not a specific process. We used to have such an API, but nobody ever used it. Easy to restore if someone has a need.

On Mar 16, 2010, at 2:45 PM, Leonardo Fialho wrote:

Hi,

Is there any function in Open MPI's frameworks to send a signal to other ORTE proc?

For example, the ORTE process [[1234,1],1] wants to send a signal to process [[1234,1],4] located on another node. I'm looking for this kind of function, but I have only found functions that send a signal to all procs on a node.

Thanks,
Leonardo
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

