Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] orte question
From: Greg Watson (g.watson_at_[hidden])
Date: 2011-07-27 17:14:45


Ralph,

Looking good so far. I did notice that ompi-ps always seems to have an exit code of 243. Is that on purpose?

Greg

On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote:

> r24944 - let me know how it works!
>
>
> On Jul 25, 2011, at 1:01 PM, Greg Watson wrote:
>
>> That would probably be more intuitive.
>>
>> Thanks,
>> Greg
>>
>> On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote:
>>
>>> job 0 is mpirun and its daemons - I can have it ignore that job as I doubt users care :-)
>>>
>>> On Jul 25, 2011, at 12:25 PM, Greg Watson wrote:
>>>
>>>> Ralph,
>>>>
>>>> The output format looks good, but I'm not sure it's quite correct. If I run the mpirun command, I see the following:
>>>>
>>>> mpirun:47520:num nodes:1:num jobs:2
>>>> jobid:0:state:RUNNING:slots:0:num procs:0
>>>> jobid:1:state:RUNNING:slots:1:num procs:4
>>>> process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED
>>>> process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED
>>>> process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED
>>>> process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED
>>>>
>>>> Seems to indicate there are two jobs, but one of them has 0 procs. Is that expected? Not a huge problem, since I can just ignore the job with 0 procs.
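
For anyone consuming this output from a tool, here is a minimal Python sketch of a parser for the colon-separated format shown above. It assumes each line is a flat sequence of key:value pairs and that no value contains a colon; neither is stated in the thread. The function names and the returned structure are hypothetical, not part of orte-ps. It also drops jobs that report 0 procs, which is the workaround mentioned above for the mpirun/daemon job.

```python
def parse_fields(line):
    """Split 'key:value:key:value...' into a dict, pairing tokens in order.

    Assumption: values never contain ':' themselves.
    """
    tokens = line.strip().split(":")
    return dict(zip(tokens[0::2], tokens[1::2]))


def parse_orte_ps(text):
    """Return (header, jobs) parsed from 'orte-ps --parseable' style text.

    header: dict built from the 'mpirun:...' line (empty if absent).
    jobs:   list of {"job": dict, "procs": [dict, ...]} entries, one per
            'jobid:...' line, excluding jobs whose 'num procs' is 0.
    """
    header, jobs = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        record = parse_fields(line)
        if line.startswith("mpirun:"):
            header = record
        elif line.startswith("jobid:"):
            jobs.append({"job": record, "procs": []})
        elif line.startswith("process:") and jobs:
            # A process line belongs to the most recent job line.
            jobs[-1]["procs"].append(record)
    # Skip jobs with no processes (e.g. the job holding mpirun and its daemons).
    jobs = [j for j in jobs if int(j["job"].get("num procs", "0")) > 0]
    return header, jobs
```

All values are kept as strings, so callers convert pids and ranks with int() where needed.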
>>>>
>>>> Greg
>>>>
>>>>
>>>> On Jul 23, 2011, at 6:24 PM, Ralph Castain wrote:
>>>>
>>>>> Okay, you should have it in r24929. Use:
>>>>>
>>>>> orte-ps --parseable
>>>>>
>>>>> to get the new output.
>>>>>
>>>>>
>>>>> On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote:
>>>>>
>>>>>> Gar - have to eat my words a bit. The jobid requested by orte-ps is just the "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I described below (copied here):
>>>>>>
>>>>>>> A jobid of 1 indicates the primary application, 2 and above would specify comm_spawned jobs.
>>>>>>
>>>>>> Not providing the jobid at all corresponds to wildcard and returns the status of all jobs under that mpirun.
>>>>>>
>>>>>> To specify which mpirun you want info on, you use the --pid option. It is this option that isn't working properly - orte-ps returns info from all mpiruns instead of restricting the data to the given pid.
>>>>>>
>>>>>> I'll fix that part, and implement the parsable output.
>>>>>>
>>>>>>
>>>>>> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote:
>>>>>>>
>>>>>>>> Hi Ralph,
>>>>>>>>
>>>>>>>> I'd like three things :-)
>>>>>>>>
>>>>>>>> a) A --report-jobid option that prints the jobid on the first line in a form that can be passed to the -jobid option on ompi-ps. Probably tagging it in the output if -tag-output is enabled (e.g. jobid:<jobid>) would be a good idea.
>>>>>>>>
>>>>>>>> b) The orte-ps command output to use the same jobid format.
>>>>>>>
>>>>>>> I started looking at the above, and found that orte-ps is just plain wrong in the way it handles jobid. The jobid consists of two fields: a 16-bit number indicating the mpirun, and a 16-bit number indicating the job within that mpirun. Unfortunately, orte-ps sends a data request to every mpirun out there instead of only to the one corresponding to that jobid.
>>>>>>>
>>>>>>> What we probably should do is have you indicate the mpirun of interest via the -pid option, and then let jobid tell us which job you want within that mpirun. A jobid of 1 indicates the primary application, 2 and above would specify comm_spawned jobs. A jobid of -1 would return the status of all jobs under that mpirun.
>>>>>>>
>>>>>>> If multiple mpiruns are being reported, then the "jobid" in the report should again be the "local" jobid within that mpirun.
>>>>>>>
>>>>>>> After all, you don't really care what the orte-internal 16-bit identifier is for that mpirun.
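
The two-field jobid described above can be illustrated with a short Python sketch. The thread only says the jobid holds two 16-bit numbers; which half carries the mpirun identifier is an assumption here (upper 16 bits for the mpirun, lower 16 bits for the local job), and the function names are hypothetical.

```python
def split_jobid(jobid):
    """Split a 32-bit jobid into (mpirun_id, local_jobid).

    Assumption: the mpirun identifier sits in the upper 16 bits and the
    job-within-mpirun number in the lower 16 bits.
    """
    jobid &= 0xFFFFFFFF
    return (jobid >> 16) & 0xFFFF, jobid & 0xFFFF


def join_jobid(mpirun_id, local_jobid):
    """Inverse of split_jobid under the same assumed layout."""
    return ((mpirun_id & 0xFFFF) << 16) | (local_jobid & 0xFFFF)


# The integer jobid used in the example output later in this thread:
print(split_jobid(719585280))  # (10980, 0)
```

Under this layout, the "local" jobid that orte-ps expects (1 for the primary application, 2 and above for comm_spawned jobs) is simply the second element of the tuple.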
>>>>>>>
>>>>>>>>
>>>>>>>> c) A more easily parsable output format from ompi-ps. It doesn't need to be a full blown XML format, just something like the following would suffice:
>>>>>>>>
>>>>>>>> jobid:719585280:state:Running:slots:1:num procs:4
>>>>>>>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>>>>>>>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>>>>>>>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>>>>>>>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>>>>>>>> jobid:345346663:state:running:slots:1:num procs:2
>>>>>>>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>>>>>>>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
>>>>>>>
>>>>>>> Shouldn't be too hard to do - bunch of if-then-else statements required, though.
>>>>>>>
>>>>>>>>
>>>>>>>> I'd be happy to help with any or all of these.
>>>>>>>
>>>>>>> Appreciate the offer - let me see how hard this proves to be...
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Greg
>>>>>>>>
>>>>>>>> On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> Hmmm...well, it looks like we could have made this nicer than we did :-/
>>>>>>>>>
>>>>>>>>> If you add --report-uri to the mpirun command line, you'll get back the uri for that mpirun. This has the form of <jobid>:<uri>. As the -h option indicates:
>>>>>>>>>
>>>>>>>>> -report-uri | --report-uri <arg0>
>>>>>>>>> Printout URI on stdout [-], stderr [+], or a file
>>>>>>>>> [anything else]
>>>>>>>>>
>>>>>>>>> The "jobid" required by the orte-ps command is the one reported there. We could easily add a --report-jobid option if that makes things easier.
>>>>>>>>>
>>>>>>>>> As to the difference in how orte-ps shows the jobid...well, that's probably historical. orte-ps uses an orte utility function to print the jobid, and that utility always shows the jobid in component form. Again, we could add an option for that or just use the integer version.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 22, 2011, at 7:01 AM, Greg Watson wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Does anyone know if it's possible to get the orte jobid from the mpirun command? If not, how are you supposed to get it to use with orte-ps? Also, orte-ps reports the jobid in [x,y] notation, but the jobid argument seems to be an integer. How does that work?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Greg
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel