Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] orte question
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-27 21:58:35


Hmmm...I'm not seeing that behavior. I get a 0 exit code every time.

You'll get a 243 if there are stale session directories lying around, as it indicates that the mpiruns in those dirs are not reachable. Perhaps that is what's happening?
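
For anyone scripting around ompi-ps, here is a minimal Python sketch of handling that exit code. The only fact taken from this thread is that 243 points at stale session directories whose mpiruns cannot be reached. The helper names (`describe_exit`, `run_and_describe`) are hypothetical, and any other non-zero code is simply reported as-is.

```python
import subprocess

# Per this thread: 243 means stale session directories were found and the
# mpiruns they belong to could not be contacted.
STALE_SESSION_EXIT = 243


def describe_exit(code):
    """Map an ompi-ps exit code to a short human-readable hint."""
    if code == 0:
        return "ok"
    if code == STALE_SESSION_EXIT:
        return "stale session directories: some mpiruns are not reachable"
    return "unexpected exit code %d" % code


def run_and_describe(argv):
    """Run a command given as an argv list and return (exit code, hint)."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode, describe_exit(result.returncode)
```

Calling `run_and_describe(["ompi-ps"])` on a machine with Open MPI installed would then tell a caller whether a non-zero status is the stale-directory case or something else.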

On Jul 27, 2011, at 3:14 PM, Greg Watson wrote:

> Ralph,
>
> Looking good so far. I did notice that ompi-ps always seems to have an exit code of 243. Is that on purpose?
>
> Greg
>
> On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote:
>
>> r24944 - let me know how it works!
>>
>>
>> On Jul 25, 2011, at 1:01 PM, Greg Watson wrote:
>>
>>> That would probably be more intuitive.
>>>
>>> Thanks,
>>> Greg
>>>
>>> On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote:
>>>
>>>> job 0 is mpirun and its daemons - I can have it ignore that job as I doubt users care :-)
>>>>
>>>> On Jul 25, 2011, at 12:25 PM, Greg Watson wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> The output format looks good, but I'm not sure it's quite correct. If I run the mpirun command, I see the following:
>>>>>
>>>>> mpirun:47520:num nodes:1:num jobs:2
>>>>> jobid:0:state:RUNNING:slots:0:num procs:0
>>>>> jobid:1:state:RUNNING:slots:1:num procs:4
>>>>> process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED
>>>>> process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED
>>>>> process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED
>>>>> process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED
>>>>>
>>>>> Seems to indicate there are two jobs, but one of them has 0 procs. Is that expected? Not a huge problem, since I can just ignore the job with 0 procs.
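
The output quoted above alternates key and value fields separated by colons, so it can be read with a few lines of Python. This is only a sketch based on the seven sample lines in this message: the function names are hypothetical, and it assumes no key or value ever contains a colon itself, which the thread does not guarantee. `jobs_with_procs` applies the workaround mentioned above of ignoring jobs that report 0 procs.

```python
SAMPLE = """\
mpirun:47520:num nodes:1:num jobs:2
jobid:0:state:RUNNING:slots:0:num procs:0
jobid:1:state:RUNNING:slots:1:num procs:4
process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED
process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED
process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED
process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED
"""


def parse_record(line):
    """Turn 'k1:v1:k2:v2...' into a dict; values stay strings."""
    fields = line.strip().split(":")
    if len(fields) % 2 != 0:
        raise ValueError("odd number of fields: %r" % line)
    return dict(zip(fields[0::2], fields[1::2]))


def parse_ps_output(text):
    """Group job records under their mpirun, and processes under their job."""
    mpiruns = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = parse_record(line)
        kind = line.split(":", 1)[0]
        if kind == "mpirun":
            mpiruns.append({"info": record, "jobs": []})
        elif kind == "jobid" and mpiruns:
            mpiruns[-1]["jobs"].append({"info": record, "procs": []})
        elif kind == "process" and mpiruns and mpiruns[-1]["jobs"]:
            mpiruns[-1]["jobs"][-1]["procs"].append(record)
        else:
            raise ValueError("unexpected line: %r" % line)
    return mpiruns


def jobs_with_procs(mpirun):
    """Skip jobs that report zero processes (job 0 in the sample)."""
    return [j for j in mpirun["jobs"] if int(j["info"]["num procs"]) > 0]
```

With `SAMPLE`, `parse_ps_output` yields one mpirun holding two jobs, and `jobs_with_procs` keeps only job 1 with its four processes.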
>>>>>
>>>>> Greg
>>>>>
>>>>>
>>>>> On Jul 23, 2011, at 6:24 PM, Ralph Castain wrote:
>>>>>
>>>>>> Okay, you should have it in r24929. Use:
>>>>>>
>>>>>> orte-ps --parseable
>>>>>>
>>>>>> to get the new output.
>>>>>>
>>>>>>
>>>>>> On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote:
>>>>>>
>>>>>>> Gar - have to eat my words a bit. The jobid requested by orte-ps is just the "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I described below (copied here):
>>>>>>>
>>>>>>>> A jobid of 1 indicates the primary application, 2 and above would specify comm_spawned jobs.
>>>>>>>
>>>>>>> Not providing the jobid at all corresponds to wildcard and returns the status of all jobs under that mpirun.
>>>>>>>
>>>>>>> To specify which mpirun you want info on, you use the --pid option. It is this option that isn't working properly - orte-ps returns info from all mpiruns and doesn't check to provide only data from the given pid.
>>>>>>>
>>>>>>> I'll fix that part, and implement the parsable output.
>>>>>>>
>>>>>>>
>>>>>>> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote:
>>>>>>>>
>>>>>>>>> Hi Ralph,
>>>>>>>>>
>>>>>>>>> I'd like three things :-)
>>>>>>>>>
>>>>>>>>> a) A --report-jobid option that prints the jobid on the first line in a form that can be passed to the -jobid option on ompi-ps. Probably tagging it in the output if -tag-output is enabled (e.g. jobid:<jobid>) would be a good idea.
>>>>>>>>>
>>>>>>>>> b) The orte-ps command output to use the same jobid format.
>>>>>>>>
>>>>>>>> I started looking at the above, and found that orte-ps is just plain wrong in the way it handles jobid. The jobid consists of two fields: a 16-bit number indicating the mpirun, and a 16-bit number indicating the job within that mpirun. Unfortunately, orte-ps sends a data request to every mpirun out there instead of only to the one corresponding to that jobid.
>>>>>>>>
>>>>>>>> What we probably should do is have you indicate the mpirun of interest via the -pid option, and then let jobid tell us which job you want within that mpirun. A jobid of 1 indicates the primary application, 2 and above would specify comm_spawned jobs. A jobid of -1 would return the status of all jobs under that mpirun.
>>>>>>>>
>>>>>>>> If multiple mpiruns are being reported, then the "jobid" in the report should again be the "local" jobid within that mpirun.
>>>>>>>>
>>>>>>>> After all, you don't really care what the orte-internal 16-bit identifier is for that mpirun.
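
To make the two-field layout concrete, here is a small Python sketch that splits an integer jobid into its two 16-bit halves and puts them back together. It assumes the mpirun identifier sits in the upper 16 bits and the local jobid in the lower 16. The thread only says there are two 16-bit fields, so the bit order and the function names here are assumptions.

```python
def split_jobid(jobid):
    """Return (mpirun_id, local_jobid) for a 32-bit integer jobid."""
    if not 0 <= jobid <= 0xFFFFFFFF:
        raise ValueError("jobid does not fit in 32 bits: %r" % jobid)
    # Assumed layout: upper 16 bits identify the mpirun, lower 16 the job.
    return (jobid >> 16) & 0xFFFF, jobid & 0xFFFF


def join_jobid(mpirun_id, local_jobid):
    """Inverse of split_jobid under the same assumed layout."""
    for part in (mpirun_id, local_jobid):
        if not 0 <= part <= 0xFFFF:
            raise ValueError("field does not fit in 16 bits: %r" % part)
    return (mpirun_id << 16) | local_jobid
```

Under that assumption, the integer 719585280 used as an example elsewhere in this thread splits into mpirun field 10980 and local jobid 0, and the primary application of the same mpirun would be `join_jobid(10980, 1)`.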
>>>>>>>>
>>>>>>>>>
>>>>>>>>> c) A more easily parsable output format from ompi-ps. It doesn't need to be a full-blown XML format; something like the following would suffice:
>>>>>>>>>
>>>>>>>>> jobid:719585280:state:Running:slots:1:num procs:4
>>>>>>>>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>>>>>>>>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>>>>>>>>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>>>>>>>>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>>>>>>>>> jobid:345346663:state:Running:slots:1:num procs:2
>>>>>>>>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>>>>>>>>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
>>>>>>>>
>>>>>>>> Shouldn't be too hard to do - bunch of if-then-else statements required, though.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd be happy to help with any or all of these.
>>>>>>>>
>>>>>>>> Appreciate the offer - let me see how hard this proves to be...
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Greg
>>>>>>>>>
>>>>>>>>> On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Hmmm...well, it looks like we could have made this nicer than we did :-/
>>>>>>>>>>
>>>>>>>>>> If you add --report-uri to the mpirun command line, you'll get back the uri for that mpirun. This has the form of <jobid>:<uri>. As the -h option indicates:
>>>>>>>>>>
>>>>>>>>>> -report-uri | --report-uri <arg0>
>>>>>>>>>>     Printout URI on stdout [-], stderr [+], or a file [anything else]
>>>>>>>>>>
>>>>>>>>>> The "jobid" required by the orte-ps command is the one reported there. We could easily add a --report-jobid option if that makes things easier.
>>>>>>>>>>
>>>>>>>>>> As to the difference in how orte-ps shows the jobid...well, that's probably historical. orte-ps uses an orte utility function to print the jobid, and that utility always shows the jobid in component form. Again, could add or just use the integer version.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 22, 2011, at 7:01 AM, Greg Watson wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Does anyone know if it's possible to get the orte jobid from the mpirun command? If not, how are you supposed to get it to use with orte-ps? Also, orte-ps reports the jobid in [x,y] notation, but the jobid argument seems to be an integer. How does that work?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Greg
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel