On May 26, 2012, at 10:10 AM, Eugene Loh wrote:
> I'm suspicious of some code, but would like comment from someone who understands it.
> In orte/util/nidmap.c orte_util_decode_pidmap(), one cycles through a buffer. One cycles through jobs. For each one, one unpacks num_procs. One also unpacks all sorts of other stuff like bind_idx. In particular, there's
> orte_process_info.bind_idx = bind_idx[ORTE_PROC_MY_NAME->vpid];
> Well, if we spawn a job with more processes than the parent job, we could have vpid >= num_procs and we read garbage which could and I think does lead to some less-than-enjoyable experiences later on.
Well, actually it's a bit of all three :-/
First, you have to remember that we do NOT update pidmaps in application procs. So procs in the parent job only see the initial pidmap that contains only their own job - they never see the pidmap of their children. Thus, their data is correct.
The child job will see both pidmaps. However, the values in orte_process_info are overwritten each time the code parses the data for a job. Since the jobs are recorded (and hence parsed) in order, and the last job is the one a proc actually belongs to, the final values turn out to be correct, even if an intermediate pass over the parent job's data read garbage.
Still, the code really isn't right (especially when we begin to update pidmaps, which is coming soon) and merited a fix. So I committed one (r26498).
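
To illustrate the kind of guard involved, here is a minimal standalone sketch. It is not the r26498 change and does not use the real ORTE types: job_map_t, proc_name_t, and lookup_bind_idx() are hypothetical names made up for this example. The idea is simply to take per-proc values only from the job the proc belongs to, and to check the vpid against that job's num_procs before indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT the ORTE types. */
typedef struct {
    unsigned   jobid;      /* which job this map entry describes */
    size_t     num_procs;  /* number of procs (and bind_idx entries) */
    const int *bind_idx;   /* one entry per vpid in this job */
} job_map_t;

typedef struct {
    unsigned jobid;
    size_t   vpid;
} proc_name_t;

/*
 * Look up the caller's bind index from a list of per-job maps.
 * Only the entry for the caller's own job is consulted, and the vpid is
 * range-checked first, so a child proc never indexes into the (shorter)
 * arrays of its parent job.
 * Returns 0 on success, -1 if the job is absent or the vpid is out of range.
 */
static int lookup_bind_idx(const job_map_t *jobs, size_t njobs,
                           proc_name_t me, int *out)
{
    assert(out != NULL);

    for (size_t i = 0; i < njobs; i++) {
        if (jobs[i].jobid != me.jobid) {
            continue;                  /* some other job's data: skip it */
        }
        if (me.vpid >= jobs[i].num_procs) {
            return -1;                 /* would read past the end */
        }
        *out = jobs[i].bind_idx[me.vpid];
        return 0;
    }
    return -1;                         /* own job not found in the maps */
}
```

With a 2-proc parent job and a 4-proc child job in the list, a child proc with vpid 3 gets its value from the child job's array only; the parent job's 2-entry array is never indexed with 3.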
> devel mailing list