Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see processes as bound if the job has been launched by srun
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-02-09 07:33:56


On Feb 9, 2012, at 7:15 AM, nadia.derbey_at_[hidden] wrote:

> > By default, hwloc only shows what's inside the current cpuset. There's
> > an option to show everything instead (topology flag).
>
> So maybe using that flag inside opal_paffinity_base_get_processor_info() would be a better fix than the one I'm proposing in my patch.

Is this trunk, or v1.5/1.6? (or both?)

Perhaps the "good enough" fix for v1.5/1.6 is what you suggested.

But a better fix for the trunk is to use hwloc directly -- after all, paffinity/maffinity is going to go away in the not-too-distant future (in favor of 100% using hwloc's API).

That being said, it looks like opal_hwloc_topology is *not* loaded with HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM. I think the assumption was that we wanted to look at our little foxhole to see exactly where we were bound.

I honestly forget -- if we don't set WHOLE_SYSTEM, does the reported tree only include PUs/etc. in the current cpuset? I.e., might some objects be missing from the tree altogether? The hwloc docs talk about what happens to the cpuset fields in a given object when WHOLE_SYSTEM is set/not set, but they aren't entirely clear on this point.

FWIW, it looks like we're not setting any topology IO flags, either (most likely because we brought in hwloc when it was 1.2.x; i.e., before it supported PCI devices). I'm guessing we should probably set HWLOC_TOPOLOGY_FLAG_WHOLE_IO in all cases.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/