On Tue, Dec 17, 2013 at 11:16:48AM -0500, Noam Bernstein wrote:
> On Dec 17, 2013, at 11:04 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> > Are you binding the procs? We don't bind by default (this will change in 1.7.4), and binding can play a significant role when comparing across kernels.
> > Add "--bind-to-core" to your command line.
> I've previously always used mpi_paffinity_alone=1, and the new behavior
> seems to be independent of whether or not I use it. I'll try bind-to-core.
That would be the problem. That variable no longer exists in 1.7.4 and
has been replaced by hwloc_base_binding_policy. --bind-to core is an
alias of -mca hwloc_base_binding_policy core.
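To make the equivalence concrete, here is a minimal sketch of the two spellings side by side. The application name (./my_app) and the process count (4) are placeholders I made up for illustration, not anything from your setup, and the snippet only prints the command lines rather than launching anything:

```shell
# Hypothetical application and process count; substitute your own.
APP=./my_app
NP=4

# Option form (1.7 series syntax: "--bind-to core", with a space).
BIND_OPT="mpirun --bind-to core -np $NP $APP"

# MCA-parameter form that the option above is an alias of.
BIND_MCA="mpirun -mca hwloc_base_binding_policy core -np $NP $APP"

# Print both command lines for comparison.
echo "$BIND_OPT"
echo "$BIND_MCA"
```

If you were setting mpi_paffinity_alone in an MCA parameter file or through the environment, the new parameter name should be usable the same way. Adding --report-bindings to the mpirun line is a quick way to confirm where the procs actually land.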
> One more possible clue. I haven't done a full test, but for one
> particular setup (newer nodes, single node so presumably using
> sm), there are apparently two ways to fix the problem:
> 1. go back to the previous kernel, but stick with openmpi 1.7.3
> 2. stick with the new kernel, but go back to openmpi 1.6.4
> So it appears to be some interaction between the new kernel and 1.7.3 that
> isn't present with 1.6.4.
> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some
> collective communication), but now I'm wondering whether I should just test