On Dec 18, 2013, at 10:32 AM, Dave Love <d.love_at_[hidden]> wrote:
> Noam Bernstein <noam.bernstein_at_[hidden]> writes:
>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some
>> collective communication), but now I'm wondering whether I should just test
> What bug, exactly? As you mentioned vasp, is it specifically affecting
Yes - I never characterized it fully, but we attached gdb to every
running vasp process, and all of them were stuck in the same
call to MPI_Allreduce() every time. It only happens on rather large
jobs, so it's not the easiest setup to debug.
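
The gdb check described above could be scripted roughly as follows. This is
only a sketch under stated assumptions, not what was actually run: the helper
names (gdb_backtrace_cmd, allreduce_frames, summarize, collect) are
hypothetical, and it assumes gdb is installed on the compute node and that the
user is allowed to attach to the vasp processes there.

```python
import re
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a non-interactive gdb command that dumps every thread's backtrace."""
    # -p attaches to a running process, -batch exits after the commands finish,
    # -ex runs one gdb command.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def allreduce_frames(backtrace):
    """Return the backtrace lines that mention MPI_Allreduce (any case, incl. PMPI_)."""
    return [
        line.strip()
        for line in backtrace.splitlines()
        if re.search(r"mpi_allreduce", line, re.IGNORECASE)
    ]


def summarize(backtraces):
    """Return (processes with an MPI_Allreduce frame, total processes)."""
    stuck = sum(1 for bt in backtraces.values() if allreduce_frames(bt))
    return stuck, len(backtraces)


def collect(pids):
    """Attach gdb to each PID in turn and return {pid: backtrace text}."""
    result = {}
    for pid in pids:
        proc = subprocess.run(gdb_backtrace_cmd(pid), capture_output=True, text=True)
        result[pid] = proc.stdout
    return result
```

Running collect() on the vasp PIDs of each node and passing the result to
summarize() gives a quick count of how many processes are sitting in the
collective; comparing the lines returned by allreduce_frames() across
processes shows whether they are all in the same call.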
If I can reproduce the problem with 1.6.5, and I can confirm that it's always
locking up in the same call to MPI_Allreduce, with all processes stuck
in that same call, is there interest in looking into a possible MPI issue?
Given that 1.7.3 seems to be working now, whether 1.6.x works is a bit of a moot
point for us (although I just realized that I should check that 1.7.3 still works
with --bind-to core).