
Subject: Re: [OMPI users] slowdown with infiniband and latest CentOS kernel
From: Noam Bernstein (noam.bernstein_at_[hidden])
Date: 2013-12-18 11:47:47


On Dec 18, 2013, at 10:32 AM, Dave Love <d.love_at_[hidden]> wrote:

> Noam Bernstein <noam.bernstein_at_[hidden]> writes:
>
>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in some
>> collective communication), but now I'm wondering whether I should just test
>> 1.6.5.
>
> What bug, exactly? As you mentioned vasp, is it specifically affecting
> that?

Yes - I never characterized it fully, but we attached gdb to every
single running vasp process, and all of them were stuck in the same
call to MPI_Allreduce() every time. It only happens on rather large
jobs, so it's not the easiest setup to debug.
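For context, here's a minimal sketch of the kind of collective call involved
(my own illustration, not vasp's actual code); every rank has to reach the
matching MPI_Allreduce, which is why the hang shows the same backtrace on all
processes:

/* Minimal sketch, not vasp's code. Compile with: mpicc allreduce_min.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    double local = 1.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* All ranks must participate in the reduction; a hang here, with
     * every rank showing the same backtrace under gdb, is the symptom
     * described above. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nranks, global);

    MPI_Finalize();
    return 0;
}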

If I can reproduce the problem with 1.6.5, and I can confirm that it's always
locking up in the same call to MPI_Allreduce, and all processes are stuck
in the same call, is there interest in looking into a possible MPI issue?

Given that 1.7.3 seems to be working now, whether 1.6.x works is a bit of a moot
point for us (although I just realized I should check that 1.7.3 still works
with --bind-to core).

                                                                        Noam

