Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Allreduce hangs
From: Martin Siegert (siegert_at_[hidden])
Date: 2012-06-27 18:32:54


On Wed, Jun 27, 2012 at 02:30:11PM -0400, Jeff Squyres wrote:
> On Jun 27, 2012, at 2:25 PM, Martin Siegert wrote:
>
> >> http://www.open-mpi.org/~jsquyres/unofficial/openmpi-1.6.1ticket3131r26612M.tar.bz2
> >
> > Thanks! I tried this and, indeed, the program (I tested quantum espresso,
> > pw.x, so far) no longer hangs.
>
> Good! We're doing a bit more definitive testing here (it took a little while to figure out how to do that, but we're in the process of doing it now...) before we let this go out into the wild.
>
> > Then I went one step further and benchmarked the following three cases:
> >
> > 1) pw.x compiled with openmpi-1.3.3
> > 2) pw.x compiled with openmpi-1.4.3 and
> > btl_openib_flags = 305
> > btl_openib_eager_limit = 65536
> > in etc/openmpi-mca-params.conf
> > 3) pw.x compiled with openmpi-1.6.1ticket3131r26612M
> >
> > These are the resulting times in seconds per iteration (smaller is better):
> > 1) 33.11
> > 2) 28.23
> > 3) 34.81
> >
> > That's rather disappointing, isn't it?
>
>
> Yes, it is. But #2 is not really comparable with #1 and #3. It's quite
> possible that with newer IB hardware, the eager limit should be bumped
> up by default.
>
> I leave this to Mellanox to figure out...

Good point ... I should run all three cases with the eager limit set to
65536.
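
(For reference, instead of editing etc/openmpi-mca-params.conf I could set
the limit per run on the mpirun command line, something like

  mpirun --mca btl_openib_eager_limit 65536 -np 32 pw.x ...

assuming the usual --mca syntax; 32 is just the core count of these jobs.)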

However, there is another issue that may affect the performance of the 1.6.1
version. I see a LOT of the following messages on stderr:

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host: b413
  Device: mlx4_0
  Function: openib_reg_mr()
  Errno says: Cannot allocate memory (errno=12)

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[b414:15870] 168 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[b414:15870] 131 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 8 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 209 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 144 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
...

The strange thing is that this job used only 32 processors (cores), so I have
no idea what the "168 more processes", etc., are referring to (there is
nothing in /var/log/messages about this, either).
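
As the output itself suggests, I could set orte_base_help_aggregate to 0 to
see every message individually instead of the aggregated counts, e.g.
something like

  mpirun --mca orte_base_help_aggregate 0 -np 32 pw.x ...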

The messages do not appear to be fatal. Nevertheless, do you know what
causes these error messages?

Cheers,
Martin