Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.
From: Jed Brown (jed_at_[hidden])
Date: 2010-02-06 10:56:01


On Fri, 5 Feb 2010 14:28:40 -0600, Barry Smith <bsmith_at_[hidden]> wrote:
> To cheer you up, when I run with openMPI it runs forever sucking down
> 100% CPU trying to send the messages :-)

On my test box (x86 with 8GB memory), Open MPI (1.4.1) does complete
after several seconds, but still prints the wrong count.

MPICH2 does not actually send the message, as you can see by running the
attached code.

  # Open MPI 1.4.1, correct cols[0]
  [0] sending...
  [1] receiving...
  count -103432106, cols[0] 0

  # MPICH2 1.2.1, incorrect cols[0]
  [1] receiving...
  [0] sending...
  [1] count -103432106, cols[0] 1

How much memory does crush have (you need about 7GB to do this without
swapping)? In particular, most of the time it took Open MPI to send the
message (with your source) was actually just spent faulting the
send/recv buffers. The attached code faults the buffers first, and the
subsequent send/recv takes less than 2 seconds.

Actually, it's clear that MPICH2 never touches either buffer because it
returns immediately regardless of whether they have been faulted first.

Jed



  • text/x-csrc attachment: stored