Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Factor of 10 loss in performance with 1.3.x
From: Steve Kargl (sgk_at_[hidden])
Date: 2009-04-06 17:39:37


On Mon, Apr 06, 2009 at 02:04:16PM -0700, Eugene Loh wrote:
> Steve Kargl wrote:
>
> >I recently upgraded OpenMPI from 1.2.9 to 1.3 and then 1.3.1.
> >One of my colleagues reported a dramatic drop in performance
> >with one of his applications. My investigation shows a factor
> >of 10 drop in communication over the memory bus. I've placed
> >a figure that illustrates the problem at
> >
> >http://troutmask.apl.washington.edu/~kargl/ompi_cmp.jpg
> >
> >The legend in the figure has 'ver. 1.2.9 11 <--> 18'. This
> >means communication between node 11 and node 18 over GigE
> >ethernet in my cluster. 'ver. 1.2.9 20 <--> 20' means
> >communication between processes on node 20 where node 20 has
> >8 processors. The image clearly shows
> >
> Not so clearly in my mind since I have trouble discriminating between
> the colors and the overlapping lines and so on. But I'll take your word
> for it that the plot illustrates the point you are reporting.

OK. I've removed the GigE results from the graph and replotted with
points as well as lines. You'll see a red line by itself; the
green and blue lines overlap. The replotted data is now at

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_new.jpg

> It appears that you used to have just better than 1-usec latency (which
> is reasonable), but then it skyrocketed just over 10x with 1.3. I did
> some sm work, but that first appears in 1.3.2.

According to netpipe, I have

version 1.3.1
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.000009131
Sync Time: 0.000018241
Now starting main loop

version 1.2.9
0: node20.cimu.org
1: node20.cimu.org
Latency: 0.000000669
Sync Time: 0.000001811

So, the latency has indeed gone up: from roughly 0.67 usec with 1.2.9
to about 9.1 usec with 1.3.1, a factor of about 13.
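
For what it's worth, the kind of test behind those numbers is a simple
ping-pong: rank 0 sends a tiny message, rank 1 echoes it back, and half
the average round-trip time is the latency. Below is a minimal sketch of
such a test (an illustrative example only, not NetPIPE's actual code;
the 1-byte message and 10000 iterations are arbitrary choices, and it
assumes exactly 2 ranks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char buf = 0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            /* Send a 1-byte message and wait for the echo. */
            MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* Echo the message straight back. */
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* One-way latency is half the average round-trip time. */
    if (rank == 0)
        printf("latency: %g sec\n", (t1 - t0) / (2.0 * iters));

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with -n 2 on node20 the same way as the
commands below, it should show the same order of magnitude as the
NetPIPE latencies, assuming the sm btl is used for the on-node traffic.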

> The huge sm latencies are, so far as I know, inconsistent with
> everyone else's experience with 1.3. Is there any chance you
> could rebuild all three versions and really confirm that the
> observed difference can actually be attributed to differences
> in the OMPI source code? And/or run with "--mca btl
> self,sm" to make sure that the on-node message passing is indeed using sm?
>

The command lines I used are

/usr/local/openmpi-1.2.9/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.2.9/bin/mpiexec -machinefile mf_ompi_2 -n 2 ./z

/usr/local/openmpi-1.3.1/bin/mpicc -o z -O -static GetOpt.c netmpi.c
/usr/local/openmpi-1.3.1/bin/mpiexec --mca btl self,sm -machinefile \
   mf_ompi_2 -n 2 ./z

There is no change in the results, as can be seen at

http://troutmask.apl.washington.edu/~kargl/ompi_cmp_self.sm.jpg

The machinefile contains the single line 'node20.cimu.org slots=2'.

I can rebuild 1.2.9 and 1.3.1. Are there any particular configure
options that I should enable or disable?

-- 
Steve