Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem with HPL while using OpenMPI 1.3.3
From: Gus Correa (gus_at_[hidden])
Date: 2009-12-29 13:18:04


Hi Ilya

OK, with 28 nodes and 4GB/node,
you have much more memory than I thought.
The maximum N is calculated from the total memory
you have (assuming the nodes are homogeneous, i.e. all have the same RAM),
not from the memory per node.
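Just to make that concrete, here is a small sketch of the rule of thumb
from my earlier message (N = sqrt(0.8 * RAM_in_bytes / 8)); the 28 nodes
and 4 GB/node below are simply your numbers plugged in, and the 0.8
headroom factor is only a rough convention:

===
/* Sketch: plug the rule of thumb N = sqrt(0.8 * RAM_in_bytes / 8)
 * into a few lines of C; 28 nodes and 4 GB/node are your numbers,
 * the 0.8 headroom factor is just the usual rough convention. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double total_ram = 28.0 * 4.0e9;          /* 28 nodes x 4 GB, in bytes */

    /* 8 bytes per double matrix element, keep ~20% of RAM for OS/MPI. */
    double n_max = sqrt(0.8 * total_ram / 8.0);
    printf("rule-of-thumb maximum N: about %.0f\n", n_max);

    /* Memory a given N needs once the same 20% headroom is added,
     * e.g. the N=17920 you tried: */
    double n = 17920.0;
    printf("N = %.0f needs about %.1f GB (matrix + headroom)\n",
           n, n * n * 8.0 / 0.8 / 1.0e9);
    return 0;
}
===

With your 112 GB total that comes out well above N=100000, so 17920 is
nowhere near the memory limit, which points away from RAM as the cause.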

I haven't tried OpenMPI 1.3.3.
The last time I ran HPL was with OpenMPI 1.3.2, and it worked fine.
It also worked with OpenMPI 1.3.1 and 1.3.0, but these versions
had a problem that caused memory leaks (at least on Infiniband, not sure
about Ethernet).
The problem was fixed in later OpenMPI versions (1.3.2 and newer).
In any case, there was even a workaround in the command line for that
("-mca mpi_leave_pinned 0") for 1.3.0 and 1.3.1.
However, AFAIK, this workaround is not needed for 1.3.2 and newer.

What is your OpenMPI mpiexec command line?

Is it possible that you somehow mixed a 32-bit machine/OpenMPI build
with a 64-bit machine/OpenMPI build?
For instance, your head node (where you compiled the code) is
64-bit, but the compute nodes - or some of them - are 32-bit,
or vice versa?
The error messages you posted hint at something like that:
a mix of MPI_DOUBLE types, MPI_Aint types, etc.
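One quick way to rule that out is a tiny test like the sketch below
(just something I am improvising here, not part of HPL): compile it with
the same mpicc you used for HPL and launch it on all the nodes with your
usual mpiexec; every rank should report the same sizes.

===
/* Sketch: every rank reports its pointer and MPI_Aint widths, so a
 * stray 32-bit node or 32-bit OpenMPI build shows up immediately. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int  rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("rank %d of %d on %s: sizeof(void*)=%d sizeof(MPI_Aint)=%d\n",
           rank, size, name, (int)sizeof(void *), (int)sizeof(MPI_Aint));

    MPI_Finalize();
    return 0;
}
===

On a pure 64-bit cluster both sizes should be 8 on every rank; a 4 showing
up anywhere could produce exactly the kind of bogus MPI_Aint values you posted.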

Also, make sure no mpi.h/mpif.h is hardwired into your HPL code
or into the supporting libraries (BLAS/LAPACK, Goto BLAS, ATLAS, etc.).
Those include files are NOT portable across MPI flavors,
and they are a source of frustration when hardwired into the code.

Furthermore, make sure you don't have leftover HPL processes
from old runs hanging on the compute nodes.
That is a common cause of trouble.
Could this be the reason for the problems you saw?

Good luck.

I hope it helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

ilya zelenchuk wrote:
> Hello, Gus!
>
> Sorry for the lack of debug info.
> I have 28 nodes. Each node has two 2.4 GHz Xeon processors and 4 GB of RAM.
> OpenMPI 1.3.3 was compiled as:
> CC=icc CFLAGS=" -O3" CXX=icpc CXXFLAGS=" -O3" F77=ifort FFLAGS=" -O3"
> FC=ifort FCFLAGS=" -O3" ./configure --prefix=/opt/openmpi/intel/
> --enable-debug --enable-mpi-threads --disable-ipv6
>
> 2009/12/28 Gus Correa <gus_at_[hidden]>:
>> Hi Ilya
>>
>> Did you recompile HPL with OpenMPI, or just launch the MPICH2
>> executable with the OpenMPI mpiexec?
>> You probably know this, but you cannot mix different MPIs at
>> compile and run time.
> Yes, I know this. I compiled HPL with OpenMPI and ran it with the same
> one. For MPICH2, I recompiled HPL.
>
>> Also, the HPL maximum problem size (N) depends on how much
>> memory/RAM you have.
>> If you make N too big, the arrays don't fit in RAM and you get into
>> memory paging, which is no good for MPI.
>> How much RAM do you have?
> 4 GB on each node. Also, I've watched the memory usage through top.
> No swap is used.
>
>> N=17920 would require about 3.2GB, if I did the math right.
>> A rule of thumb is maximum N = sqrt(0.8 * RAM_in_bytes / 8)
>> Have you tried smaller values (above 10000, but below 17920)?
>> For which N does it start to break?
> With 8960 it works almost fine. I got the same errors just once, and
> they disappeared after rebooting the cluster :)
> But if I set the problem size to 11200, the errors come again. At this
> point rebooting doesn't help.
>
> BTW: in the output I have:
>
> ===
> type 11 count ints 82 count disp 81 count datatype 81
> ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
> MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
> 224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
> 448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
> 672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
> 896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
> 1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
> 1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
> 1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
> 1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
> types: (81 * MPI_DOUBLE)
> type 11 count ints 82 count disp 81 count datatype 81
> ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
> 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
> MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
> 224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
> 448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
> 672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
> 896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
> 1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
> 1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
> 1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
> 1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
> types: (81 * MPI_DOUBLE)
> ...
> ===
>
> Interestingly, it seems that HPL runs just fine, but with these
> warning messages in stdout and stderr.
> Also, I've run HPL with OpenMPI 1.4 - no warnings and no errors.
>
>> The HPL TUNING file may help:
>> http://www.netlib.org/benchmark/hpl/tuning.html
> Yes, it's a good one!
>
>> Good luck.
>>
>> My two cents,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> ilya zelenchuk wrote:
>>> Good day, everyone!
>>>
>>> I have a problem while running the HPL benchmark with OpenMPI 1.3.3.
>>> When the problem size (Ns) is smaller than 10000, all is good. But when
>>> I set Ns to 17920 (for example), I get errors:
>>>
>>> ===
>>> [ums1:05086] ../../ompi/datatype/datatype_pack.h:37
>>> Pointer 0xb27752c0 size 4032 is outside [0xb27752c0,0x10aeac8] for
>>> base ptr 0xb27752c0 count 1 and data
>>> [ums1:05086] Datatype 0x83a0618[] size 5735048 align 4 id 0 length 244
>>> used 81
>>> true_lb 0 true_ub 1318295560 (true_extent 1318295560) lb 0 ub
>>> 1318295560 (extent 1318295560)
>>> nbElems 716881 loops 0 flags 102 (commited )-c-----GD--[---][---]
>>> contain MPI_DOUBLE
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x0 (0) extent 8
>>> (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x11800 (71680)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x23000 (143360)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x34800 (215040)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x46000 (286720)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x57800 (358400)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x69000 (430080)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x7a800 (501760)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x8c000 (573440)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x9d800 (645120)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xaf000 (716800)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xc0800 (788480)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xd2000 (860160)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xe3800 (931840)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0xf5000 (1003520)
>>> extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x106800
>>> (1075200) extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x118000
>>> (1146880) extent 8 (size 71040)
>>> --C---P-D--[ C ][FLT] MPI_DOUBLE count 8880 disp 0x129800
>>> (1218560) extent 8 (size 71040)
>>> ....
>>> ===
>>>
>>> Here is my HPL.dat:
>>>
>>> ===
>>> HPLinpack benchmark input file
>>> Innovative Computing Laboratory, University of Tennessee
>>> HPL.out output file name (if any)
>>> 6 device out (6=stdout,7=stderr,file)
>>> 1 # of problems sizes (N)
>>> 17920 Ns
>>> 1 # of NBs
>>> 80 NBs
>>> 0 PMAP process mapping (0=Row-,1=Column-major)
>>> 1 # of process grids (P x Q)
>>> 2 Ps
>>> 14 Qs
>>> 16.0 threshold
>>> 1 # of panel fact
>>> 2 PFACTs (0=left, 1=Crout, 2=Right)
>>> 1 # of recursive stopping criterium
>>> 4 NBMINs (>= 1)
>>> 1 # of panels in recursion
>>> 2 NDIVs
>>> 1 # of recursive panel fact.
>>> 2 RFACTs (0=left, 1=Crout, 2=Right)
>>> 1 # of broadcast
>>> 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
>>> 1 # of lookahead depth
>>> 1 DEPTHs (>=0)
>>> 2 SWAP (0=bin-exch,1=long,2=mix)
>>> 64 swapping threshold
>>> 0 L1 in (0=transposed,1=no-transposed) form
>>> 0 U in (0=transposed,1=no-transposed) form
>>> 1 Equilibration (0=no,1=yes)
>>> 8 memory alignment in double (> 0)
>>> ===
>>>
>>> I've run HPL with this HPL.dat using MPICH2 - it works well.