Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with HPL while using OpenMPI 1.3.3
From: ilya zelenchuk (ilya_at_[hidden])
Date: 2009-12-29 09:19:09


Hello, Gus!

Sorry for the lack of debug info.
I have 28 nodes. Each node has two 2.4 GHz Xeon processors and 4 GB of RAM.
OpenMPI 1.3.3 was compiled as:
CC=icc CFLAGS=" -O3" CXX=icpc CXXFLAGS=" -O3" F77=ifort FFLAGS=" -O3" \
FC=ifort FCFLAGS=" -O3" ./configure --prefix=/opt/openmpi/intel/ \
--enable-debug --enable-mpi-threads --disable-ipv6

2009/12/28 Gus Correa <gus_at_[hidden]>:
> Hi Ilya
>
> Did you recompile HPL with OpenMPI, or just launch the MPICH2
> executable with the OpenMPI mpiexec?
> You probably know this, but you cannot mix different MPIs at
> compile and run time.
Yes, I'm aware of that. I compiled HPL with OpenMPI and ran it with the
same OpenMPI. For MPICH2, I recompiled HPL.

> Also, the HPL maximum problem size (N) depends on how much
> memory/RAM you have.
> If you make N too big, the arrays don't fit in the RAM,
> you get into memory paging, which is no good for MPI.
> How much RAM do you have?
4 GB on each node. I also watched memory usage in top while HPL was
running: no swap was used.

> N=17920 would require about 3.2GB, if I did the math right.
> A rule of thumb is maximum N = sqrt(0.8 * RAM_in_bytes / 8)
> Have you tried smaller values (above 10000, but below 17920)?
> For which N does it start to break?
With N=8960 it works almost fine: I saw the same errors just once, and
they disappeared after rebooting the cluster :)
But if I set the problem size to 11200, the errors come back, and at
that point rebooting doesn't help.
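
The rule of thumb quoted above, maximum N = sqrt(0.8 * RAM_in_bytes / 8), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the thread does not say whether RAM_in_bytes means one node or the aggregate memory of all nodes, and rounding N down to a multiple of NB is an extra convenience that is not part of the quoted rule.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL stores the matrix as 8-byte doubles


def matrix_bytes(n):
    """Bytes needed for an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE


def max_problem_size(ram_bytes, fraction=0.8, nb=None):
    """Largest N such that N*N*8 <= fraction * ram_bytes.

    If nb is given, N is rounded down to a multiple of nb
    (an assumption added here, not part of the quoted rule).
    """
    usable = int(fraction * ram_bytes)
    n = math.isqrt(usable // BYTES_PER_DOUBLE)
    if nb:
        n -= n % nb
    return n


if __name__ == "__main__":
    ram = 4 * 2**30  # 4 GiB, taken here as the RAM figure (assumption)
    print(matrix_bytes(17920))           # 2569011200
    print(max_problem_size(ram))         # 20724
    print(max_problem_size(ram, nb=80))  # 20720
```

For N=17920 the matrix alone is about 2.57 GB; dividing by the 0.8 factor gives roughly 3.2 GB, which matches the figure quoted above.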

BTW, in the output I have:

===
type 11 count ints 82 count disp 81 count datatype 81
ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
types: (81 * MPI_DOUBLE)
type 11 count ints 82 count disp 81 count datatype 81
ints: 81 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800
2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 2800 6481
MPI_Aint: 0 22400 44800 67200 89600 112000 134400 156800 179200 201600
224000 246400 268800 291200 313600 336000 358400 380800 403200 425600
448000 470400 492800 515200 537600 560000 582400 604800 627200 649600
672000 694400 716800 739200 761600 784000 806400 828800 851200 873600
896000 918400 940800 963200 985600 1008000 1030400 1052800 1075200
1097600 1120000 1142400 1164800 1187200 1209600 1232000 1254400
1276800 1299200 1321600 1344000 1366400 1388800 1411200 1433600
1456000 1478400 1500800 1523200 1545600 1568000 1590400 1612800
1635200 1657600 1680000 1702400 1724800 1747200 1769600 1343143936
types: (81 * MPI_DOUBLE)
...
===

Interestingly, HPL seems to run just fine, but with these warning
messages in stdout and stderr.
Also, I've run HPL with OpenMPI 1.4: no warnings or errors.
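
One observation on the original error message quoted further down: the reported range [0xb27752c0,0x10aeac8] has an upper bound that is smaller than its base pointer. A minimal sketch, assuming 32-bit addresses (which the 8-digit pointers suggest, but the thread does not confirm), checks whether base + true_ub wraps around to the printed upper bound. The variable names are hypothetical; the numbers are copied from the log.

```python
# Values copied from the quoted datatype_pack.h debug message.
BASE_PTR = 0xB27752C0        # "base ptr 0xb27752c0"
TRUE_UB = 1318295560         # "true_ub 1318295560"
PRINTED_UPPER = 0x10AEAC8    # upper end of "[0xb27752c0,0x10aeac8]"

ADDRESS_SPACE = 2**32        # assumption: 32-bit pointer arithmetic

unwrapped = BASE_PTR + TRUE_UB
wrapped = unwrapped % ADDRESS_SPACE

print(unwrapped > ADDRESS_SPACE - 1)  # does the sum exceed 32 bits?
print(hex(wrapped))                   # value after 32-bit wraparound
print(wrapped == PRINTED_UPPER)       # does it match the logged bound?
```

If the last line prints True, the logged bound is consistent with a 32-bit wraparound in the debug range check. That would fit the observation that HPL itself runs fine, but it does not prove that this is the cause.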

> The HPL TUNING file may help:
> http://www.netlib.org/benchmark/hpl/tuning.html
Yes, it's a good one!

> Good luck.
>
> My two cents,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> ilya zelenchuk wrote:
>>
>> Good day, everyone!
>>
>> I have a problem running the HPL benchmark with OpenMPI 1.3.3.
>> When the problem size (Ns) is smaller than 10000, all is good. But when
>> I set Ns to 17920 (for example), I get errors:
>>
>> ===
>> [ums1:05086] ../../ompi/datatype/datatype_pack.h:37
>>        Pointer 0xb27752c0 size 4032 is outside [0xb27752c0,0x10aeac8] for
>>        base ptr 0xb27752c0 count 1 and data
>> [ums1:05086] Datatype 0x83a0618[] size 5735048 align 4 id 0 length 244
>> used 81
>> true_lb 0 true_ub 1318295560 (true_extent 1318295560) lb 0 ub
>> 1318295560 (extent 1318295560)
>> nbElems 716881 loops 0 flags 102 (commited )-c-----GD--[---][---]
>>   contain MPI_DOUBLE
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x0 (0) extent 8
>> (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x11800 (71680)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x23000 (143360)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x34800 (215040)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x46000 (286720)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x57800 (358400)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x69000 (430080)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x7a800 (501760)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x8c000 (573440)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x9d800 (645120)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xaf000 (716800)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xc0800 (788480)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xd2000 (860160)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xe3800 (931840)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0xf5000 (1003520)
>> extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x106800
>> (1075200) extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x118000
>> (1146880) extent 8 (size 71040)
>> --C---P-D--[ C ][FLT]     MPI_DOUBLE count 8880 disp 0x129800
>> (1218560) extent 8 (size 71040)
>> ....
>> ===
>>
>> Here is my HPL.dat:
>>
>> ===
>> HPLinpack benchmark input file
>> Innovative Computing Laboratory, University of Tennessee
>> HPL.out      output file name (if any)
>> 6            device out (6=stdout,7=stderr,file)
>> 1            # of problems sizes (N)
>> 17920        Ns
>> 1            # of NBs
>> 80           NBs
>> 0            PMAP process mapping (0=Row-,1=Column-major)
>> 1            # of process grids (P x Q)
>> 2            Ps
>> 14           Qs
>> 16.0         threshold
>> 1            # of panel fact
>> 2            PFACTs (0=left, 1=Crout, 2=Right)
>> 1            # of recursive stopping criterium
>> 4            NBMINs (>= 1)
>> 1            # of panels in recursion
>> 2            NDIVs
>> 1            # of recursive panel fact.
>> 2            RFACTs (0=left, 1=Crout, 2=Right)
>> 1            # of broadcast
>> 2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
>> 1            # of lookahead depth
>> 1            DEPTHs (>=0)
>> 2            SWAP (0=bin-exch,1=long,2=mix)
>> 64           swapping threshold
>> 0            L1 in (0=transposed,1=no-transposed) form
>> 0            U  in (0=transposed,1=no-transposed) form
>> 1            Equilibration (0=no,1=yes)
>> 8            memory alignment in double (> 0)
>> ===
>>
>> I've run HPL with this HPL.dat using MPICH2, and it works well.
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users