
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes. (fwd)
From: kmuriki_at_[hidden]
Date: 2009-01-14 14:38:55


Hi Jeff,

Here is the code with a warmup broadcast of 10K real values and an
actual broadcast of 100K real*8 values (different buffers):

[kmuriki_at_n0000 pub]$ more testbcast.f90
program em3d
implicit real*8 (a-h,o-z)
include 'mpif.h'
! em3d_inv main driver
! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
! AND THE TOTAL NUMBER OF PROCESSORS
!
integer:: Proc
real*8, allocatable:: dbuf(:)
real warmup(10000)

call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,Proc,IERROR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,Num_Proc,IERROR)

ndat=100000

!print*,'bcasting to no of tasks',num_proc
allocate(dbuf(ndat))
do i=1,ndat
   dbuf(i)=dble(i)
enddo

do i=1,10000
   warmup(i)=(i)
enddo

!print*, 'Making warmup BCAST',proc
call MPI_BCAST(warmup,10000, &
      MPI_REAL,0,MPI_COMM_WORLD,ierror)

t1=MPI_WTIME()
call MPI_BCAST(dbuf,ndat, &
      MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!print*, 'Done with call to broadcast',proc
t2=MPI_WTIME()
write(*,*)'time for bcast',t2-t1

deallocate(dbuf)
call MPI_FINALIZE(IERROR)
end program em3d
[kmuriki_at_n0000 pub]$ !mpif90
mpif90 -o testbcast testbcast.f90
testbcast.f90(20): (col. 1) remark: LOOP WAS VECTORIZED.
testbcast.f90(24): (col. 1) remark: LOOP WAS VECTORIZED.
/global/software/centos-5.x86_64/modules/intel/fce/10.1.018/lib/libimf.so:
warning: warning: feupdateenv is not implemented and will always fail
[kmuriki_at_n0000 pub]$ !mpirun
mpirun -v -display-map -mca btl openib,self -mca mpi_leave_pinned 1
-hostfile ./hostfile.geophys -np 4 ./testbcast
[n0000.scs00:12909] Map for job: 1 Generated by mapping mode: byslot
         Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
         Data for app_context: index 0 app: ./testbcast
                 Num procs: 4
                 Argv[0]: ./testbcast
                 Env[0]: OMPI_MCA_btl=openib,self
                 Env[1]: OMPI_MCA_mpi_leave_pinned=1
                 Env[2]: OMPI_MCA_rmaps_base_display_map=1
                 Env[3]: OMPI_MCA_rds_hostfile_path=./hostfile.geophys
                 Env[4]: OMPI_MCA_orte_precondition_transports=1e4532db63da3056-33551606203d9c19
                 Env[5]: OMPI_MCA_rds=proxy
                 Env[6]: OMPI_MCA_ras=proxy
                 Env[7]: OMPI_MCA_rmaps=proxy
                 Env[8]: OMPI_MCA_pls=proxy
                 Env[9]: OMPI_MCA_rmgr=proxy
                 Working dir: /global/home/users/kmuriki/sample_executables/pub (user: 0)
                 Num maps: 0
         Num elements in nodes list: 2
         Mapped node:
                 Cell: 0 Nodename: n0015.geophys Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
                         Proc Rank: 0 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
                         Proc Rank: 1 Proc PID: 0 App_context index: 0

         Mapped node:
                 Cell: 0 Nodename: n0016.geophys Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
                         Proc Rank: 2 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
                         Proc Rank: 3 Proc PID: 0 App_context index: 0
  time for bcast 5.556106567382812E-003
  time for bcast 5.569934844970703E-003
  time for bcast 2.491402626037598E-002
  time for bcast 2.490019798278809E-002
[kmuriki_at_n0000 pub]$

If I reduce the warmup size from 10K to 1K, below is the output:

  time for bcast 2.994060516357422E-003
  time for bcast 2.840995788574219E-003
  time for bcast 52.0005199909210
  time for bcast 52.0438468456268

Maybe when I tried the 1K warmup, since the size is small it just
used copy-in/copy-out semantics and the RDMA buffers were not set up,
so the actual bcast was slow; when I used the 10K warmup it did set up
the RDMA buffers, and hence the actual bcast was quick.
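
To factor that one-time setup out of the measurement, I am thinking of
changing the timing section to something like the sketch below (a
barrier before timing plus an average over several repetitions; nrep
and irep are just new loop variables I would add to the program above):

! Sketch: time nrep broadcasts after a barrier, so that any one-time
! buffer registration cost is amortized instead of dominating a
! single measurement.
nrep = 10
call MPI_BARRIER(MPI_COMM_WORLD, ierror)
t1 = MPI_WTIME()
do irep = 1, nrep
   call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
enddo
t2 = MPI_WTIME()
write(*,*)'avg time per bcast', (t2 - t1) / nrep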

Is it possible to get more diagnostic output from the mpirun command,
with some additional option, to see whether it is doing copy-in/copy-out,
etc.? With Myrinet, for example, the mpirun '-v' option gives a lot of
diagnostic output.
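
I have been wondering whether something along these lines would show
it; the parameter names below are a guess on my part and may differ
between Open MPI releases, so please correct me if they are wrong:

# guessed diagnostics: list openib BTL parameters, then rerun with a
# verbose BTL framework (names/levels may vary by release)
ompi_info --param btl openib
mpirun -np 4 -mca btl openib,self -mca btl_base_verbose 50 \
    -hostfile ./hostfile.geophys ./testbcast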

Finally, below are the numbers with IB and GigE when I run Bcast from
IMB; they look good:

[kmuriki_at_n0000 runIMB]$ mpirun -v -np 4 --mca btl openib,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:37:54 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

# Calling sequence was:

# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
        #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
             0 1000 0.05 0.05 0.05
             1 1000 11.73 11.75 11.75
             2 1000 10.38 10.40 10.39
             4 1000 10.26 10.28 10.27
             8 1000 10.43 10.45 10.44
            16 1000 10.26 10.28 10.27
            32 1000 10.46 10.48 10.47
            64 1000 10.47 10.49 10.48
           128 1000 10.41 10.43 10.42
           256 1000 11.13 11.15 11.14
           512 1000 11.30 11.31 11.31
          1024 1000 14.45 14.47 14.47
          2048 1000 26.03 26.05 26.04
          4096 1000 44.00 44.04 44.02
          8192 1000 72.21 72.28 72.26
         16384 1000 135.48 135.60 135.56
         32768 1000 297.64 297.71 297.67
         65536 640 579.20 579.37 579.28
        131072 320 1174.31 1174.81 1174.57
        262144 160 2484.21 2486.33 2485.28
        524288 80 2686.47 2695.13 2690.80
       1048576 40 5706.35 5740.59 5723.47
       2097152 20 10705.90 10761.65 10742.98
       4194304 10 21567.58 21678.50 21641.65
[kmuriki_at_n0000 runIMB]$ mpirun -v -np 4 --mca btl tcp,self -hostfile
../pub/hostfile.geophys
/global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
#---------------------------------------------------
# Date : Wed Jan 14 11:38:01 2009
# Machine : x86_64
# System : Linux
# Release : 2.6.18-92.1.18.el5
# Version : #1 SMP Wed Nov 12 09:19:49 EST 2008
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

# Calling sequence was:

# /global/home/groups/scs/tests/IMB/IMB_3.1/src/IMB-MPI1 -npmin 4 bcast

# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# Bcast

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 4
#----------------------------------------------------------------
        #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
             0 1000 0.05 0.06 0.05
             1 1000 51.23 51.31 51.25
             2 1000 49.98 50.08 50.01
             4 1000 49.93 50.08 49.97
             8 1000 51.23 51.39 51.27
            16 1000 49.92 50.04 49.96
            32 1000 49.88 50.02 49.93
            64 1000 49.94 50.07 49.99
           128 1000 50.03 50.19 50.08
           256 1000 53.46 53.62 53.53
           512 1000 62.36 62.52 62.41
          1024 1000 74.82 75.05 74.89
          2048 1000 190.87 191.09 190.98
          4096 1000 215.01 215.29 215.20
          8192 1000 285.16 285.41 285.28
         16384 1000 426.49 426.79 426.64
         32768 1000 680.94 681.29 681.16
         65536 640 1148.72 1149.69 1149.34
        131072 320 2511.92 2512.13 2512.03
        262144 160 4716.58 4717.14 4716.86
        524288 80 8010.99 8016.05 8013.21
       1048576 40 16657.90 16676.32 16667.73
       2097152 20 27720.20 27916.86 27825.34
       4194304 10 54355.69 54781.70 54585.30
[kmuriki_at_n0000 runIMB]$

thanks,
Krishna.

On Wed, 14 Jan 2009, Jeff Squyres wrote:

> On Jan 13, 2009, at 3:32 PM, kmuriki_at_[hidden] wrote:
>
>>> With IB, there's also the issue of registered memory. Open MPI v1.2.x
>>> defaults to copy in/copy out semantics (with pre-registered memory) until
>>> the message reaches a certain size, and then it uses a pipelined
>>> register/RDMA protocol. However, even with copy in/out semantics of small
>>> messages, the resulting broadcast should still be much faster than over
>>> gige.
>>> Are you using the same buffer for the warmup bcast as the actual bcast?
>>> You might try using "--mca mpi_leave_pinned 1" to see if that helps as
>>> well (will likely only help with large messages).
>>
>> I'm using different buffers for warmup and actual bcast. I tried the
>> mpi_leave_pinned 1, but did not see any difference in behaviour.
>
> In this case, you likely won't see much of a difference -- mpi_leave_pinned
> will generally only be a boost for long messages that use the same buffers
> repeatedly.
>
>> Maybe whenever Open MPI defaults to copy-in/copy-out semantics on my
>> cluster it performs very slowly (slower than GigE), but not when it uses RDMA.
>
> That would be quite surprising. I still think there's some kind of startup
> overhead going on here.
>
>>>> Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs
>>>> performs very quickly (forget about warmup and actual broadcast),
>>>> whereas a single 80K broadcast is slow. Not sure if I'm missing
>>>> anything!
>>> There's also the startup time and synchronization issues. Remember that
>>> although MPI_BCAST does not provide any synchronization guarantees, it
>>> could well be that the 1st bcast effectively synchronizes the processes
>>> and the 2nd one therefore runs much faster (because individual processes
>>> won't need to spend much time blocking waiting for messages because
>>> they're effectively operating in lock step after the first bcast).
>>> Benchmarking is a very tricky business; it can be extremely difficult to
>>> precisely measure exactly what you want to measure.
>>
>> My main effort here is not to benchmark my cluster but to resolve a
>> user problem, wherein he complained that his bcasts were running very slowly.
>> I tried to recreate the situation with a simple Fortran program
>> which just performs a bcast of a size similar to the one in his code.
>> It also performed very slowly (slower than GigE); then I started increasing
>> and decreasing the bcast sizes and observed that it performs slowly only
>> in the range of 8K bytes to 100K bytes.
>
>
> Can you send your modified test program (with a warmup send)?
>
> What happens if you run a benchmark like the broadcast section of IMB on TCP
> and IB?
>
> --
> Jeff Squyres
> Cisco Systems
>