Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes.
From: kmuriki_at_[hidden]
Date: 2009-01-12 14:50:33


Hi Jeff,

Thanks for your response.
Is there any requirement on the size of the data buffers
I should use in these warmup broadcasts? If I use a small
buffer of 1000 real values during warmup, the subsequent
actual, timed MPI_BCAST over IB takes a lot of time
(more than it does over GiGE). If I use a bigger warmup buffer
of 10000 real values, the subsequent timed MPI_BCAST is quick.

Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs
performs very quickly (regardless of which one is considered the
warmup and which the timed broadcast), whereas a single 80K
broadcast is slow. Not sure if I'm missing anything!
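
For reference, here is roughly what my warmup-plus-timed test looks like.
This is only a minimal sketch; the program name, 'wbuf', and 'nwarm' are
placeholder names I'm using here, not taken from my actual test code:

program warmup_bcast
  implicit none
  include 'mpif.h'

  integer :: ierror, proc, num_proc, nwarm, ndat, i
  real*8, allocatable :: wbuf(:), dbuf(:)
  real*8 :: t1, t2

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, proc, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_proc, ierror)

  ! Untimed warmup broadcast: should absorb the one-time cost of the
  ! lazy IB connection setup before anything is measured.
  nwarm = 10000          ! 10000 real*8 values = 80K bytes (placeholder size)
  allocate(wbuf(nwarm))
  wbuf = 1.0d0
  call MPI_BCAST(wbuf, nwarm, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
  call MPI_BARRIER(MPI_COMM_WORLD, ierror)
  deallocate(wbuf)

  ! Timed broadcast, same pattern as my testbcast.f90 below.
  ndat = 10000
  allocate(dbuf(ndat))
  do i = 1, ndat
    dbuf(i) = dble(i)
  enddo
  t1 = MPI_WTIME()
  call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
  t2 = MPI_WTIME()
  write(*,*) 'rank', proc, 'time for bcast', t2 - t1
  deallocate(dbuf)

  call MPI_FINALIZE(ierror)
end program warmup_bcast

The barrier after the warmup is just there so that no rank starts its timer
while another rank is still finishing the warmup broadcast.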

Thanks for your time and suggestions,
--Krishna.

On Mon, 12 Jan 2009, Jeff Squyres wrote:

> You might want to do some "warmup" bcasts before doing your timing
> measurements.
>
> Open MPI makes network connections lazily, meaning that we only make
> connections upon the first send (e.g., the sends underneath the MPI_BCAST).
> So the first MPI_BCAST is likely to be quite slow, while all the IB network
> connections are being made. Subsequent bcasts are likely to be much faster.
>
>
> On Jan 9, 2009, at 8:47 PM, kmuriki_at_[hidden] wrote:
>
>>
>> Hello there,
>>
>> We have a DDR IB cluster with Open MPI ver 1.2.8.
>> I'm testing on two nodes with two processors each and both
>> the nodes are adjacent (2 hops distant) on the same leaf
>> of the tree interconnect.
>>
>> I observe that when I MPI_BCAST among the four MPI
>> tasks, it takes a lot of time over the IB network (more than
>> over the GiGE network) when the payload size ranges from
>> 24K bytes to 800K bytes.
>>
>> For payloads below 8K bytes and above 200K bytes the performance
>> is acceptable.
>>
>> Any suggestions on how to debug this and locate the source of
>> the problem? (More info below.) Please let me know if you need
>> any more information from my side.
>>
>> thanks for your time,
>> Krishna Muriki,
>> HPC User Services,
>> Scientific Cluster Support,
>> Lawrence Berkeley National Laboratory.
>>
>> I) Payload size 8M bytes over IB:
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
>> -hostfile hostfile.lr ./testbcast.8000000
>> [n0005.scs00:13902] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.8000000
>> Num procs: 4
>> Argv[0]: ./testbcast.8000000
>> Env[0]: OMPI_MCA_btl=openib,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=1405b3b501aa4086-00dbc7151c7348e1
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 3
>> About to call broadcast 1
>> About to call broadcast 2
>> About to call broadcast 0
>> Done with call to broadcast 2
>> time for bcast 0.133496046066284
>> Done with call to broadcast 3
>> time for bcast 0.148098945617676
>> Done with call to broadcast 0
>> time for bcast 0.113168954849243
>> Done with call to broadcast 1
>> time for bcast 0.145189046859741
>> [kmuriki_at_n0005 pub]$
>>
>>
>> II) Payload size 80K bytes using GiGE Network:
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl tcp,self -np 4
>> -hostfile hostfile.lr ./testbcast.80000
>> [n0005.scs00:13928] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.80000
>> Num procs: 4
>> Argv[0]: ./testbcast.80000
>> Env[0]: OMPI_MCA_btl=tcp,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=305b93d4acc82685-12bbf20d2e6d250b
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 0
>> About to call broadcast 2
>> About to call broadcast 1
>> Done with call to broadcast 2
>> time for bcast 7.137393951416016E-002
>> About to call broadcast 3
>> Done with call to broadcast 3
>> time for bcast 1.110005378723145E-002
>> Done with call to broadcast 0
>> time for bcast 7.121706008911133E-002
>> Done with call to broadcast 1
>> time for bcast 3.379988670349121E-002
>> [kmuriki_at_n0005 pub]$
>>
>> III) Payload size 80K bytes using IB Network:
>>
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
>> -hostfile hostfile.lr ./testbcast.80000
>> [n0005.scs00:13941] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.80000
>> Num procs: 4
>> Argv[0]: ./testbcast.80000
>> Env[0]: OMPI_MCA_btl=openib,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=4cdb5ae2babe9010-709842ac574605f9
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 0
>> About to call broadcast 3
>> About to call broadcast 1
>> Done with call to broadcast 1
>> time for bcast 2.550005912780762E-002
>> About to call broadcast 2
>> Done with call to broadcast 2
>> time for bcast 2.154898643493652E-002
>> Done with call to broadcast 3
>> Done with call to broadcast 0
>> time for bcast 38.1956140995026
>> time for bcast 38.2115209102631
>> [kmuriki_at_n0005 pub]$
>>
>> Finally, here is the Fortran code I'm playing with; I modify the
>> payload size by changing the value of the variable 'ndat':
>>
>> [kmuriki_at_n0005 pub]$ more testbcast.f90
>> program em3d
>>   implicit real*8 (a-h,o-z)
>>   include 'mpif.h'
>>   ! em3d_inv main driver
>>   ! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
>>   ! AND THE TOTAL NUMBER OF PROCESSORS
>>   !
>>   integer :: Proc
>>   real*8, allocatable :: dbuf(:)
>>
>>   call MPI_INIT(ierror)
>>   call MPI_COMM_RANK(MPI_COMM_WORLD,Proc,IERROR)
>>   call MPI_COMM_SIZE(MPI_COMM_WORLD,Num_Proc,IERROR)
>>
>>   ndat=1000000
>>
>>   !print*,'bcasting to no of tasks',num_proc
>>   allocate(dbuf(ndat))
>>   do i=1,ndat
>>     dbuf(i)=dble(i)
>>   enddo
>>
>>   print*, 'About to call broadcast',proc
>>   t1=MPI_WTIME()
>>   call MPI_BCAST(dbuf,ndat, &
>>        MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
>>   print*, 'Done with call to broadcast',proc
>>   t2=MPI_WTIME()
>>   write(*,*)'time for bcast',t2-t1
>>
>>   deallocate(dbuf)
>>   call MPI_FINALIZE(IERROR)
>> end program em3d
>> [kmuriki_at_n0005 pub]$
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>