Subject: Re: [OMPI users] slow MPI_BCast for message sizes from 24K bytes to 800K bytes.
From: kmuriki_at_[hidden]
Date: 2009-01-12 14:50:33


Hi Jeff,

Thanks for your response.
Is there any requirement on the size of the data buffers
I should use in these warmup broadcasts? If I use small
buffers of about 1000 real values during warmup, the following
actual, timed MPI_BCAST over IB still takes a long time
(more than over GigE). If I use a bigger warmup buffer of
10000 real values, the following timed MPI_BCAST is quick.
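
For reference, here is a minimal sketch of the warmup pattern I have in mind; the program name, payload size, and variable names are only illustrative and not taken from the test program quoted further down:

   program bcast_warmup
     ! Minimal sketch: warm up with the same count/datatype that will be
     ! timed, synchronize, then time the broadcast of interest.
     implicit none
     include 'mpif.h'
     integer, parameter :: ndat = 10000        ! illustrative payload size
     integer :: proc, nprocs, ierror, i
     real*8  :: t1, t2
     real*8, allocatable :: dbuf(:)

     call MPI_INIT(ierror)
     call MPI_COMM_RANK(MPI_COMM_WORLD, proc, ierror)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

     allocate(dbuf(ndat))
     do i = 1, ndat
        dbuf(i) = dble(i)
     enddo

     ! Untimed warmup broadcast: any one-time connection setup happens here.
     call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)

     ! Make sure every rank has finished the warmup before timing starts.
     call MPI_BARRIER(MPI_COMM_WORLD, ierror)

     t1 = MPI_WTIME()
     call MPI_BCAST(dbuf, ndat, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
     t2 = MPI_WTIME()
     write(*,*) 'rank', proc, 'time for bcast after warmup', t2 - t1

     deallocate(dbuf)
     call MPI_FINALIZE(ierror)
   end program bcast_warmup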

Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs
(setting aside the separate warmup and timed broadcasts) performs
very quickly, whereas a single 80K broadcast is slow. Not sure if
I'm missing anything!
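
To be concrete, here is a minimal sketch of what I mean by consecutive broadcasts, written as a replacement for the single timed call in the testbcast.f90 listing further down (the loop count of 3 is arbitrary; k, t1, t2, and ierror pick up that program's implicit typing):

   ! Time several consecutive broadcasts of the same buffer; with lazy
   ! connection setup only the first iteration should pay the setup cost.
   ! The barrier synchronizes the ranks before each timing.
   do k = 1, 3
      call MPI_BARRIER(MPI_COMM_WORLD, ierror)
      t1 = MPI_WTIME()
      call MPI_BCAST(dbuf, ndat, &
           MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
      t2 = MPI_WTIME()
      write(*,*) 'rank', proc, 'bcast', k, 'time', t2 - t1
   enddo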

Thanks for your time and suggestions,
--Krishna.

On Mon, 12 Jan 2009, Jeff Squyres wrote:

> You might want to do some "warmup" bcasts before doing your timing
> measurements.
>
> Open MPI makes network connections lazily, meaning that we only make
> connections upon the first send (e.g., the sends underneath the MPI_BCAST).
> So the first MPI_BCAST is likely to be quite slow, while all the IB network
> connections are being made. Subsequent bcasts are likely to be much faster.
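
A minimal sketch of forcing those lazy connections to be set up before any timing, assuming connections really are created on the first send as described; whether every pair of ranks is actually touched depends on the all-to-all algorithm in use, so this is only an illustration (sbuf/rbuf and the use of Num_Proc follow the test program quoted below):

   ! Illustrative only: a 1-element all-to-all right after MPI_INIT should
   ! trigger a send between every pair of ranks, so lazy connection setup
   ! happens here rather than inside the first timed MPI_BCAST.
   real*8, allocatable :: sbuf(:), rbuf(:)
   allocate(sbuf(Num_Proc), rbuf(Num_Proc))
   sbuf = 0.0d0
   call MPI_ALLTOALL(sbuf, 1, MPI_DOUBLE_PRECISION, &
                     rbuf, 1, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierror)
   call MPI_BARRIER(MPI_COMM_WORLD, ierror)
   deallocate(sbuf, rbuf)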
>
>
> On Jan 9, 2009, at 8:47 PM, kmuriki_at_[hidden] wrote:
>
>>
>> Hello there,
>>
>> We have a DDR IB cluster with Open MPI ver 1.2.8.
>> I'm testing on two nodes with two processors each; both
>> nodes are adjacent (2 hops apart) on the same leaf
>> of the tree interconnect.
>>
>> I observe that when I do an MPI_BCAST among the four MPI
>> tasks, it takes a lot of time over the IB network (more than over
>> the GigE network) when the payload sizes range from 24K bytes
>> to 800K bytes.
>>
>> For payloads below 8K bytes and above 200K bytes the performance
>> is acceptable.
>>
>> Any suggestions on how to debug this and locate the source of
>> the problem? (More info below.) Please let me know if you need
>> any more information from my side.
>>
>> thanks for your time,
>> Krishna Muriki,
>> HPC User Services,
>> Scientific Cluster Support,
>> Lawrence Berkeley National Laboratory.
>>
>> I) Payload size 8M bytes over IB:
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
>> -hostfile hostfile.lr ./testbcast.8000000
>> [n0005.scs00:13902] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.8000000
>> Num procs: 4
>> Argv[0]: ./testbcast.8000000
>> Env[0]: OMPI_MCA_btl=openib,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=1405b3b501aa4086-00dbc7151c7348e1
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 3
>> About to call broadcast 1
>> About to call broadcast 2
>> About to call broadcast 0
>> Done with call to broadcast 2
>> time for bcast 0.133496046066284
>> Done with call to broadcast 3
>> time for bcast 0.148098945617676
>> Done with call to broadcast 0
>> time for bcast 0.113168954849243
>> Done with call to broadcast 1
>> time for bcast 0.145189046859741
>> [kmuriki_at_n0005 pub]$
>>
>>
>> II) Payload size 80K bytes using GigE Network:
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl tcp,self -np 4
>> -hostfile hostfile.lr ./testbcast.80000
>> [n0005.scs00:13928] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.80000
>> Num procs: 4
>> Argv[0]: ./testbcast.80000
>> Env[0]: OMPI_MCA_btl=tcp,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=305b93d4acc82685-12bbf20d2e6d250b
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 0
>> About to call broadcast 2
>> About to call broadcast 1
>> Done with call to broadcast 2
>> time for bcast 7.137393951416016E-002
>> About to call broadcast 3
>> Done with call to broadcast 3
>> time for bcast 1.110005378723145E-002
>> Done with call to broadcast 0
>> time for bcast 7.121706008911133E-002
>> Done with call to broadcast 1
>> time for bcast 3.379988670349121E-002
>> [kmuriki_at_n0005 pub]$
>>
>> III) Payload size 80K bytes using IB Network:
>>
>>
>> [kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
>> -hostfile hostfile.lr ./testbcast.80000
>> [n0005.scs00:13941] Map for job: 1 Generated by mapping mode: byslot
>> Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
>> Data for app_context: index 0 app: ./testbcast.80000
>> Num procs: 4
>> Argv[0]: ./testbcast.80000
>> Env[0]: OMPI_MCA_btl=openib,self
>> Env[1]: OMPI_MCA_rmaps_base_display_map=1
>> Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
>> Env[3]:
>> OMPI_MCA_orte_precondition_transports=4cdb5ae2babe9010-709842ac574605f9
>> Env[4]: OMPI_MCA_rds=proxy
>> Env[5]: OMPI_MCA_ras=proxy
>> Env[6]: OMPI_MCA_rmaps=proxy
>> Env[7]: OMPI_MCA_pls=proxy
>> Env[8]: OMPI_MCA_rmgr=proxy
>> Working dir:
>> /global/home/users/kmuriki/sample_executables/pub (user: 0)
>> Num maps: 0
>> Num elements in nodes list: 2
>> Mapped node:
>> Cell: 0 Nodename: n0172.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
>> Proc Rank: 0 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
>> Proc Rank: 1 Proc PID: 0 App_context index: 0
>>
>> Mapped node:
>> Cell: 0 Nodename: n0173.lr Launch id: -1 Username:
>> NULL
>> Daemon name:
>> Data type: ORTE_PROCESS_NAME Data Value: NULL
>> Oversubscribed: False Num elements in procs list: 2
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
>> Proc Rank: 2 Proc PID: 0 App_context index: 0
>>
>> Mapped proc:
>> Proc Name:
>> Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
>> Proc Rank: 3 Proc PID: 0 App_context index: 0
>> About to call broadcast 0
>> About to call broadcast 3
>> About to call broadcast 1
>> Done with call to broadcast 1
>> time for bcast 2.550005912780762E-002
>> About to call broadcast 2
>> Done with call to broadcast 2
>> time for bcast 2.154898643493652E-002
>> Done with call to broadcast 3
>> Done with call to broadcast 0
>> time for bcast 38.1956140995026
>> time for bcast 38.2115209102631
>> [kmuriki_at_n0005 pub]$
>>
>> Finally, here is the Fortran code I'm playing with; I modify the
>> payload size by changing the value of the variable 'ndat':
>>
>> [kmuriki_at_n0005 pub]$ more testbcast.f90
>> program em3d
>>   implicit real*8 (a-h,o-z)
>>   include 'mpif.h'
>>   ! em3d_inv main driver
>>   ! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
>>   ! AND THE TOTAL NUMBER OF PROCESSORS
>>   !
>>   integer :: Proc
>>   real*8, allocatable :: dbuf(:)
>>
>>   call MPI_INIT(ierror)
>>   call MPI_COMM_RANK(MPI_COMM_WORLD, Proc, IERROR)
>>   call MPI_COMM_SIZE(MPI_COMM_WORLD, Num_Proc, IERROR)
>>
>>   ndat = 1000000
>>
>>   !print*,'bcasting to no of tasks',num_proc
>>   allocate(dbuf(ndat))
>>   do i = 1, ndat
>>      dbuf(i) = dble(i)
>>   enddo
>>
>>   print *, 'About to call broadcast', proc
>>   t1 = MPI_WTIME()
>>   call MPI_BCAST(dbuf, ndat, &
>>        MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
>>   print *, 'Done with call to broadcast', proc
>>   t2 = MPI_WTIME()
>>   write(*,*) 'time for bcast', t2 - t1
>>
>>   deallocate(dbuf)
>>   call MPI_FINALIZE(IERROR)
>> end program em3d
>> [kmuriki_at_n0005 pub]$
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>