Open MPI User's Mailing List Archives

Subject: [OMPI users] Slow MPI_BCAST for message sizes from 24K bytes to 800K bytes
From: kmuriki_at_[hidden]
Date: 2009-01-09 20:47:08


Hello there,

We have a DDR InfiniBand (IB) cluster running Open MPI 1.2.8.
I'm testing on two nodes with two processors each; the nodes
are adjacent (two hops apart) on the same leaf of the tree
interconnect.

I observe that an MPI_BCAST among the four MPI tasks takes
much longer over the IB network than over the GigE network
when the payload size is between 24K bytes and 800K bytes.

For payloads below 8K bytes and above 800K bytes the performance
is acceptable.

Any suggestions on how I can debug this and locate the source of
the problem? (More info below.) Please let me know if you need
any more information from my side.
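
(One experiment I'm considering, in case it helps frame the question:
pinning the broadcast algorithm explicitly, so that Open MPI's internal,
size-based algorithm switch can be ruled in or out. I'm assuming the tuned
collective component and its MCA parameters are available in this 1.2.8
build; ompi_info should confirm:

   ompi_info --param coll tuned | grep bcast
   mpirun --mca btl openib,self --mca coll_tuned_use_dynamic_rules 1 \
          --mca coll_tuned_bcast_algorithm 1 -np 4 -hostfile hostfile.lr \
          ./testbcast.80000

Repeating the 80K-byte run with each algorithm number that ompi_info
reports should show whether only the default selection is slow over IB.)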

Thanks for your time,
Krishna Muriki,
HPC User Services,
Scientific Cluster Support,
Lawrence Berkeley National Laboratory.

I) Payload size 8M bytes over IB:

[kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
-hostfile hostfile.lr ./testbcast.8000000
[n0005.scs00:13902] Map for job: 1 Generated by mapping mode: byslot
         Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
         Data for app_context: index 0 app: ./testbcast.8000000
                 Num procs: 4
                 Argv[0]: ./testbcast.8000000
                 Env[0]: OMPI_MCA_btl=openib,self
                 Env[1]: OMPI_MCA_rmaps_base_display_map=1
                 Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
                 Env[3]:
OMPI_MCA_orte_precondition_transports=1405b3b501aa4086-00dbc7151c7348e1
                 Env[4]: OMPI_MCA_rds=proxy
                 Env[5]: OMPI_MCA_ras=proxy
                 Env[6]: OMPI_MCA_rmaps=proxy
                 Env[7]: OMPI_MCA_pls=proxy
                 Env[8]: OMPI_MCA_rmgr=proxy
                 Working dir: /global/home/users/kmuriki/sample_executables/pub
(user: 0)
                 Num maps: 0
         Num elements in nodes list: 2
         Mapped node:
                 Cell: 0 Nodename: n0172.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
                         Proc Rank: 0 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
                         Proc Rank: 1 Proc PID: 0 App_context index: 0

         Mapped node:
                 Cell: 0 Nodename: n0173.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
                         Proc Rank: 2 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
                         Proc Rank: 3 Proc PID: 0 App_context index: 0
  About to call broadcast 3
  About to call broadcast 1
  About to call broadcast 2
  About to call broadcast 0
  Done with call to broadcast 2
  time for bcast 0.133496046066284
  Done with call to broadcast 3
  time for bcast 0.148098945617676
  Done with call to broadcast 0
  time for bcast 0.113168954849243
  Done with call to broadcast 1
  time for bcast 0.145189046859741
[kmuriki_at_n0005 pub]$

II) Payload size 80K bytes using GigE Network:

[kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl tcp,self -np 4 -hostfile
hostfile.lr ./testbcast.80000
[n0005.scs00:13928] Map for job: 1 Generated by mapping mode: byslot
         Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
         Data for app_context: index 0 app: ./testbcast.80000
                 Num procs: 4
                 Argv[0]: ./testbcast.80000
                 Env[0]: OMPI_MCA_btl=tcp,self
                 Env[1]: OMPI_MCA_rmaps_base_display_map=1
                 Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
                 Env[3]:
OMPI_MCA_orte_precondition_transports=305b93d4acc82685-12bbf20d2e6d250b
                 Env[4]: OMPI_MCA_rds=proxy
                 Env[5]: OMPI_MCA_ras=proxy
                 Env[6]: OMPI_MCA_rmaps=proxy
                 Env[7]: OMPI_MCA_pls=proxy
                 Env[8]: OMPI_MCA_rmgr=proxy
                 Working dir: /global/home/users/kmuriki/sample_executables/pub
(user: 0)
                 Num maps: 0
         Num elements in nodes list: 2
         Mapped node:
                 Cell: 0 Nodename: n0172.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
                         Proc Rank: 0 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
                         Proc Rank: 1 Proc PID: 0 App_context index: 0

         Mapped node:
                 Cell: 0 Nodename: n0173.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
                         Proc Rank: 2 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
                         Proc Rank: 3 Proc PID: 0 App_context index: 0
  About to call broadcast 0
  About to call broadcast 2
  About to call broadcast 1
  Done with call to broadcast 2
  time for bcast 7.137393951416016E-002
  About to call broadcast 3
  Done with call to broadcast 3
  time for bcast 1.110005378723145E-002
  Done with call to broadcast 0
  time for bcast 7.121706008911133E-002
  Done with call to broadcast 1
  time for bcast 3.379988670349121E-002
[kmuriki_at_n0005 pub]$

III) Payload size 80K bytes using IB Network:

[kmuriki_at_n0005 pub]$ mpirun -v -display-map --mca btl openib,self -np 4
-hostfile hostfile.lr ./testbcast.80000
[n0005.scs00:13941] Map for job: 1 Generated by mapping mode: byslot
         Starting vpid: 0 Vpid range: 4 Num app_contexts: 1
         Data for app_context: index 0 app: ./testbcast.80000
                 Num procs: 4
                 Argv[0]: ./testbcast.80000
                 Env[0]: OMPI_MCA_btl=openib,self
                 Env[1]: OMPI_MCA_rmaps_base_display_map=1
                 Env[2]: OMPI_MCA_rds_hostfile_path=hostfile.lr
                 Env[3]:
OMPI_MCA_orte_precondition_transports=4cdb5ae2babe9010-709842ac574605f9
                 Env[4]: OMPI_MCA_rds=proxy
                 Env[5]: OMPI_MCA_ras=proxy
                 Env[6]: OMPI_MCA_rmaps=proxy
                 Env[7]: OMPI_MCA_pls=proxy
                 Env[8]: OMPI_MCA_rmgr=proxy
                 Working dir: /global/home/users/kmuriki/sample_executables/pub
(user: 0)
                 Num maps: 0
         Num elements in nodes list: 2
         Mapped node:
                 Cell: 0 Nodename: n0172.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,0]
                         Proc Rank: 0 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,1]
                         Proc Rank: 1 Proc PID: 0 App_context index: 0

         Mapped node:
                 Cell: 0 Nodename: n0173.lr Launch id: -1 Username: NULL
                 Daemon name:
                         Data type: ORTE_PROCESS_NAME Data Value: NULL
                 Oversubscribed: False Num elements in procs list: 2
                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,2]
                         Proc Rank: 2 Proc PID: 0 App_context index: 0

                 Mapped proc:
                         Proc Name:
                         Data type: ORTE_PROCESS_NAME Data Value: [0,1,3]
                         Proc Rank: 3 Proc PID: 0 App_context index: 0
  About to call broadcast 0
  About to call broadcast 3
  About to call broadcast 1
  Done with call to broadcast 1
  time for bcast 2.550005912780762E-002
  About to call broadcast 2
  Done with call to broadcast 2
  time for bcast 2.154898643493652E-002
  Done with call to broadcast 3
  Done with call to broadcast 0
  time for bcast 38.1956140995026
  time for bcast 38.2115209102631
[kmuriki_at_n0005 pub]$

Finally, here is the Fortran code I'm playing with; I modify the
payload size by changing the value of the variable 'ndat':

[kmuriki_at_n0005 pub]$ more testbcast.f90
program em3d
implicit real*8 (a-h,o-z)
include 'mpif.h'
! em3d_inv main driver
! INITIALIZE MPI AND DETERMINE BOTH INDIVIDUAL PROCESSOR #
! AND THE TOTAL NUMBER OF PROCESSORS
!
integer:: Proc
real*8, allocatable:: dbuf(:)

call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD,Proc,IERROR)
call MPI_COMM_SIZE(MPI_COMM_WORLD,Num_Proc,IERROR)

ndat=1000000

!print*,'bcasting to no of tasks',num_proc
allocate(dbuf(ndat))
do i=1,ndat
   dbuf(i)=dble(i)
enddo

print*, 'About to call broadcast',proc
t1=MPI_WTIME()
call MPI_BCAST(dbuf,ndat, &
      MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
print*, 'Done with call to broadcast',proc
t2=MPI_WTIME()
write(*,*)'time for bcast',t2-t1

deallocate(dbuf)
call MPI_FINALIZE(IERROR)
end program em3d
[kmuriki_at_n0005 pub]$
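
In case it is useful, here is a variant I'm putting together (only a
sketch, not validated yet): it sweeps the payload size from 8K bytes to
8M bytes in powers of two, with an MPI_BARRIER before each timed MPI_BCAST
so the per-size numbers are easier to compare. The program and variable
names (bcast_sweep, nsize) are new here, not from the code above:

program bcast_sweep
  ! Sketch: time MPI_BCAST over a range of payload sizes (powers of two).
  ! Assumes the same mpif.h environment as testbcast.f90.
  implicit none
  include 'mpif.h'
  integer :: proc, num_proc, ierror, nsize, i
  real*8 :: t1, t2
  real*8, allocatable :: dbuf(:)

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, proc, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, num_proc, ierror)

  nsize = 1024                        ! 1024 real*8 values = 8K bytes
  do while (nsize <= 1048576)         ! up to 1048576 real*8 values = 8M bytes
     allocate(dbuf(nsize))
     do i = 1, nsize
        dbuf(i) = dble(i)
     enddo
     ! Line the ranks up so the timed window covers only the broadcast.
     call MPI_BARRIER(MPI_COMM_WORLD, ierror)
     t1 = MPI_WTIME()
     call MPI_BCAST(dbuf, nsize, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierror)
     t2 = MPI_WTIME()
     write(*,*) 'rank', proc, 'bytes', 8*nsize, 'time for bcast', t2 - t1
     deallocate(dbuf)
     nsize = nsize * 2
  enddo

  call MPI_FINALIZE(ierror)
end program bcast_sweep

If the slow region shows up the same way with this version, it should be
easier to see exactly where the transition happens.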