In order to overlap communication and computation, I don't want to use
MPI_Wait.
Right. One thing to keep in mind is that there are two ways of
overlapping communication and computation. One is you start a send
(MPI_Isend), you do a bunch of computation while the message is being
sent, and then after the message has been sent you call MPI_Wait just
to clean up. This assumes that the MPI implementation can send a
message while control of the program has been returned to you. The
experts can give you the fine print, but my simple assertion is, "This
doesn't usually happen."
Rather, the MPI implementation typically will send data only when your
code is in some MPI call. That's why you have to call MPI_Test
periodically... or some other MPI function.
For sure the message is being decomposed into chunks, and the chunk size
is probably defined by an environment variable. Do you know how I can
control the chunk size?
I don't. Try running "ompi_info -a" and looking through the
parameters. For the shared-memory BTL, it's mca_btl_sm_max_frag_size.
I also see something like coll_sm_fragment_size. Maybe look at the
parameters that have "btl_openib_max" in their names.