On Mar 6, 2007, at 4:51 PM, Michael wrote:
> MPI_Type_create_struct performs well only when all the data is
> contiguous in memory (at least for OpenMPI 1.1.2).
There is always a benefit to sending contiguous data, especially
when the message is small. Packing and unpacking are costly
operations. Even a highly optimized version cannot beat a
hand-written user pack routine when the data is small. Increase the
size of your message to over 64K and you will see a different story.
> In my case the program has an f90 structure with 11 integers, 2
> logicals, and five 50-element integer arrays. But at the first stage
> of the program only the first element of each of those arrays is
> used. But using MPI_Type_create_struct it is more efficient to send
> the entire 263 words of contiguous memory (58 seconds) than to try
> to send only 18 words of noncontiguous memory (64 seconds). At the
> second stage it's 33 words, and at that stage it becomes 47 seconds
> vs. 163 seconds, an extra 116 seconds, which accounts for most of
> the increase in my overall wall clock time from 130 to 278 seconds.
> The third stage increases from 13 seconds to 37 seconds.
> Because I need to send this block of data back and forth a lot, I
> was hoping to find a way to speed up the transfer of this odd
> block of data and a couple of other variables. I may try PACK and
> UNPACK on the structure, but calling those lots of times can't be
> more efficient.
Is there any way I can get access to your software? Or at least the
data-type related code?
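
Until then, here is a rough sketch of what I mean by a user hand pack
routine, written in C rather than f90. The layout (block_t), the names
hand_pack and hand_unpack, and storing the logicals as int are my
assumptions, not your code. It copies the 11 integers, the 2 logicals,
and the first `used` elements of each of the five arrays into one
contiguous buffer, which gives 18 words for used = 1 and 33 words for
used = 4.

```c
#include <assert.h>
#include <string.h>

#define N_SCALARS 11   /* integer fields in the structure */
#define N_FLAGS    2   /* logical fields, stored here as int (assumption) */
#define N_ARRAYS   5   /* number of integer arrays */
#define ARRAY_LEN 50   /* elements per array */

/* Hypothetical C mirror of the f90 structure: 11 + 2 + 5*50 = 263 words. */
typedef struct {
    int scalars[N_SCALARS];
    int flags[N_FLAGS];
    int arrays[N_ARRAYS][ARRAY_LEN];
} block_t;

/* Copy the scalars, the flags, and the first `used` elements of each
 * array into `buf`. `buf` must hold at least 13 + 5*used ints.
 * Returns the number of words written. */
int hand_pack(const block_t *b, int used, int *buf)
{
    int n = 0;

    assert(used >= 0 && used <= ARRAY_LEN);
    memcpy(buf + n, b->scalars, sizeof b->scalars);
    n += N_SCALARS;
    memcpy(buf + n, b->flags, sizeof b->flags);
    n += N_FLAGS;
    for (int a = 0; a < N_ARRAYS; ++a) {
        memcpy(buf + n, b->arrays[a], (size_t)used * sizeof(int));
        n += used;
    }
    return n;
}

/* Inverse of hand_pack: scatter the packed words back into the
 * structure. Array elements beyond `used` are left untouched. */
void hand_unpack(block_t *b, int used, const int *buf)
{
    int n = 0;

    assert(used >= 0 && used <= ARRAY_LEN);
    memcpy(b->scalars, buf + n, sizeof b->scalars);
    n += N_SCALARS;
    memcpy(b->flags, buf + n, sizeof b->flags);
    n += N_FLAGS;
    for (int a = 0; a < N_ARRAYS; ++a) {
        memcpy(b->arrays[a], buf + n, (size_t)used * sizeof(int));
        n += used;
    }
}
```

The packed buffer can then go out as a single contiguous message of n
MPI_INT words, for example MPI_Send(buf, n, MPI_INT, dest, tag, comm),
with hand_unpack called on the receiving side. Whether this beats the
derived datatype on your cluster is something only a measurement can
tell.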
> ps. I don't currently have valgrind installed on this cluster and
> valgrind is not part of the Debian Linux 3.1r3 distribution. Without
> any experience with valgrind I'm not sure how useful valgrind will
> be with an MPI program of 500+ subroutines and 50K+ lines running on
> 16 processes. It took us a bit to get profiling working for the
> OpenMP version of this code.
It will be seamless. What I'm doing is the following:
instead of: mpirun -np 16 my_program my_args
I'm using: mpirun -np 16 valgrind --tool=callgrind my_program my_args
Once the execution is completed (which will usually take about 20
times longer than without valgrind), I gather all the resulting files
in a common location (if they are not already on NFS) and analyze them
with kcachegrind (which ships by default with KDE).
> On Mar 6, 2007, at 11:28 AM, George Bosilca wrote:
>> I doubt this comes from MPI_Pack/MPI_Unpack. The difference is 137
>> seconds for 5 calls. That's basically 27 seconds per call to MPI_Pack,
>> for packing 8 integers. I know the code and I'm positive there is
>> no way to spend 27 seconds in there.
>> Can you run your application using valgrind with the callgrind tool?
>> This will give you some basic information about where the time is
>> spent, and it will give us additional information about where to look.
>> On Mar 6, 2007, at 11:26 AM, Michael wrote:
>>> I have a section of code where I need to send 8 separate integers via
>>> Initially I was just putting the 8 integers into an array and then
>>> sending that array.
>>> I just tried using MPI_PACK on those 8 integers and I'm seeing a
>>> massive slowdown in the code. I have a lot of other communication
>>> and this section is being used only 5 times. I went from 140
>>> to 277 seconds on 16 processors using TCP via a dual gigabit
>>> setup (I'm the only user working on this system today).
>>> This was run with OpenMPI 1.1.2 to maintain compatibility with a
>>> major HPC site.
>>> Is there a known problem with MPI_PACK/UNPACK in OpenMPI?
>>> users mailing list
>> "Half of what I say is meaningless; but I say it so that the other
>> half may reach you"
>> Kahlil Gibran