Have you thought about trying out MPI_Scatter/Gather and at least seeing how efficient the internal algorithms are?
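To illustrate the suggestion, a minimal sketch of replacing hand-rolled master-to-slave sends with MPI_Scatter — the chunk size, root rank, and datatype here are assumptions for illustration, not taken from the original code:

```c
/* Hypothetical sketch: one chunk per rank via MPI_Scatter, letting the
 * MPI library pick its internal distribution algorithm (tree, pipeline,
 * ...) instead of N hand-coded point-to-point sends. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024  /* elements per rank; an assumed size */

int main(int argc, char **argv)
{
    int rank, size;
    double *sendbuf = NULL;
    double recvbuf[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)  /* master owns the full array */
        sendbuf = malloc((size_t)size * CHUNK * sizeof(double));

    /* Every rank (master included) receives its CHUNK-element slice. */
    MPI_Scatter(sendbuf, CHUNK, MPI_DOUBLE,
                recvbuf, CHUNK, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* ... each rank works on recvbuf; results could flow back
     * with the symmetric MPI_Gather call ... */

    free(sendbuf);
    MPI_Finalize();
    return 0;
}
```

The point of the experiment is that the collective's built-in algorithms are already tuned per network layer, so timing this against the hand-written version shows how much headroom is left.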
If you are always going to be running on the same platform and want to tune-n-tweak for that, then have at it. If you are going to run this code on different platforms w/ different network architectures then I would be concerned about the performance "portability". In other words a solution that ran well on one cluster may not run well on another, due to a number of factors.
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Toon Knapen
Sent: Monday, January 31, 2011 5:05 AM
To: Open MPI Users
Subject: Re: [OMPI users] maximising bandwidth
So when you say you want your master to send "as fast as possible", I suppose you mean getting back to running your code as soon as possible. In that case you would want nonblocking. However, when you say you want the slaves to receive data faster, it seems you're referring to the actual data transmission across the network. I believe the data transmission speed does not depend on whether the send is blocking or nonblocking.
Sorry, I did not express myself clearly. By 'as fast as possible' I meant that I want all data available in my slave nodes ASAP. The master has nothing to do but send, so I do not care whether the sends are blocking or non-blocking. Actually, the master will use separate threads for the sending anyway, so I either launch a thread per blocking send or just one thread to do all the sending using nonblocking sends.
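The "one thread doing all the sending" variant could be sketched as below; the message length, tag, and the assumption that slaves occupy ranks 1..nslaves are all made up for illustration:

```c
/* Sketch: a single thread posts one nonblocking MPI_Isend per slave,
 * then waits on all of them with MPI_Waitall. The library is free to
 * progress and overlap the transfers on the wire as it sees fit. */
#include <mpi.h>
#include <stdlib.h>

#define MSG_LEN 4096  /* elements per slave; an assumed size */
#define TAG 42        /* arbitrary message tag */

static void send_to_all_slaves(const double *data, int nslaves)
{
    MPI_Request *reqs = malloc((size_t)nslaves * sizeof(MPI_Request));
    int i;

    for (i = 0; i < nslaves; ++i)
        MPI_Isend(data + (size_t)i * MSG_LEN, MSG_LEN, MPI_DOUBLE,
                  i + 1 /* slave ranks assumed to be 1..nslaves */,
                  TAG, MPI_COMM_WORLD, &reqs[i]);

    /* Block until every transfer has completed; the send buffers must
     * stay untouched until this returns. */
    MPI_Waitall(nslaves, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```

Note that the buffer passed to MPI_Isend must remain valid and unmodified until MPI_Waitall completes, which matters if the master reuses its staging buffer between rounds.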
I do think there is plenty of reason for a difference (in the timing for receiving the data in the slaves). If Open MPI is not able to offload the sending to some dedicated card (which is probably my situation, since I'm on stock Linux with stock Ethernet cards) and it tries to transmit the data requested by multiple nonblocking sends simultaneously, then Open MPI itself probably needs to multi-thread the sending of each message.
Well, I do not know anything about the internals of Open MPI, so I actually have no clue how Open MPI would do this and how it would try to optimise the use of bandwidth on the network.