Thanks for the detailed info. In my case, I expect to spawn multiple threads from each MPI process. I could use MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED to do so; as far as I know, MPI_THREAD_MULTIPLE is not supported over InfiniBand, which is what I am using. Currently I use OpenMPI + Boost::Thread, and I have no plans to switch to Boost::MPI yet.
I still have a couple of questions to ask:
1. In both MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED modes, only one thread can be inside an MPI call at any given time. In the former, only the main thread of each rank may make MPI calls; in the latter, any thread may make them, but the threads must coordinate so that only one does so at a time. Are there any performance implications in choosing between FUNNELED and SERIALIZED?
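To make the SERIALIZED case concrete, here is a minimal sketch of the coordination I have in mind. It uses std::thread instead of Boost::Thread so that it compiles on its own, and a plain counter stands in for the actual MPI call; the names run_serialized and mpi_mutex are made up for illustration and are not part of any MPI API.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock that every thread takes before "calling MPI".
std::mutex mpi_mutex;

// Each worker alternates local computation with a serialized "MPI call".
// Returns the total number of stand-in calls made by all threads.
int run_serialized(int num_threads, int calls_per_thread) {
    int total_calls = 0;  // only touched while mpi_mutex is held
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total_calls, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                // ... local computation would run here, outside the lock ...
                std::lock_guard<std::mutex> lock(mpi_mutex);
                ++total_calls;  // stand-in for e.g. MPI_Send (assumption)
            }
        });
    }
    for (auto& w : workers) w.join();
    return total_calls;
}
```

Calling run_serialized(4, 100) performs 400 stand-in calls, never two at once. With FUNNELED the lock would not be needed, since only the main thread would make the calls.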
2. My current code uses many MPI collective calls (gather, scatter, broadcast, etc.). These calls seem to hurt performance because ALL MPI processes have to wait at each of them. I would like to explore decoupling computation from MPI communication, so that if one thread of an MPI rank is blocked in an MPI call, the other threads can still make progress. Could the other, non-blocked threads still make MPI calls in MPI_THREAD_FUNNELED or MPI_THREAD_SERIALIZED mode while that thread is blocked (assuming the blocked thread is the main thread of the rank)?
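The decoupling I am picturing looks roughly like the sketch below: worker threads only compute and hand their results to the main thread through a queue, and the main thread is the only one that would talk to MPI (the FUNNELED pattern). This is just an illustration under my own assumptions. OutBox and run_funneled are hypothetical names, std::thread replaces Boost::Thread, and adding up the values stands in for the real send.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Thread-safe queue carrying results from the workers to the main thread.
class OutBox {
public:
    void push(int value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(value);
        }
        cv_.notify_one();
    }
    int pop() {  // blocks until a value is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

// Workers produce values; the calling (main) thread drains the queue.
// Returns the sum of everything the main thread "sent".
long run_funneled(int num_workers, int items_per_worker) {
    OutBox outbox;
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&outbox, items_per_worker] {
            for (int i = 1; i <= items_per_worker; ++i)
                outbox.push(i);  // result of some computation (stand-in)
        });
    }
    long sent = 0;
    for (int n = 0; n < num_workers * items_per_worker; ++n)
        sent += outbox.pop();  // a real program would make its MPI call here
    for (auto& t : workers) t.join();
    return sent;
}
```

In this sketch the workers never touch MPI themselves. What I am unsure about is what happens when the main thread is stuck in a blocking collective: the queue just fills up until it returns, which is why I am asking whether the other threads may call MPI in the meantime.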
Any advice is highly appreciated!