I've been using Open MPI (version 1.3.2) for some time, but recently have
had more than 1000 cores available.
My code runs fine with 1000 cores but fails when attempting to use 1200.
The only information at the time of the crash is: <program exited with
Within the debugger I know the crash is occurring on an MPI_Send call.
After inserting printf diagnostics, I know the following:
I have a master/slave application with a 'synchronization' step.
The master uses MPI_Send to send a single integer to each of the slaves.
I see most of the slaves print a diagnostic and then sit in the
matching MPI_Recv.
Then I see the master (finally reaching the 'home-grown broadcast')
start issuing an MPI_Send to each slave.
After (in this case) 1019 sends, the crash occurs.
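To make the pattern concrete, here is a minimal sketch of the master/slave synchronization described above. The tag, variable names, and the use of MPI_COMM_WORLD are assumptions for illustration, not taken from the actual application.

```c
/* Sketch of the 'home-grown broadcast' pattern described above.
 * SYNC_TAG and the variable names are assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, go = 1;
    const int SYNC_TAG = 0;   /* assumed tag */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: one point-to-point send of a single
         * integer to every slave in turn */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(&go, 1, MPI_INT, dest, SYNC_TAG, MPI_COMM_WORLD);
    } else {
        /* slave: print a diagnostic, then block in the
         * matching receive until the master's send arrives */
        printf("slave %d: waiting in MPI_Recv\n", rank);
        MPI_Recv(&go, 1, MPI_INT, 0, SYNC_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```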
I'm looking for information on the cause (I'm guessing some kind of
message-passing buffer is being overrun) and for hints on how to avoid
this type of bottleneck when running on clusters with multiple
thousands of cores.
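For reference, one thing I'm considering: the loop above reimplements what the standard collective MPI_Bcast does in a single call, and implementations typically use a scalable tree underneath rather than N point-to-point sends from the root. A sketch of that alternative (again with assumed names):

```c
/* Same synchronization via the standard collective: every rank,
 * master and slaves alike, makes the identical call. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, go = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        go = 1;   /* master (root) sets the value to distribute */

    /* root 0 is the master; all ranks receive the same integer */
    MPI_Bcast(&go, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```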