Open MPI User's Mailing List Archives

Subject: [OMPI users] crashing on MPI_SEND -- program exited with code 021, when ~1200 cores
From: Timothy G Thompson (Timothy.G.Thompson_at_[hidden])
Date: 2010-02-01 14:10:00


Hello,

I've been using Open MPI (version 1.3.2) for some time, but only
recently have I had access to more than 1000 cores.
My code runs fine on 1000 cores but fails when I try to use 1200
cores.

The only information at the time of the crash is: <program exited with
code 021>.

From the debugger I know the crash occurs in an MPI_Send call.
After inserting printf diagnostics, I also know the following.

I have a master/slave application with a 'synchronization' step that
occurs during initialization.
The master uses MPI_Send to send a single integer to each of the
slaves.
I see most of the slaves print a diagnostic and then wait in
MPI_Recv.

Then I see the master finally reach the 'home-grown broadcast' and
start issuing an MPI_Send to each slave.
After (in this case) 1019 sends, the crash occurs.

I'm looking for information on the cause (my guess is that some kind
of message-passing buffer is being overrun) and for hints on how to
avoid this type of bottleneck when running on clusters with thousands
of cores.

Thanks!
Tim Thompson