
Open MPI User's Mailing List Archives


From: Tim Prins (tprins_at_[hidden])
Date: 2007-08-14 10:19:49


Guillaume THOMAS-COLLIGNON wrote:
> Hi,
>
> I wrote an application that works fine on a small number of nodes
> (e.g. 4), but crashes on a large number of CPUs.
>
> In this application, all the slaves send many small messages to the
> master. I use the regular MPI_Send, and since the messages are
> relatively small (1 int, then many times 3296 ints), Open MPI does a
> very good job of sending them asynchronously, and it maxes out the
> gigabit link on the master node. I'm very happy with this behaviour;
> it gives me the same performance as if I were doing all the
> asynchronous work myself, and the code remains simple.
>
> But it crashes when there are too many slaves.
How many is too many? I successfully ran your code on 96 nodes with 4
processes per node, and it seemed to work fine. Also, what network are
you using?

> So it looks like at
> some point the master node runs out of buffers and the job crashes
> brutally.
What do you mean by crashing? Is there a segfault or an error message?

Tim

> That's my understanding but I may be wrong.
> If I use explicit synchronous sends (MPI_Ssend), it does not crash
> anymore but the performance is a lot lower.
>
> I have 2 questions regarding this:
>
> 1) What kind of tuning would help handle more messages and keep the
> master from crashing?
>
> 2) Is this the expected behaviour? I don't think my code is doing
> anything wrong, so I would not expect a brutal crash.
>
>
> The workaround I've found so far is to do an MPI_Ssend for the
> request, then use MPI_Send for the data blocks. So all the slaves are
> blocked on the request, it keeps the master from being flooded, and
> the performance is still good. But nothing tells me it won't crash at
> some point if I have more data blocks in my real code, so I'd like to
> know more about what's happening here.
>
> Thanks,
>
> -Guillaume
>
>
> Here is the code, so you get a better idea of the communication
> scheme, or if someone wants to reproduce the problem.
>
>
> #include <stdio.h>
> #include <stdlib.h>
>
> #include <mpi.h>
>
> #define BLOCKSIZE 3296
> #define MAXBLOCKS 1000
> #define NLOOP 4
>
> int main (int argc, char **argv) {
>   int i, j, ier, rank, npes, slave, request;
>   int *data;
>   MPI_Status status;
>
>   MPI_Init (&argc, &argv);
>   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>   MPI_Comm_size (MPI_COMM_WORLD, &npes);
>
>   if ((data = (int *) calloc (BLOCKSIZE, sizeof (int))) == NULL)
>     return -10;
>
>   // Master
>   if (rank == 0) {
>     // Expect (NLOOP * number of slaves) requests
>     for (i = 0; i < (npes - 1) * NLOOP; i++) {
>       /* Wait for a request from any slave. Request contains the
>          number of data blocks */
>       ier = MPI_Recv (&request, 1, MPI_INT, MPI_ANY_SOURCE, 964,
>                       MPI_COMM_WORLD, &status);
>       if (ier != MPI_SUCCESS)
>         return -1;
>       slave = status.MPI_SOURCE;
>       printf ("Master : request for %d blocks from slave %d\n",
>               request, slave);
>
>       /* Receive the data blocks from this slave */
>       for (j = 0; j < request; j++) {
>         ier = MPI_Recv (data, BLOCKSIZE, MPI_INT, slave, 993,
>                         MPI_COMM_WORLD, &status);
>         if (ier != MPI_SUCCESS)
>           return -2;
>       }
>     }
>   }
>   // Slaves
>   else {
>     for (i = 0; i < NLOOP; i++) {
>       /* Send the request = number of blocks we want to send to the
>          master */
>       request = MAXBLOCKS;
>       /* Changing this MPI_Send to MPI_Ssend is enough to keep the
>          master from being flooded */
>       ier = MPI_Send (&request, 1, MPI_INT, 0, 964, MPI_COMM_WORLD);
>       if (ier != MPI_SUCCESS)
>         return -3;
>       /* Send the data blocks */
>       for (j = 0; j < request; j++) {
>         ier = MPI_Send (data, BLOCKSIZE, MPI_INT, 0, 993, MPI_COMM_WORLD);
>         if (ier != MPI_SUCCESS)
>           return -4;
>       }
>     }
>   }
>   printf ("Node %d done\n", rank);
>   free (data);
>   MPI_Finalize ();
>   return 0;
> }
>
>