Subject: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
From: adrian sabou (adrian.sabou_at_[hidden])
Date: 2012-01-31 11:16:56

Hi All,

I'm having a weird problem when running a very simple OpenMPI application. The application sends an integer from the rank 0 process to the rank 1 process. The sequence of code that I use to accomplish this is the following:

        if (rank == 0)
        {
                printf("Process %d - Sending...\n", rank);
                MPI_Send(&sent, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
                printf("Process %d - Sent.\n", rank);
        }
        if (rank == 1)
        {
                printf("Process %d - Receiving...\n", rank);
                MPI_Recv(&received, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
                printf("Process %d - Received.\n", rank);
        }

        printf("Process %d - Barrier reached.\n", rank);
        MPI_Barrier(MPI_COMM_WORLD);
        printf("Process %d - Barrier passed.\n", rank);

Like I said, a very simple program. When I launch this application with SLURM (using "salloc -N2 mpirun ./<my_app>"), it hangs at the barrier. However, it passes the barrier if I launch it without SLURM (using "mpirun -np 2 ./<my_app>").

I first noticed this problem when my application hung as I tried to send two successive messages from one process to another. Only the first MPI_Send would work. The second MPI_Send would block indefinitely. I was wondering whether any of you have encountered a similar problem, or may have an idea as to what is causing the Send/Receive pair to block when using SLURM.

The exact output in my console is as follows:

        salloc: Granted job allocation 1138
        Process 0 - Sending...
        Process 1 - Receiving...
        Process 1 - Received.
        Process 1 - Barrier reached.
        Process 0 - Sent.
        Process 0 - Barrier reached.
        (it just hangs here)

I am new to MPI programming and to OpenMPI and would greatly appreciate any help.
My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1), and the operating system on the cluster on which I tried to run my application is Ubuntu 10.04 LTS Server x64. If anyone is willing to help me out, I will happily provide any other info requested (as long as the request comes with instructions on how to get that info).

Your answers will be of great help! Thanks!

Adrian