On May 11, 2010, at 9:18, Gijsbert Wiesenekker wrote:
An OpenMPI program of mine that uses MPI_Isend and MPI_Irecv crashes my Fedora Linux kernel (invalid opcode) after some non-reproducible time, which makes it hard to debug (there is no trace, even with the debug kernel, and if I run it under valgrind it does not crash).
My guess is that the kernel crash is caused by OpenMPI running out of memory because too many MPI_Isend messages have been sent but not yet processed.
My questions are:
What does the OpenMPI specification say about the behaviour of MPI_Isend when many messages have been sent but have not been processed yet? Will it fail? Will it block until more memory becomes available (I hope not, because this would cause my program to deadlock)?
Ideally I would like to check how many MPI_Isend messages have not been processed yet, so that I can stop sending messages if there are 'too many' waiting. Is there a way to do this?
I want to let you know that this crash (you get invalid opcode: 0000  SMP painted on your screen) is specific to the combination of Fedora 12 kernel version 126.96.36.199-99.fc12.x86_64, OpenMPI 1.4.2, a lot of MPI_Isend and MPI_Irecv calls, and perhaps my hardware. The same code on CentOS 5.4 kernel version 2.6.18-164.15.1.el5 runs fine.