Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault at program end with 2+ processes
From: Prentice Bisbal (prentice_at_[hidden])
Date: 2010-05-20 11:10:26

I hope I'm not too late in my reply, and I hope I'm not repeating the
same solution others have given you.

I had a similar error in a code a few months ago. I was using
MPI_Pack/MPI_Unpack to send data between nodes, but I was allocating the
buffer using the wrong variable, so there was a buffer size mismatch
between the sending and receiving nodes.

When the program ran as a single instance, these buffers weren't really
being used, so the problem never presented itself. Trickier still, the
problem only occurred in parallel runs when the payload exceeded a
certain size (number of elements in the array, or amount of data in the
packed buffer).

I used valgrind, which didn't shed much light on the problem. I finally
found my error by tracking down the data-size dependency.
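For reference, the usual way to avoid that kind of mismatch is to let
MPI_Pack_size compute the buffer size on both ends from the same element
count, instead of sizing the buffer from a different variable than the one
passed to MPI_Pack. A minimal sketch (function and variable names here are
illustrative, not from the original code):

```cpp
#include <mpi.h>
#include <cstdlib>

// Sender sketch: the buffer size is derived from the SAME count that is
// passed to MPI_Pack. Sizing the buffer from an unrelated variable is
// exactly the mismatch described above.
void send_packed(const double *data, int count, int dest, MPI_Comm comm)
{
    int bytes = 0;
    MPI_Pack_size(count, MPI_DOUBLE, comm, &bytes); // upper bound, in bytes

    char *buf = static_cast<char *>(std::malloc(bytes));
    int pos = 0;
    MPI_Pack(data, count, MPI_DOUBLE, buf, bytes, &pos, comm);
    MPI_Send(buf, pos, MPI_PACKED, dest, /*tag=*/0, comm);
    std::free(buf);
}
```

The receiver should call MPI_Pack_size with the same count and datatype (or
probe the incoming message size) so the two ends can never disagree.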

I hope that helps.
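One other classic cause of a crash that appears only after main finishes,
as in the trace below, is an object whose destructor makes MPI calls (or
frees MPI handles) after MPI_Finalize has already run. A hedged sketch of a
guard using the standard MPI_Finalized call (the class and its member are
invented for illustration; they are not from the poster's code):

```cpp
#include <mpi.h>

// Hypothetical destructor guard: skip MPI cleanup once MPI_Finalize has
// run. Freeing an MPI_Datatype or communicator after finalization is
// undefined behaviour and a common source of end-of-run segfaults.
struct PackedBuffer {
    MPI_Datatype type = MPI_DATATYPE_NULL; // assumed member, for illustration

    ~PackedBuffer()
    {
        int finalized = 0;
        MPI_Finalized(&finalized);
        if (!finalized && type != MPI_DATATYPE_NULL)
            MPI_Type_free(&type);
    }
};
```

With only one process the destruction order often happens to be harmless,
which would match the symptom of the bug appearing only with 2+ ranks.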


Jeff Squyres wrote:
> Ouch. These are the worst kinds of bugs to find. :-(
> If you attach a debugger to these processes and step through the final death throes of the process, does it provide any additional insight? I have not infrequently done stuff like this:
> {
>     int i = 0;
>     printf("Process %d ready to attach\n", getpid());
>     while (i == 0) sleep(5);
> }
> Then you get a message indicating which pid to attach to. When you attach, set the variable i to nonzero and you can continue stepping through the process.
> On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:
>> Apologies for the vague details of the problem I'm about to describe,
>> but then I only understand it vaguely. Any pointers about the best
>> directions for further investigation would be appreciated. Lengthy
>> details follow:
>> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
>> into some weird behaviour. When run under mpiexec, a segmentation
>> fault is thrown:
>> % mpiexec -n 2 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.0695 minutes
>> [queen:23560] *** Process received signal ***
>> [queen:23560] Signal: Segmentation fault (11)
>> [queen:23560] Signal code: (128)
>> [queen:23560] Failing at address: (nil)
>> [queen:23560] [ 0] /lib64/ [0x3d6a00de80]
>> [queen:23560] [ 1] /opt/openmpi/lib/
>> [0x2afb1fa43460]
>> [queen:23560] [ 2] /opt/openmpi/lib/ [0x2afb1fa439ad]
>> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
>> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
>> [queen:23560] [ 5] /lib64/ [0x3d6941d8b4]
>> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
>> [queen:23560] *** End of error message ***
>> mpiexec noticed that job rank 1 with PID 23560 on node
>> queen.bioinformatics exited on signal 11 (Segmentation fault).
>> Right, so I've got a memory overrun or something. Except that when the
>> program is run in standalone mode, it works fine:
>> % ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05970 minutes
>> Right, so there's a difference between my standalone and MPI modes.
>> Except that the difference between my standalone and MPI versions is
>> currently nothing but the calls to MPI_Init, MPI_Finalize and some
>> exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't
>> gotten as far as coding the problem division.) Also, calling mpiexec
>> with 1 process always works:
>> % mpiexec -n 1 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05801 minutes
>> So there's still this segmentation fault. Running valgrind across the
>> program doesn't show any obvious problems: there was some quirky
>> pointer arithmetic and some huge blocks of dangling memory, but these
>> were only leaked at the end of the program (i.e. the original
>> programmer didn't bother cleaning up at program termination). I've
>> caught most of those. But the segmentation fault still occurs only
>> when run under mpiexec with 2 or more processes. And by use of
>> diagnostic printfs and logging, I can see that it only occurs at the
>> very end of the program, the very end of main, possibly when
>> destructors are being automatically called. But again this cleanup
>> doesn't cause any problems with the standalone or 1 process modes.
>> So, any ideas for where to start looking?
>> technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
>> Red Hat 4.1.2-42
>> ----
>> Paul-Michael Agapow (paul-michael.agapow (at)
>> Bioinformatics, Centre for Infections, Health Protection Agency
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]

Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ