I am using openmpi-1.3b3r20000 and blcr-0.7.3 to run my application on 2 nodes. I have configured Open MPI to run on 2 nodes by default.
I want to use the checkpoint/restart functionality, so I configured Open MPI with this command:
# ./configure --with-devel-headers --with-ft=cr --with-blcr=<path_to_blcr>
First: The application runs fine with the command "mpirun -np 4 <my_app>", but after a checkpoint I can't restart it. The error returned is a bus error (signal 7). To fix it, you told me to add "-mca btl ^sm" to mpirun, and with that it runs well. But I would like to know why.
Second: I can't checkpoint the application with the --term option. The checkpoint command does not return; the snapshot is created, but it is not returned to localhost. The daemon on the remote node died before the local snapshot was returned, but the processes on localhost did not die.
Third: After I restart an application, I can't checkpoint it again. The checkpoint command does not return, and the restarted process dies with signal 13 (Broken Pipe).
This is my first experience with checkpoint/restart, and I can't understand why these things happen. Please help me.