I'm trying the newly released Open MPI 1.3 together with BLCR to
provide checkpoint/restart support.
I configured it with:
    ./configure --prefix=/usr/local --with-ft=cr \
        --enable-ft-thread --enable-mpi-threads --with-blcr=/
An MPI job on a single machine (several threads) checkpoints and
restarts without problems.
Checkpointing an MPI job across two hosts (Ethernet, TCP) also
completes without warnings or errors (the home directory and the
directory containing the MPI application are shared over NFS). The
restart works too, but all threads are started only on the host where I
enter the ompi-restart command. Even if I pass the -hostfile argument to
ompi-restart, only that one host is used.
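For reference, the sequence I mean looks roughly like the sketch below. The application name, process count, hostfile name and PID are placeholders for illustration, not the exact values from my run, and the snapshot name assumes the default ompi_global_snapshot_<PID>.ckpt naming.

```shell
# Start the job across both hosts with checkpoint/restart support enabled.
# ./my_mpi_app, hosts.txt and -np 4 are placeholders.
mpirun -np 4 -hostfile hosts.txt -am ft-enable-cr ./my_mpi_app

# From a second shell: checkpoint the job and terminate it.
# 12345 stands for the PID of the mpirun process above.
ompi-checkpoint --term 12345

# Restart from the global snapshot, asking for the same hosts again.
# This is the step where everything ends up on the local host only.
ompi-restart -hostfile hosts.txt ompi_global_snapshot_12345.ckpt
```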
Does anybody have a hint?