In case it helps, I am running Open MPI 1.3.3, compiled as follows:
../configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
--with-blcr=... --with-blcr-libdir=... --disable-openib-rdmacm --prefix=....
I ran my application like this:
mpirun -am ft-enable-cr --hostfile host -np 2 ./a.out
where host contains:
This way it works if I checkpoint and then restart with:
ompi-restart -hostfile host ompi_global_snapshot_....ckpt
but if I then change the host to (just swapping nodes):
then it crashes...
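For context, this is roughly what the restart step would look like inside a PBS job. It is only a sketch of the intended workflow, not a fix for the crash above. It assumes PBS sets PBS_NODEFILE with one line per allocated slot. The helper names make_hostfile and restart_cmd, the DRY_RUN switch, the SNAPSHOT variable, and the host.restart file name are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: rebuild a hostfile from the PBS allocation, then restart.
# Assumption: PBS_NODEFILE lists one node name per allocated slot.

# Collapse repeated node names in $1 into "node slots=N" lines in $2,
# keeping the order in which nodes first appear.
make_hostfile() {
    awk '!($0 in n) { order[++k] = $0 }
         { n[$0]++ }
         END { for (i = 1; i <= k; i++)
                   printf "%s slots=%d\n", order[i], n[order[i]] }' "$1" > "$2"
}

# Restart from snapshot $2 using hostfile $1.
# With DRY_RUN=1 the command is printed instead of executed.
restart_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ompi-restart --hostfile $1 $2"
    else
        ompi-restart --hostfile "$1" "$2"
    fi
}

# Only act when running inside a PBS job with a snapshot name supplied.
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${SNAPSHOT:-}" ]; then
    make_hostfile "$PBS_NODEFILE" host.restart
    restart_cmd host.restart "$SNAPSHOT"
fi
```

Here SNAPSHOT would be set to the ompi_global_snapshot_....ckpt reference from the checkpoint. Since each PBS job may land on different nodes, the hostfile is regenerated on every restart, which is exactly the case that crashes for me.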
Josh Hursey wrote:
> Though I do not test this scenario (using hostfiles) very often, it
> used to work. The ompi-restart command takes a --hostfile (or
> --machinefile) argument that is passed directly to the mpirun command.
> I wonder if something broke recently with this handoff. I can
> certainly checkpoint with one set of nodes/allocation and restart with
> another, but most/all of my testing occurs in a SLURM environment, so
> no need for an explicit hostfile.
> I'll take a look to see if I can reproduce, but probably will not be
> until next week.
> -- Josh
> On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:
>> I am trying to use BLCR checkpointing with MPI. I am currently able to
>> run my application using some hostfile, checkpoint the run, and then
>> restart the application using the same hostfile. The thing I would
>> like to do is to restart the application with a different hostfile.
>> But this leads to a segfault using 1.3.3.
>> Is it possible to restart the application using a different hostfile?
>> We are using PBS to create the hostfile, so each new restart might
>> be on different nodes. If so, how can we do that? If not, do you plan
>> to include this in a future release?