(Sorry for the delay in replying, more below)
On Apr 8, 2010, at 1:34 PM, Fernando Lemos wrote:
> I've noticed that ompi-restart doesn't support the --rankfile option.
> It only supports --hostfile/--machinefile. Is there any reason
> --rankfile isn't supported?
> Suppose you have a cluster without a shared file system. When one node
> fails, you transfer its checkpoint to a spare node and invoke
> ompi-restart. In 1.5, ompi-restart automagically handles this
> situation (if you supply a hostfile) and is able to restart the
> process, but I'm afraid it might not always be able to find the
> checkpoints this way. If you could specify to ompi-restart where the
> ranks are (and thus where the checkpoints are), then maybe restart
> would always work (as long as you've specified the location of
> the checkpoints correctly), or maybe ompi-restart would be faster.
We can easily add the --rankfile option to ompi-restart. I filed a
ticket to add this option and to assess whether there are other options
that we should pass along (e.g., --npernode, --byhost). I should be able to
fix this in the next week or so, but the ticket is linked below so you
can follow the progress.
Most of the ompi-restart parameters are passed directly to the mpirun
command. ompi-restart is mostly a wrapper around mpirun that is able
to parse the metadata and create the appcontext file. I wonder if a
more general parameter like '--mpirun-args ...' might make sense so
users don't have to wait on me to expose the interface they need.
Dunno. What do you think? Should I create a '--mpirun-args' option,
duplicate all of the mpirun command line parameters, or do some
combination of the two?
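To make the '--mpirun-args' idea concrete, here is a minimal sketch in
Python of how a wrapper could forward an opaque argument string to mpirun.
It is not ompi-restart's actual code: the function name
build_mpirun_command, the file names, and the use of '--app' to hand the
generated appfile to mpirun are all assumptions made for illustration, and
how mpirun combines an appfile with other options is not addressed here.

```python
import shlex


def build_mpirun_command(appfile, hostfile=None, mpirun_args=""):
    """Build an mpirun argument list for a restart (hypothetical helper).

    appfile     -- path to the appcontext file the wrapper generated
    hostfile    -- optional hostfile, an option the wrapper already exposes
    mpirun_args -- free-form string from a hypothetical '--mpirun-args'
                   option, forwarded to mpirun without interpretation
    """
    cmd = ["mpirun"]
    if hostfile:
        cmd += ["--hostfile", hostfile]
    # shlex.split honours shell-style quoting, so a quoted path containing
    # spaces stays a single argument.
    cmd += shlex.split(mpirun_args)
    # Assumption: the generated appfile is passed to mpirun via --app.
    cmd += ["--app", appfile]
    return cmd


if __name__ == "__main__":
    print(build_mpirun_command(
        "restart.appfile",
        hostfile="hosts.txt",
        mpirun_args="--rankfile ranks.txt",
    ))
```

The appeal of this shape is that a new mpirun option such as --rankfile
needs no change in the wrapper. The cost is that the wrapper cannot
validate what it forwards, so a mistyped option only surfaces when mpirun
rejects it.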