Subject: Re: [OMPI users] Using a rankfile for ompi-restart
From: Fernando Lemos (fernandotcl_at_[hidden])
Date: 2010-05-21 08:45:20

On Tue, May 18, 2010 at 3:53 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:
>> I've noticed that ompi-restart doesn't support the --rankfile option.
>> It only supports --hostfile/--machinefile. Is there any reason
>> --rankfile isn't supported?
>> Suppose you have a cluster without a shared file system. When one node
>> fails, you transfer its checkpoint to a spare node and invoke
>> ompi-restart. In 1.5, ompi-restart automagically handles this
>> situation (if you supply a hostfile) and is able to restart the
>> process, but I'm afraid it might not always be able to find the
>> checkpoints this way. If you could specify to ompi-restart where the
>> ranks are (and thus where the checkpoints are), then maybe restart
>> would always work (as long as you've specified the location of
>> the checkpoints correctly), or maybe ompi-restart would be faster.
> We can easily add the --rankfile option to ompi-restart. I filed a ticket to
> add this option, and assess if there are other options that we should pass
> along (e.g., --npernode, --byhost). I should be able to fix this in the next
> week or so, but the ticket is linked below so you can follow the progress.

Nice, thanks!
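
For reference, this is roughly the kind of rankfile I have in mind. It's a
minimal sketch, assuming the "rank N=host slot=S" rankfile syntax that mpirun
accepts. The host names and the build_rankfile helper are made up for
illustration, and using the same slot for every rank is only a simplification.

```python
def build_rankfile(rank_to_host, slot=0):
    """Return rankfile text that places each rank on the given host.

    rank_to_host: dict mapping an MPI rank (int) to a host name (str).
    slot: slot index written for every rank. This is a simplification;
          ranks sharing a host would normally need distinct slots.
    """
    lines = []
    for rank in sorted(rank_to_host):
        lines.append(f"rank {rank}={rank_to_host[rank]} slot={slot}")
    return "\n".join(lines) + "\n"


# Hypothetical layout: the node that ran rank 1 failed, and its checkpoint
# was copied to a spare node called "spare1".
placement = {0: "node0", 1: "spare1", 2: "node2"}
print(build_rankfile(placement), end="")
```

The point is that the rankfile would tell ompi-restart which node holds
which rank's checkpoint, instead of leaving it to guess from a hostfile.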

> Most of the ompi-restart parameters are passed directly to the mpirun
> command. ompi-restart is mostly a wrapper around mpirun that is able to
> parse the metadata and create the appcontext file. I wonder if a more
> general parameter like '--mpirun-args ...' might make sense so users don't
> have to wait on me to expose the interface they need.
> Donno. What do you think? Should I create a '--mpirun-args' option or
> duplicate all of the mpirun command line parameters, or some combination of
> the two.

Well, I think an --mpirun-args argument would be even more useful, as
it's hard to foresee how ompi-restart is gonna be used. Maybe a
combination of the two would be ideal, since some options are going to
be used very often (e.g., --hostfile, --host).
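
To make the combination concrete, here is a rough Python sketch. It is not
the actual ompi-restart implementation, and '--mpirun-args' is only the
option proposed above, not something that exists today. The
build_mpirun_command helper is hypothetical. The idea is that frequently used
options get their own flags, and everything else is passed through to mpirun
untouched.

```python
import argparse
import shlex


def build_mpirun_command(argv):
    """Translate wrapper arguments into an mpirun argument list.

    Returns (cmd, snapshot). Parsing the snapshot metadata and creating
    the appcontext file are out of scope for this sketch.
    """
    parser = argparse.ArgumentParser(prog="ompi-restart-sketch")
    # A frequently used option, exposed directly by the wrapper.
    parser.add_argument("--hostfile", "--machinefile", dest="hostfile")
    # Proposed catch-all: one string of extra arguments for mpirun.
    # Use the --mpirun-args=... form so argparse does not mistake the
    # value for another option.
    parser.add_argument("--mpirun-args", default="")
    parser.add_argument("snapshot")
    opts = parser.parse_args(argv)

    cmd = ["mpirun"]
    if opts.hostfile:
        cmd += ["--hostfile", opts.hostfile]
    # shlex.split honors shell-style quoting inside the string.
    cmd += shlex.split(opts.mpirun_args)
    return cmd, opts.snapshot


# Hypothetical invocation: the file and snapshot names are placeholders.
cmd, snapshot = build_mpirun_command(
    ["--hostfile", "hosts.txt",
     "--mpirun-args=--rankfile my_rankfile",
     "my_snapshot"]
)
print(cmd)
```

With something like this, --rankfile would have worked from day one through
'--mpirun-args', while --hostfile keeps its dedicated flag.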