What support does Open MPI have for allowing a job to continue
running when one or more of its processes die unexpectedly? Is there
a special mpirun flag for this, or some other way to do it?
It seems obvious that collectives on the original communicator will
fail once a process dies, but would it be possible to create a new
group that excludes the dead processes (assuming you know which ranks
are dead) and then turn this group into a working communicator?
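
To make the idea concrete, here is a rough, untested sketch of what I have in mind. The names surviving_ranks, make_survivor_comm and the USE_MPI guard are my own placeholders, not anything from Open MPI. The first function is just bookkeeping: it lists the ranks that are still alive. The guarded part uses standard MPI calls (MPI_Comm_group, MPI_Group_incl, and the MPI-3 call MPI_Comm_create_group) to try to build a communicator from that list. Whether these calls can actually succeed in Open MPI after a process has died is exactly what I am asking.

```c
#include <assert.h>

/* Fill `alive` with every rank in [0, world_size) that is not listed in
 * `dead`, in ascending order, and return how many ranks were written.
 * `alive` must have room for world_size entries. */
int surviving_ranks(int world_size, const int *dead, int n_dead, int *alive)
{
    assert(world_size >= 0 && n_dead >= 0);
    int n_alive = 0;
    for (int r = 0; r < world_size; r++) {
        int is_dead = 0;
        for (int i = 0; i < n_dead; i++) {
            if (dead[i] == r) { is_dead = 1; break; }
        }
        if (!is_dead)
            alive[n_alive++] = r;
    }
    return n_alive;
}

#ifdef USE_MPI  /* hypothetical guard: build with mpicc -DUSE_MPI */
#include <mpi.h>

/* Try to build a communicator containing only the surviving ranks.
 * Returns an MPI error code. Assumption: the library can still service
 * these calls after a peer has died, which is the open question. */
int make_survivor_comm(const int *alive, int n_alive, MPI_Comm *newcomm)
{
    MPI_Group world_group, alive_group;
    int rc;

    /* Ask for error codes instead of the default abort-on-error. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    if (rc != MPI_SUCCESS) return rc;

    rc = MPI_Group_incl(world_group, n_alive, alive, &alive_group);
    if (rc != MPI_SUCCESS) { MPI_Group_free(&world_group); return rc; }

    /* Collective only over the members of alive_group, unlike
     * MPI_Comm_create, which involves every rank of the parent. */
    rc = MPI_Comm_create_group(MPI_COMM_WORLD, alive_group, 0, newcomm);

    MPI_Group_free(&alive_group);
    MPI_Group_free(&world_group);
    return rc;
}
#endif
```

Each surviving rank would call surviving_ranks with the same dead list and then make_survivor_comm. I picked MPI_Comm_create_group over MPI_Comm_create because the latter needs every rank of the parent communicator to participate, dead ones included.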