Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart using different nodes
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-12-09 11:28:54

So I tried to reproduce this problem today, and everything worked fine
for me using the trunk. I haven't tested v1.3/v1.4 yet.

I tried checkpointing with one hostfile then restarting with each of
the following:
  - No hostfile
  - a hostfile with completely different machines
  - a hostfile with the same machines in the opposite order
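
For reference, the three restart cases above can be sketched roughly as
follows. This is only an illustration: the hostfile names, the snapshot
name, and the helper functions are placeholders, not anything taken from
the actual test. The launch and checkpoint commands in the comments follow
the Open MPI checkpoint/restart documentation as best recalled, so check
them against your installed version. By default the script only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch only. hosts_other, hosts_reversed and the snapshot name are
# placeholders. Set DRY_RUN=0 to execute instead of printing.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart a global snapshot, optionally with a hostfile.
#   $1 = global snapshot reference, $2 = hostfile (optional)
restart_snapshot() {
    if [ -n "${2:-}" ]; then
        run ompi-restart --hostfile "$2" "$1"
    else
        run ompi-restart "$1"
    fi
}

# The job is assumed to have been started and checkpointed by hand, e.g.:
#   mpirun -am ft-enable-cr --hostfile hosts_a -np 4 ./my_app &
#   ompi-checkpoint <PID of mpirun>
SNAPSHOT="${SNAPSHOT:-ompi_global_snapshot_1234.ckpt}"   # placeholder name

restart_snapshot "$SNAPSHOT"                  # no hostfile
restart_snapshot "$SNAPSHOT" hosts_other      # completely different machines
restart_snapshot "$SNAPSHOT" hosts_reversed   # same machines, opposite order
```

In practice you would run one restart at a time and compare the behaviour
across the three cases.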

I suspect that the problem is not with Open MPI, but with how your
system interacts with BLCR. Usually when people cannot restart on a
different node, the cause is the 'prelink' feature on Linux.
BLCR has a FAQ item on this:

If this is your problem, then you will probably also be unable to
checkpoint a single-process (non-MPI) application on one node and
restart it on another. Sorry I didn't mention it before; it must have
slipped my mind.
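
One way to check for this, as a minimal sketch: the function below looks
for an enabled prelink setting in a config file. The default path
/etc/sysconfig/prelink and the PRELINKING=yes key are assumptions based on
Red Hat style systems; other distributions may keep this setting elsewhere.
The BLCR commands in the trailing comments are meant to be run by hand, and
the application name is a placeholder.

```shell
#!/bin/sh
# Return 0 if the given prelink config file enables prelinking.
# Assumption: the file uses a PRELINKING=yes|no line (Red Hat style).
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    [ -r "$conf" ] && grep -Eq '^[[:space:]]*PRELINKING="?yes"?' "$conf"
}

if prelink_enabled; then
    echo "prelink appears to be enabled on $(uname -n)"
else
    echo "prelink not enabled (or config file not found) on $(uname -n)"
fi

# Single-process BLCR test, independent of Open MPI (run by hand):
#   node A:  cr_run ./my_serial_app &
#            cr_checkpoint <PID>        # writes context.<PID> by default
#   node B:  cr_restart context.<PID>
# If this restart fails on node B as well, the problem is likely
# below Open MPI.
```

Run the check on every node that may take part in a restart, since the
setting can differ from machine to machine.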

If this turns out to not be the problem, let me know and I'll take
another look. Also send me any error messages that are displayed.

-- Josh

On Dec 8, 2009, at 1:39 PM, Jonathan Ferland wrote:

> I did the same test using 1.3.4 and still the same issue.... I also
> tried to use the tm interface instead of specifying the hostfile,
> same result.
> thanks,
> Jonathan
> Josh Hursey wrote:
>> Though I do not test this scenario (using hostfiles) very often, it
>> used to work. The ompi-restart command takes a --hostfile (or
>> --machinefile) argument that is passed directly to the mpirun
>> command. I wonder if something broke recently with this handoff. I
>> can certainly checkpoint with one set of nodes/allocation and
>> restart with another, but most/all of my testing occurs in a SLURM
>> environment, so no need for an explicit hostfile.
>> I'll take a look to see if I can reproduce, but probably will not
>> be until next week.
>> -- Josh
>> On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:
>>> Hi,
>>> I am trying to use BLCR checkpointing with MPI. I am currently able
>>> to run my application using some hostfile, checkpoint the run, and
>>> then restart the application using the same hostfile. The thing I
>>> would like to do is to restart the application with a different
>>> hostfile. But this leads to a segfault using 1.3.3.
>>> Is it possible to restart the application using a different
>>> hostfile? We are using PBS to create the hostfile, so each new
>>> restart might be on different nodes. If so, how can we do that? If
>>> not, do you plan to include this in a future release?
>>> thanks
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
> --
> --------------------------------------------------------------
> Jonathan Ferland, analyste en calcul scientifique
> RQCHP (Réseau québécois de calcul de haute performance)
> bureau S-252, pavillon Roger-Gaudry, Université de Montréal
> téléphone : 514 343-6111 poste 8852
> télécopieur : 514 343-2155
> --------------------------------------------------------------