Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart using different nodes
From: Jonathan Ferland (jonathan.ferland_at_[hidden])
Date: 2009-12-09 12:14:19


Hi Josh,

Thanks for helping. That solved the problem!!!

cheers,

Jonathan
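
For anyone hitting the same segfault: a quick way to see whether prelinking is switched on for a node is to read the prelink configuration file. The sketch below assumes a Red Hat style layout where the setting lives in `/etc/sysconfig/prelink` as a `PRELINKING=yes` or `PRELINKING=no` line. That path and the helper name `prelinking_enabled` are assumptions, not something stated in this thread, so the BLCR FAQ entry quoted below remains the authoritative reference.

```python
from pathlib import Path

# Assumed location of the prelink settings (Red Hat style systems);
# adjust for your distribution.
PRELINK_CONFIG = Path("/etc/sysconfig/prelink")


def prelinking_enabled(config_text):
    """Return True if the last PRELINKING= line in the text is 'yes'."""
    enabled = False
    for raw in config_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.startswith("PRELINKING="):
            value = line.split("=", 1)[1].strip().strip("'\"").lower()
            enabled = value == "yes"
    return enabled


if __name__ == "__main__":
    if PRELINK_CONFIG.exists():
        state = prelinking_enabled(PRELINK_CONFIG.read_text())
        print("prelinking enabled" if state else "prelinking disabled")
    else:
        print("no prelink config found at", PRELINK_CONFIG)
```

Running this on both the checkpoint node and the restart node gives a first hint whether the two nodes are configured the same way.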

Josh Hursey wrote:
> So I tried to reproduce this problem today, and everything worked fine
> for me using the trunk. I haven't tested v1.3/v1.4 yet.
>
> I tried checkpointing with one hostfile then restarting with each of
> the following:
> - No hostfile
> - a hostfile with completely different machines
> - a hostfile with the same machines in the opposite order
>
>
> I suspect that the problem is not with Open MPI, but your system
> interacting with BLCR. Usually when people cannot restart on a
> different node they have problems with the 'prelink' feature on Linux.
> BLCR has a FAQ item on this:
> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> So if this is your problem then you will probably not be able to
> checkpoint a single-process (non-MPI) application on one node and
> restart it on another either. Sorry I didn't mention it before; it
> must have slipped my mind.
>
> If this turns out not to be the problem, let me know and I'll take
> another look. Also send me any error messages that are displayed.
>
> -- Josh
>
>
> On Dec 8, 2009, at 1:39 PM, Jonathan Ferland wrote:
>
>> I ran the same test with 1.3.4 and hit the same issue. I also tried
>> the tm interface instead of specifying a hostfile, with the same
>> result.
>>
>> thanks,
>>
>> Jonathan
>>
>> Josh Hursey wrote:
>>> Though I do not test this scenario (using hostfiles) very often, it
>>> used to work. The ompi-restart command takes a --hostfile (or
>>> --machinefile) argument that is passed directly to the mpirun
>>> command. I wonder if something broke recently with this handoff. I
>>> can certainly checkpoint with one set of nodes/allocation and
>>> restart with another, but most/all of my testing occurs in a SLURM
>>> environment, so no need for an explicit hostfile.
>>>
>>> I'll take a look to see if I can reproduce, but probably will not be
>>> until next week.
>>>
>>> -- Josh
>>>
>>> On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to use BLCR checkpointing with Open MPI. I am currently
>>>> able to run my application using some hostfile, checkpoint the run,
>>>> and then restart the application using the same hostfile. What I
>>>> would like to do is restart the application with a different
>>>> hostfile, but this leads to a segfault using 1.3.3.
>>>>
>>>> Is it possible to restart the application using a different
>>>> hostfile? We are using PBS to create the hostfile, so each new
>>>> restart might be on different nodes. If so, how can we do that? If
>>>> not, do you plan to include this in a future release?
>>>>
>>>> thanks
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

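For reference, the three restart variants Josh lists above (no hostfile, a hostfile with different machines, and the same machines in the opposite order) can be sketched as command lines built in Python. This is only a sketch under stated assumptions: the `--hostfile` option of `ompi-restart` is taken from the thread, while the snapshot name `ompi_global_snapshot_1234.ckpt`, the hostfile names, and the helper functions are hypothetical placeholders for illustration.

```python
def restart_command(snapshot, hostfile=None):
    """Build an ompi-restart argument list, with an optional hostfile."""
    argv = ["ompi-restart"]
    if hostfile is not None:
        # --hostfile is handed on to mpirun, as described in the thread.
        argv += ["--hostfile", hostfile]
    argv.append(snapshot)
    return argv


def reversed_hosts(hostfile_text):
    """Return the hostfile contents with the machines in opposite order."""
    hosts = [
        line.strip()
        for line in hostfile_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(reversed(hosts)) + "\n"


if __name__ == "__main__":
    # Placeholder snapshot handle and hostfile names (assumptions).
    snapshot = "ompi_global_snapshot_1234.ckpt"
    for hostfile in (None, "hosts.other", "hosts.reversed"):
        print(" ".join(restart_command(snapshot, hostfile)))
```

The commands are only printed, not executed, so the sketch can be tried on a machine without Open MPI installed.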
-- 
--------------------------------------------------------------
Jonathan Ferland, analyste en calcul scientifique
RQCHP (Réseau québécois de calcul de haute performance)
bureau S-252, pavillon Roger-Gaudry, Université de Montréal
téléphone   : 514 343-6111 poste 8852
télécopieur : 514 343-2155
--------------------------------------------------------------