
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] checkpointing multi node and multi process applications
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2010-01-25 15:51:28


Actually, let me roll that back a bit. I was preparing a custom patch
for the v1.4 series, and it turns out the code there does not have the
bug I mentioned. Only the v1.5 series and the trunk were affected;
the v1.4 series should be fine.

I will still ask that the error message fix be brought over to the
v1.4 branch, but it is unlikely to fix your problem. However, it would
be useful to know whether upgrading to the trunk or the v1.5 series
fixes the problem. The v1.4 series has an older version of the file
and metadata handling mechanisms, so I am encouraging people to move
to the v1.5 series if possible.
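For anyone hitting the same symptom, the two "expected_component" errors point at a local snapshot metadata file that is missing, empty, or malformed, so it can help to pre-check the file before filing a report. The sketch below mocks up a metadata file using the sample contents shown later in this thread (the PID is made up); in a real run, point `meta` at your `GLOBAL_SNAPSHOT_DIR/.../snapshot_meta.data` instead:

```shell
#!/bin/sh
# Sketch: classify a local snapshot metadata file as ok / empty / malformed.
# The file contents below are illustrative; substitute the real path, e.g.
#   meta=GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/opal_snapshot_0.ckpt/snapshot_meta.data
meta=$(mktemp)
cat > "$meta" <<'EOF'
#
# PID: 23915
# Component: blcr
# CONTEXT: ompi_blcr_context.23915
EOF

if [ ! -s "$meta" ]; then
    # An empty file is what typically triggers the expected_component errors.
    status="empty"
elif grep -q '^# PID:' "$meta" && grep -q '^# Component:' "$meta"; then
    # Both fields the global coordinator looks for are present.
    status="ok"
else
    status="malformed"
fi
echo "metadata check: $status"
rm -f "$meta"
```

With the mocked contents above this prints "metadata check: ok"; an empty or truncated file would report "empty" or "malformed" instead.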

-- Josh

On Jan 25, 2010, at 3:33 PM, Josh Hursey wrote:

> So while working on the error message, I noticed that the global
> coordinator was using the wrong path to investigate the checkpoint
> metadata. This particular section of code is not often exercised
> (which is probably why I could not reproduce it). I just committed a fix to
> the Open MPI development trunk:
> https://svn.open-mpi.org/trac/ompi/changeset/22479
>
> Additionally, I am asking for this to be brought over to the v1.4
> and v1.5 release branches:
> https://svn.open-mpi.org/trac/ompi/ticket/2195
> https://svn.open-mpi.org/trac/ompi/ticket/2196
>
> It seems to solve the problem, at least as far as I was able to
> reproduce it. Can you try the trunk (either an SVN checkout or
> tonight's nightly tarball) and check whether it solves your problem?
>
> Cheers,
> Josh
>
> On Jan 25, 2010, at 12:14 PM, Josh Hursey wrote:
>
>> I am not able to reproduce this problem with the v1.4 branch using a
>> hostfile and node configuration like the ones you mentioned.
>>
>> I suspect that the error is caused by a failed local checkpoint.
>> The error message is triggered when the global coordinator (located
>> in 'mpirun') tries to read the metadata written by the application
>> in the local snapshot. If the global coordinator cannot properly
>> read the metadata, then it will print a variety of error messages
>> depending on what is going wrong.
>>
>> If these are the only two errors produced, it typically means that
>> the local metadata file was found but is empty or corrupted. Can you
>> send me the contents of the local checkpoint metadata file:
>> shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/0/
>> opal_snapshot_0.ckpt/snapshot_meta.data
>>
>> It should look something like:
>> ---------------------------------
>> #
>> # PID: 23915
>> # Component: blcr
>> # CONTEXT: ompi_blcr_context.23915
>> ---------------------------------
>>
>> It may also help to see the following metadata file:
>> shell$ cat GLOBAL_SNAPSHOT_DIR/ompi_global_snapshot_YYY.ckpt/
>> global_snapshot_meta.data
>>
>>
>> If any other errors are printed by the process, they could indicate
>> a different problem, so please let me know if you see any.
>>
>> This error message should be more specific about which process
>> checkpoint is causing the problem and what it usually indicates. I
>> filed a bug to clean up the error message:
>> https://svn.open-mpi.org/trac/ompi/ticket/2190
>>
>> -- Josh
>>
>> On Jan 21, 2010, at 8:27 AM, Jean Potsam wrote:
>>
>>> Hi Josh/all,
>>>
>>> I have upgraded Open MPI to v1.4 but still get the same error
>>> when I try executing the application on multiple nodes:
>>>
>>> *******************
>>> Error: expected_component: PID information unavailable!
>>> Error: expected_component: Component Name information unavailable!
>>> *******************
>>>
>>> I am running my application from the node 'portal11' as follows:
>>>
>>> mpirun -am ft-enable-cr -np 2 --hostfile hosts myapp
>>>
>>> The file 'hosts' contains two host names: portal10, portal11.
>>>
>>> I am triggering the checkpoint using ompi-checkpoint -v 'PID' from
>>> portal11.
>>>
>>>
>>> I configured Open MPI as follows:
>>>
>>> #####################
>>>
>>> ./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-
>>> debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-
>>> stacktrace --enable-binaries --enable-trace --enable-static=yes --
>>> enable-debug --with-devel-headers=1 --with-mpi-param-check=always
>>> --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/blcr/ --
>>> with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
>>> #########################
>>>
>>> Question:
>>>
>>> what do you think can be wrong? Please instruct me on how to
>>> resolve this problem.
>>>
>>> Thank you
>>>
>>> Jean
>>>
>>>
>>>
>>>
>>> --- On Mon, 11/1/10, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>>
>>> From: Josh Hursey <jjhursey_at_[hidden]>
>>> Subject: Re: [OMPI users] checkpointing multi node and multi
>>> process applications
>>> To: "Open MPI Users" <users_at_[hidden]>
>>> Date: Monday, 11 January, 2010, 21:42
>>>
>>>
>>> On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:
>>>
>>> > Hi Everyone,
>>> > I am trying to checkpoint an MPI application running on multiple
>>> > nodes. However, I get some error messages when I trigger the
>>> > checkpointing process.
>>> >
>>> > Error: expected_component: PID information unavailable!
>>> > Error: expected_component: Component Name information unavailable!
>>> >
>>> > I am using Open MPI 1.3 and BLCR 0.8.1
>>>
>>> Can you try the v1.4 release and see if the problem persists?
>>>
>>> >
>>> > I execute my application as follows:
>>> >
>>> > mpirun -am ft-enable-cr -np 3 --hostfile hosts gol
>>> >
>>> > My question:
>>> >
>>> > Does Open MPI with BLCR support checkpointing of a multi-node MPI
>>> > application? If so, can you provide some information on how to
>>> > achieve this?
>>>
>>> Open MPI is able to checkpoint a multi-node application (that's
>>> what it was designed to do). There are some examples at the link
>>> below:
>>> http://www.osl.iu.edu/research/ft/ompi-cr/examples.php
>>>
>>> -- Josh
>>>
>>> >
>>> > Cheers,
>>> >
>>> > Jean.
>>> >
>>> > _______________________________________________
>>> > users mailing list
>>> > users_at_[hidden]
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users