Open MPI User's Mailing List Archives

Subject: [OMPI users] checkpointing multi node and multi process applications
From: Jean Potsam (jeanpotsam_at_[hidden])
Date: 2010-01-21 08:27:21


Hi Josh/all,

I have upgraded Open MPI to v1.4 but still get the same error when I try to run the application on multiple nodes:

*******************
 Error: expected_component: PID information unavailable!
 Error: expected_component: Component Name information unavailable!
*******************

I am running my application from the node 'portal11' as follows:

mpirun -am ft-enable-cr -np 2 --hostfile hosts myapp

The file 'hosts' contains two host names: portal10, portal11.

I am triggering the checkpoint using ompi-checkpoint -v 'PID' from portal11.

I configured Open MPI as follows:

#####################

./configure --prefix=/home/jean/openmpi/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/blcr/ --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
#########################

Question:

Thank you

Jean

    

--- On Mon, 11/1/10, Josh Hursey <jjhursey_at_[hidden]> wrote:

From: Josh Hursey <jjhursey_at_[hidden]>
Subject: Re: [OMPI users] checkpointing multi node and multi process applications
To: "Open MPI Users" <users_at_[hidden]>
Date: Monday, 11 January, 2010, 21:42

On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
> I am trying to checkpoint an MPI application running on multiple nodes. However, I get the following error messages when I trigger the checkpoint:
>
> Error: expected_component: PID information unavailable!
> Error: expected_component: Component Name information unavailable!
>
> I am using Open MPI 1.3 and BLCR 0.8.1.

Can you try the v1.4 release and see if the problem persists?

>
> I execute my application as follows:
>
> mpirun -am ft-enable-cr -np 3 --hostfile hosts gol
>
> My question:
>
> Does Open MPI with BLCR support checkpointing an MPI application that runs across multiple nodes? If so, can you provide some information on how to achieve this?

Open MPI is able to checkpoint a multi-node application (that's what it was designed to do). There are some examples at the link below:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php
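
For orientation, here is a rough sketch of one checkpoint/restart cycle using the commands mentioned in this thread. It only prints the commands so it can be run without an MPI installation. The PID 1234, the application name ./myapp, and the helper name ckpt_plan are placeholders for illustration, and the snapshot name is assumed to follow the ompi_global_snapshot_<PID>.ckpt pattern; use whatever reference ompi-checkpoint actually reports on your system.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one checkpoint/restart cycle
# instead of executing them, so it is safe to run without Open MPI.
# Assumption: the snapshot name follows ompi_global_snapshot_<PID>.ckpt;
# prefer the reference that ompi-checkpoint prints on your system.

ckpt_plan() {
    np=$1        # number of MPI processes
    hostfile=$2  # file listing the nodes
    app=$3       # MPI executable
    pid=$4       # PID of the running mpirun (placeholder here)

    # 1. Launch with the checkpoint/restart parameter set enabled.
    echo "mpirun -am ft-enable-cr -np $np --hostfile $hostfile $app"
    # 2. From the node where mpirun runs, request a checkpoint.
    echo "ompi-checkpoint -v $pid"
    # 3. Later, restart from the global snapshot reference.
    echo "ompi-restart ompi_global_snapshot_$pid.ckpt"
}

# Hypothetical values: 2 processes, hostfile 'hosts', mpirun PID 1234.
ckpt_plan 2 hosts ./myapp 1234
```

In a real run the PID passed to ompi-checkpoint would be that of the mpirun process (for example, as found with ps), not a fixed number.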

-- Josh

>
> Cheers,
>
> Jean.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
