Open MPI User's Mailing List Archives

Subject: [OMPI users] Fault tolerant ompi - Error: Unable to find a list of active MPIRUN processes on this machine.
From: Hellmüller Roman (hroman_at_[hidden])
Date: 2011-03-30 10:33:56


Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my semester thesis.

On the login node I was successful: checkpointing works.
Since the compute nodes have different kernels, I had to compile BLCR again on the compute nodes. BLCR works there. After that I installed Open MPI (1.5.3) on the compute nodes. Running a normal MPI program works, and running it with -am ft-enable-cr also works, but as soon as I try to take a checkpoint it crashes:

hroman_at_node15 ~/semesterthesis/code/code1_heat1d $ mpirun -np 4 -am ft-enable-cr ./heatft_mpi

hroman_at_node15 ~ $ ps -a
  PID TTY TIME CMD
22488 pts/0 00:00:00 pbs_mom
22536 pts/0 00:00:00 bash
22631 pts/0 00:00:00 mpirun
22633 pts/0 00:00:03 heatft_mpi
22634 pts/0 00:00:03 heatft_mpi
22635 pts/0 00:00:03 heatft_mpi
22636 pts/0 00:00:03 heatft_mpi
22743 pts/1 00:00:00 ps

hroman_at_node15 ~ $ ompi-checkpoint 22631
--------------------------------------------------------------------------
Error: Unable to find a list of active MPIRUN processes on this machine.
       This could be due to one of the following:
        - The PID specified (22631) is not that of an active MPIRUN.
        - The session directory location could not be found/parsed.

       ompi-checkpoint attempted to find the session directory:
         /tmp//openmpi-sessions-hroman_at_node15_0
       Check to make sure that this directory exists while the MPIRUN
       process is running.

       Return Code: -13 (Not found)

--------------------------------------------------------------------------
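
For reference, the check that the error message asks for (whether the session directory exists while mpirun is running) could be sketched roughly as below. This is only a sketch under assumptions: the helper name find_session_dirs is hypothetical, and the "openmpi-sessions-" prefix and the /tmp default are taken from the error text above, not from the Open MPI sources.

```shell
#!/bin/sh
# Hypothetical helper: list Open MPI session directories under a base directory.
# Assumption: the directories are named "openmpi-sessions-*" and live directly
# under the base directory, as the ompi-checkpoint error message suggests.
find_session_dirs() {
    base="${1:-/tmp}"
    found=1
    for d in "$base"/openmpi-sessions-*; do
        # The glob stays literal when nothing matches, so test each entry.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            found=0
        fi
    done
    # Exit status 0 if at least one directory was found, 1 otherwise.
    return "$found"
}
```

Run on the compute node in a second shell while mpirun is still active, e.g. `find_session_dirs /tmp || echo "no session directory found"`.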

I've tried it with another application, but that doesn't change anything. I also tried setting the checkpoint directories in $prefix/etc/openmpi-mca-params.conf, but that didn't seem to have any effect. However, if I put something invalid in this file (something that is not a parameter, e.g. "hello world"), it complains, so the file does seem to be read.
I also checked the environment variables, and they seem to be OK as far as I can tell.

Do you have an idea where the error could be?

Here <http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz> (40 MB) you'll find the library and the builds of Open MPI and BLCR, as well as the environment variables and the output of ompi_info. Please let me know if more output is needed.

cheers
roman