
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Question about checkpoint/restart protocol
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-11-06 09:21:10


On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:

> Dear Sergio,
>
> Thank you for your reply. I've inserted the modules into the kernel
> and it all worked fine. But there is still a weird issue. I use the
> command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-
> test" to start an MPI job. I then use "ompi-checkpoint PID" to
> checkpoint the job, but ompi-checkpoint doesn't respond and
> mpirun produces the following output.
>
> --------------------------------------------------------------------------
> An MPI process has executed an operation involving a call to the
> "fork()" system call to create a child process. Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your MPI job may hang, crash, or produce silent
> data corruption. The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
> Local host: comp001.local (PID 23514)
> MPI_COMM_WORLD rank: 0
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --------------------------------------------------------------------------
> [login01.local:21425] 1 more process has sent help message help-mpi-
> runtime.txt / mpi_init:warn-fork
> [login01.local:21425] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> Note: if the -n option has a value greater than 1, this error
> occurs; but if -n is 1, then ompi-checkpoint succeeds, mpirun
> produces the same message, and ompi-restart fails with the message
> [login01:21417] *** Process received signal ***
> [login01:21417] Signal: Segmentation fault (11)
> [login01:21417] Signal code: Address not mapped (1)
> [login01:21417] Failing at address: (nil)
> [login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
> [login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/
> mca_crs_blcr.so [0x2b093509dfee]
> [login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/
> mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]
> [login01:21417] [ 3] opal-restart [0x401c3e]
> [login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x32dea1d8b4]
> [login01:21417] [ 5] opal-restart [0x401399]
> [login01:21417] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 21417 on node
> login01.local exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Any help with that will be appreciated.

I have not seen this behavior before. The first error is Open MPI
warning you that one of your MPI processes is trying to use fork(), so
you may want to make sure that your application is not making any
system() or fork() calls. Open MPI itself should not be using any of
these functions from within the MPI library linked to the application.
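If you have already confirmed that the application's fork()/system() use is safe, the warning can be silenced with the MCA parameter named in the warning text itself. A sketch, reusing the host and test binary from your original command (these are environment-specific):

```shell
# Silence the fork() warning ONLY after confirming the application's
# fork()/system() use is actually safe; mpi_warn_on_fork is the MCA
# parameter named in the warning message above.
mpirun -n 2 -am ft-enable-cr \
       -mca mpi_warn_on_fork 0 \
       -H comp001 checkpoint-restart-test
```

Note that this only suppresses the message; it does not make fork() safe, so it is mainly useful for ruling the warning out as the cause of the hang.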

When you reloaded the BLCR module, did you rebuild Open MPI and
install it in a clean directory (not over the top of the old directory)?

Have you tried to checkpoint/restart a non-MPI process with BLCR on
your system? This will help rule out installation problems with BLCR.
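A minimal sanity check of BLCR on its own, independent of Open MPI, might look like the following sketch. It assumes a default BLCR install with its command-line tools on PATH; the program name is just an example, and flag details can vary between BLCR versions (see cr_checkpoint(1)):

```shell
# Run any long-lived, non-MPI program under BLCR's checkpoint library.
cr_run ./some-long-running-program &
PID=$!

# Checkpoint it; by default this writes a context file named
# context.<PID> in the current directory.
cr_checkpoint $PID

# Kill the original process, then restart it from the context file.
kill $PID
cr_restart context.$PID
```

If cr_restart itself segfaults here, the problem is in the BLCR installation rather than in Open MPI.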

I suspect that Open MPI is not building correctly, or something in
your build environment is confusing/corrupting the build. Can you send
me your config.log? It may help me pinpoint the problem if it is
build-related.
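A clean rebuild into a fresh prefix, rather than over the top of the old install, can be sketched as follows. The configure options are the ones quoted later in this thread; the new prefix name is only an example:

```shell
# Rebuild Open MPI into a brand-new prefix so no stale plugins
# (e.g. an old mca_crs_blcr.so) are picked up from a prior install.
cd openmpi-1.3.3
make distclean
./configure --prefix=/home/mab/openmpi-1.3.3-clean --with-sge \
    --enable-ft-thread --with-ft=cr --enable-mpi-threads \
    --enable-static --disable-shared --with-blcr=/home/mab/blcr-0.8.2/
make all install
```

Remember to point PATH and LD_LIBRARY_PATH at the new prefix before rerunning the test.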

-- Josh

>
> Thanks in advance,
> Mohamed Adel
>
> ________________________________________
> From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] On
> Behalf Of Sergio Díaz [sdiaz_at_[hidden]]
> Sent: Thursday, November 05, 2009 11:38 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Question about checkpoint/restart protocol
>
> Hi,
>
> Did you load the BLCR modules before compiling OpenMPI?
>
> Regards,
> Sergio
>
> Mohamed Adel wrote:
>> Dear OMPI users,
>>
>> I'm a new Open MPI user. I've configured openmpi-1.3.3 with the
>> options "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --
>> enable-ft-thread --with-ft=cr --enable-mpi-threads --enable-static
>> --disable-shared --with-blcr=/home/mab/blcr-0.8.2/" and then
>> compiled and installed it successfully.
>> Now I'm trying to use the checkpoint/restart protocol. I run a
>> program with the options "mpirun -n 2 -am ft-enable-cr -H localhost
>> prime/checkpoint-restart-test" but I receive the following error:
>>
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [madel:28896] Abort before MPI_INIT completed successfully; not
>> able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> It looks like opal_init failed for some reason; your parallel
>> process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal
>> failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> opal_cr_init() failed failed
>> --> Returned value -1 instead of OPAL_SUCCESS
>> --------------------------------------------------------------------------
>> [madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init.c at line 77
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel
>> process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems. This failure appears to be an internal failure; here's
>> some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> ompi_mpi_init: orte_init failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>>
>> I can't find the files mentioned in this post "http://www.open-mpi.org/community/lists/users/2009/09/10641.php
>> " (mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with
>> that error?
>>
>> Thanks in advance
>> Mohamed Adel
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: sdiaz_at_[hidden] ; http://www.cesga.es/
>