Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-10-09 14:01:46


I cannot interpret the raw core files since they are specific to your
system and setup. Can you run it through gdb and get a backtrace? Run
"gdb hello core.1234", then use the 'bt' command from inside gdb.

That will help me start to focus in on the problem.
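
If an interactive session is inconvenient, the same backtrace can be
captured in one command. This is only a sketch: it assumes gdb is
installed, and that "hello" and "core.1234" stand for your own executable
and core file. The helper name core_backtrace is hypothetical.

```shell
# Hypothetical helper: print backtraces from a core file without an
# interactive gdb session. Arguments: <executable> <core-file>.
core_backtrace() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "usage: core_backtrace <executable> <core-file>" >&2
        return 2
    fi
    # -batch exits after running the -ex commands; "thread apply all bt"
    # also covers any extra threads in the process.
    gdb -batch -ex "bt" -ex "thread apply all bt" "$1" "$2"
}
```

For example, "core_backtrace ./hello core.1234 > backtrace.txt" would save
the output to a file that can be pasted into a reply.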

Cheers,
Josh

On Oct 8, 2008, at 10:22 PM, arun dhakne wrote:

> I have configured with the additional flags (--enable-ft-thread
> --enable-mpi-threads), but there is no change in behaviour; it still
> gives a seg fault.
> open mpi version:
> Open MPI: 1.3a1r19685
>
> blcr version:
> version 0.7.3
>
>
> The core file is attached.
> hello.c, the sample MPI program whose core was dumped, is also attached.
>
> ~]$ ompi-restart ompi_global_snapshot_11219.ckpt
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 11288 on node
> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
> fault).
> --------------------------------------------------------------------------
> 2 total processes killed (some possibly by mpirun during cleanup)
>
>
> Best,
>
>
> On Mon, Oct 6, 2008 at 6:44 PM, Josh Hursey <jjhursey_at_[hidden]>
> wrote:
>> The installation looks ok, though I'm not sure what is causing the
>> segfault of the restarted process. Two things to try. First, can you
>> send me a backtrace from the core file that is generated from the
>> segmentation fault? That will provide insight into what is causing it.
>>
>> Second, you may try to enable the C/R thread, which allows a
>> checkpoint to progress when an application is in a computation loop
>> instead of only when it is in the MPI library. To do so, configure
>> with these additional flags:
>> --enable-ft-thread --enable-mpi-threads
>>
>> What version of Open MPI are you using? What version of BLCR?
>>
>> Best,
>> Josh
>>
>> On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:
>>
>>> Hi all,
>>>
>>> This is the procedure I have followed to install Open MPI. Is there
>>> some installation or environment setting problem in here?
>>> An Open MPI program with 4 processes is run across 2 dual-core Intel
>>> machines, with 2 processes running on each machine.
>>>
>>> ompi-checkpoint is successful but ompi-restart fails with the
>>> following error:
>>>
>>>
>>> $:> ompi-restart ompi_global_snapshot_6045.ckpt
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 6372 on node
>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>> fault).
>>> --------------------------------------------------------------------------
>>>
>>> Open-mpi installation steps:
>>> ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr --with-blcr=/usr/lib64 --enable-debug
>>> make
>>> make install
>>>
>>>
>>>
>>> export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
>>> export PATH=$HOME/.openmpi/bin:$PATH
>>>
>>> NOTE: blcr is installed as a module
>>> $:> lsmod | grep blcr
>>>
>>> blcr 117892 0
>>> blcr_vmadump 58264 1 blcr
>>> blcr_imports 46080 2 blcr,blcr_vmadump
>>>
>>> Please let me know if there is a problem with the above procedure.
>>> Thanks a lot for your time.
>>>
>>> Best.
>>>
>>> ---------- Forwarded message ----------
>>> From: arun dhakne <arundhakne_at_[hidden]>
>>> Date: Tue, Sep 30, 2008 at 12:52 AM
>>> Subject: ompi-restart issue : ompi-restart doesn't work across nodes
>>> To: Open MPI Users <users_at_[hidden]>
>>>
>>>
>>> Hi all,
>>>
>>> I had gone through some previous ompi-restart issues but I couldn't
>>> find anything similar to this problem.
>>>
>>> I have installed BLCR, and configured Open MPI 'openmpi-1.3a1r19645'.
>>>
>>> i) If the sample MPI program (say, np 4 on a single machine, that is,
>>> without any hostfile) is run and I try to checkpoint it, the checkpoint
>>> succeeds and even ompi-restart works in this case.
>>>
>>> ii) If the sample MPI program is run across, say, 2 different nodes,
>>> the checkpoint succeeds BUT ompi-restart throws the following
>>> error:
>>>
>>> $ ompi-restart ompi_global_snapshot_7604.ckpt
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 3 with PID 9590 on node
>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>> fault).
>>> --------------------------------------------------------------------------
>>>
>>> Please let me know if more information is needed.
>>>
>>> --
>>> Thanks and Regards,
>>> Arun U. Dhakne
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Thanks and Regards,
> Arun U. Dhakne
> Graduate Student
> Computer Science and Engineering Dept.
> State University of New York at Buffalo
> <core.tar.gz><hello.c>