Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ompi-restart issue : ompi-restart doesn't work across nodes - possible installation problem or environment setting problem??
From: arun dhakne (arundhakne_at_[hidden])
Date: 2008-10-09 22:15:59


These are the backtraces from the two core files:

gdb hello core.14653

#0 0x000000300bc0cbc0 in ?? ()
#1 0x00002aaaab09d0fb in ?? ()
#2 0x00007fff6a782920 in ?? ()
#3 0x00002aaaaae3d348 in ?? ()
#4 0x00007fff6a7827b0 in ?? ()
#5 0x0000003806e6bcb4 in ?? ()
#6 0x0000000000000000 in ?? ()

gdb hello core.14654

#0 0x000000300bc0cbc0 in ?? ()
#1 0x00002aaaab09d0fb in ?? ()
#2 0x00007fff92eb3040 in ?? ()
#3 0x00002aaaaae3d348 in ?? ()
#4 0x00007fff92eb2ed0 in ?? ()
#5 0x0000003806e6bcb4 in ?? ()
#6 0x0000000000000000 in ?? ()
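
Note that every frame in both traces is unresolved (`??`), which generally means gdb found no symbol information for those addresses. The same `bt` step can be run non-interactively over several core files. The sketch below is only an illustration: the function name `collect_backtraces` and the `GDB` override variable are hypothetical, and it assumes the binary is `./hello` with the core files in the current directory.

```shell
#!/bin/sh
# Minimal sketch: print a backtrace for each core file without entering
# the interactive gdb prompt.
# Assumptions (not from the thread): the function name and the GDB
# override variable are made up for illustration.

GDB="${GDB:-gdb}"   # gdb executable to use; override to test or to pick another build

collect_backtraces() {
    binary="$1"
    shift
    for core in "$@"; do
        echo "== $core =="
        # -batch: exit after running the commands; -ex bt: run 'bt' once loaded
        "$GDB" -batch -ex bt "$binary" "$core"
    done
}
```

Usage would look like `collect_backtraces ./hello core.14653 core.14654`. If the frames still show as `??`, it may help to confirm that the binary passed to gdb is the exact one that produced the core and that it was built with debugging symbols.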

Please let me know if any other info is required.

On Thu, Oct 9, 2008 at 2:01 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:
> I cannot interpret the raw core files since they are specific to your system
> and setup. Can you run them through gdb and get a backtrace? Run "gdb hello
> core.1234", then use the 'bt' command from inside gdb.
>
> That will help me start to focus in on the problem.
>
> Cheers,
> Josh
>
> On Oct 8, 2008, at 10:22 PM, arun dhakne wrote:
>
>> I have configured with the additional flags (--enable-ft-thread
>> --enable-mpi-threads), but there is no change in behaviour; it still
>> segfaults.
>> Open MPI version:
>> Open MPI: 1.3a1r19685
>>
>> BLCR version:
>> version 0.7.3
>>
>>
>> The core file is attached.
>> hello.c, the sample MPI program whose core was dumped, is also attached.
>>
>> ~]$ ompi-restart ompi_global_snapshot_11219.ckpt
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 11288 on node
>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>> fault).
>> --------------------------------------------------------------------------
>> 2 total processes killed (some possibly by mpirun during cleanup)
>>
>>
>> Best,
>>
>>
>> On Mon, Oct 6, 2008 at 6:44 PM, Josh Hursey <jjhursey_at_[hidden]> wrote:
>>>
>>> The installation looks OK, though I'm not sure what is causing the
>>> segfault of the restarted process. Two things to try. First, can you
>>> send me a backtrace from the core file generated by the segmentation
>>> fault? That will provide insight into what is causing it.
>>>
>>> Second, you may try enabling the C/R thread, which allows a checkpoint
>>> to progress when an application is in a computation loop instead of
>>> only when it is in the MPI library. To do so, configure with these
>>> additional flags:
>>> --enable-ft-thread --enable-mpi-threads
>>>
>>> What version of Open MPI are you using? What version of BLCR?
>>>
>>> Best,
>>> Josh
>>>
>>> On Oct 6, 2008, at 3:55 PM, arun dhakne wrote:
>>>
>>>> Hi all,
>>>>
>>>> This is the procedure I followed to install Open MPI. Is there an
>>>> installation or environment-setting problem here?
>>>> An Open MPI program with 4 processes is run across 2 dual-core Intel
>>>> machines, with 2 processes running on each machine.
>>>>
>>>> ompi-checkpoint succeeds, but ompi-restart fails with the following
>>>> error:
>>>>
>>>>
>>>> $:> ompi-restart ompi_global_snapshot_6045.ckpt
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 6372 on node
>>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>>> fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Open-mpi installation steps:
>>>> ./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr
>>>> --with-blcr=/usr/lib64 --enable-debug
>>>> make
>>>> make install
>>>>
>>>>
>>>>
>>>> export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
>>>> export PATH=$HOME/.openmpi/bin:$PATH
>>>>
>>>> NOTE: blcr is installed as a module
>>>> $:> lsmod | grep blcr
>>>>
>>>> blcr 117892 0
>>>> blcr_vmadump 58264 1 blcr
>>>> blcr_imports 46080 2 blcr,blcr_vmadump
>>>>
>>>> Please let me know if there is a problem with the above procedure.
>>>> Thanks a lot for your time.
>>>>
>>>> Best.
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: arun dhakne <arundhakne_at_[hidden]>
>>>> Date: Tue, Sep 30, 2008 at 12:52 AM
>>>> Subject: ompi-restart issue : ompi-restart doesn't work across nodes
>>>> To: Open MPI Users <users_at_[hidden]>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> I went through some previous ompi-restart issues, but I couldn't
>>>> find anything similar to this problem.
>>>>
>>>> I have installed BLCR and configured Open MPI 'openmpi-1.3a1r19645'.
>>>>
>>>> i) If the sample MPI program is run on a single machine (np 4, without
>>>> any hostfile) and I checkpoint it, the checkpoint succeeds and
>>>> ompi-restart works as well.
>>>>
>>>> ii) If the sample MPI program is run across, say, 2 different nodes, the
>>>> checkpoint succeeds BUT ompi-restart throws the following error:
>>>>
>>>> $ ompi-restart ompi_global_snapshot_7604.ckpt
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 3 with PID 9590 on node
>>>> acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation
>>>> fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Please let me know if more information is needed.
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Arun U. Dhakne
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Arun U. Dhakne
>> Graduate Student
>> Computer Science and Engineering Dept.
>> State University of New York at Buffalo
>> <core.tar.gz><hello.c>
>
>

-- 
Thanks and Regards,
Arun U. Dhakne
Graduate Student
Computer Science and Engineering Dept.
State University of New York at Buffalo