Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing mpi4py program
From: ananda.mudar_at_[hidden]
Date: 2010-08-16 12:25:54


Josh

 

I have one more update from analyzing this issue.

To recap, I am using openmpi trunk r23596 with mpi4py-1.2.1 and BLCR
0.8.2. When I checkpoint a Python script written using mpi4py, the
checkpoint is taken successfully but the program doesn't progress
afterwards. I tried openmpi 1.4.2 first and then the latest trunk
version, as suggested, and I see similar behavior in both releases.

 

I have one more observation that may be useful. I tried the "-stop"
option of ompi-checkpoint (trunk version), and mpirun prints the
following error messages when I run the command
"ompi-checkpoint -stop -v <pid of mpirun>":

 

==== Error messages in the window where mpirun command was running START ====

[hpdcnln001:15148] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
[hpdcnln001:15148] [[37739,1],2] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15149] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
[hpdcnln001:15149] [[37739,1],3] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15146] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
[hpdcnln001:15146] [[37739,1],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253
[hpdcnln001:15147] Error: ( app) Passed an invalid handle (0) [5 ="/tmp/openmpi-sessions-amudar_at_hpdcnln001_0/37739/1"]
[hpdcnln001:15147] [[37739,1],1] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/sstore/central/sstore_central_module.c at line 253

==== Error messages in the window where mpirun command was running END ====

 

Please note that the checkpoint image was still created at the end of
it. However, when I then run the command "kill -CONT <pid of mpirun>",
the program fails to move forward, which is the same as the original
problem I reported.
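
For anyone trying to reproduce this, the stop-and-continue sequence
described in this message can be sketched in Python roughly as follows.
This is only a sketch under assumptions: the helper names
checkpoint_command and checkpoint_then_continue are hypothetical, and it
uses only the "-stop" and "-v" options and the CONT signal already
mentioned here.

```python
import os
import signal
import subprocess


def checkpoint_command(mpirun_pid, stop=True, verbose=True):
    """Build the ompi-checkpoint argv used in this message (hypothetical helper)."""
    cmd = ["ompi-checkpoint"]
    if stop:
        cmd.append("-stop")  # the option tried in this message
    if verbose:
        cmd.append("-v")
    cmd.append(str(mpirun_pid))
    return cmd


def checkpoint_then_continue(mpirun_pid):
    """Run the checkpoint, then send SIGCONT (equivalent to kill -CONT <pid>)."""
    subprocess.run(checkpoint_command(mpirun_pid), check=True)
    os.kill(mpirun_pid, signal.SIGCONT)
```

Calling checkpoint_then_continue(<pid of mpirun>) would issue the same
two steps in order; it does nothing to avoid the hang itself.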

 

Let me know if you need any additional information.

 

Thanks for your time in advance

 

- Ananda

 

Ananda B Mudar, PMP

Senior Technical Architect

Wipro Technologies

Ph: 972 765 8093

ananda.mudar_at_[hidden]

 

From: Ananda Babu Mudar (WT01 - Energy and Utilities)
Sent: Sunday, August 15, 2010 11:25 PM
To: users_at_[hidden]
Subject: Re: [OMPI users] Checkpointing mpi4py program
Importance: High

 

Josh

I tried running the mpi4py program with the latest trunk version of
openmpi. I compiled openmpi-1.7a1r23596 from trunk and recompiled
mpi4py to use this library. Unfortunately, I see the same behavior as
with openmpi 1.4.2, i.e., the checkpoint succeeds but the program
doesn't proceed after that.

I have attached the stack traces of all the MPI processes that are part
of the mpirun. I would really appreciate it if you could take a look at
the stack traces and let me know the potential problem. I am stuck at
this point and need your assistance to move forward. Please let me know
if you need any additional information.
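
In case it helps others collect similar traces, here is a minimal sketch
of one way to do it. It assumes gdb is installed and is allowed to attach
to the processes; the helper names backtrace_command and
collect_backtraces are hypothetical, and this is not necessarily how the
attached traces were produced.

```python
import subprocess


def backtrace_command(pid):
    """Build a non-interactive gdb command that prints every thread's backtrace."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pids):
    """Return a dict mapping each PID to the text gdb printed for it."""
    traces = {}
    for pid in pids:
        result = subprocess.run(
            backtrace_command(pid),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        traces[pid] = result.stdout
    return traces
```

The PIDs passed to collect_backtraces would be those of the MPI processes
started by mpirun, as reported by ps on the node.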

Thanks for your time in advance

Thanks

Ananda

-----Original Message-----
Subject: Re: [OMPI users] Checkpointing mpi4py program
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-08-13 12:28:31

Nope. I probably won't get to it for a while. I'll let you know if I do.

On Aug 13, 2010, at 12:17 PM, <ananda.mudar_at_[hidden]>
<ananda.mudar_at_[hidden]> wrote:

> OK, I will do that.
>
> But did you try this program on a system where the latest trunk is
> installed? Were you successful in checkpointing?
>
> - Ananda
> -----Original Message-----
> Message: 9
> Date: Fri, 13 Aug 2010 10:21:29 -0400
> From: Joshua Hursey <jjhursey_at_[hidden]>
> Subject: Re: [OMPI users] users Digest, Vol 1658, Issue 2
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <7A43615B-A462-4C72-8112-496653D8F0A0_at_[hidden]>
> Content-Type: text/plain; charset=us-ascii
>
> I probably won't have an opportunity to work on reproducing this on
> the 1.4.2. The trunk has a bunch of bug fixes that probably will not be
> backported to the 1.4 series (things have changed too much since that
> branch). So I would suggest trying the 1.5 series.
>
> -- Josh
>
> On Aug 13, 2010, at 10:12 AM, <ananda.mudar_at_[hidden]>
> <ananda.mudar_at_[hidden]> wrote:
>
>> Josh
>>
>> I am having problems compiling the sources from the latest trunk. It
>> complains of libgomp.spec missing even though that file exists on my
>> system. I will see if I have to change any other environment
>> variables to have a successful compilation. I will keep you posted.
>>
>> BTW, were you successful in reproducing the problem on a system with
>> OpenMPI 1.4.2?
>>
>> Thanks
>> Ananda
>> -----Original Message-----
>> Date: Thu, 12 Aug 2010 09:12:26 -0400
>> From: Joshua Hursey <jjhursey_at_[hidden]>
>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <1F1445AB-9208-4EF0-AF25-5926BD53C7E1_at_[hidden]>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Can you try this with the current trunk (r23587 or later)?
>>
>> I just added a number of new features and bug fixes, and I would be
>> interested to see if it fixes the problem. In particular I suspect
>> that this might be related to the Init/Finalize bounding of the
>> checkpoint region.
>>
>> -- Josh
>>
>> On Aug 10, 2010, at 2:18 PM, <ananda.mudar_at_[hidden]>
>> <ananda.mudar_at_[hidden]> wrote:
>>
>>> Josh
>>>
>>> Please find attached the python program that reproduces the hang
>>> that I described. The initial part of this file describes the
>>> prerequisite modules and the steps to reproduce the problem. Please
>>> let me know if you have any questions in reproducing the hang.
>>>
>>> Please note that if I add the following lines at the end of the
>>> program (in case sleep_time is True), the problem disappears, i.e.,
>>> the program resumes successfully after the checkpoint completes.
>>> # Add following lines at the end for sleep_time is True
>>> else:
>>>     time.sleep(0.1)
>>> # End of added lines
>>>
>>>
>>> Thanks a lot for your time in looking into this issue.
>>>
>>> Regards
>>> Ananda
>>>
>>> Ananda B Mudar, PMP
>>> Senior Technical Architect
>>> Wipro Technologies
>>> Ph: 972 765 8093
>>> ananda.mudar_at_[hidden]
>>>
>>>
>>> -----Original Message-----
>>> Date: Mon, 9 Aug 2010 16:37:58 -0400
>>> From: Joshua Hursey <jjhursey_at_[hidden]>
>>> Subject: Re: [OMPI users] Checkpointing mpi4py program
>>> To: Open MPI Users <users_at_[hidden]>
>>> Message-ID: <270BD450-743A-4662-9568-1FEDFCC6F9C6_at_[hidden]>
>>> Content-Type: text/plain; charset=windows-1252
>>>
>>> I have not tried to checkpoint an mpi4py application, so I cannot
>>> say for sure if it works or not. You might be hitting something with
>>> the Python runtime interacting in an odd way with either Open MPI or
>>> BLCR.
>>>
>>> Can you attach a debugger and get a backtrace on a stuck checkpoint?
>>> That might show us where things are held up.
>>>
>>> -- Josh
>>>
>>>
>>> On Aug 9, 2010, at 4:04 PM, <ananda.mudar_at_[hidden]>
>>> <ananda.mudar_at_[hidden]> wrote:
>>>
>>>> Hi
>>>>
>>>> I have integrated mpi4py with openmpi 1.4.2 that was built with BLCR
>>>> 0.8.2. When I run ompi-checkpoint on a program written using mpi4py,
>>>> I see that the program sometimes doesn't resume after successful
>>>> checkpoint creation. This doesn't always occur: most of the time the
>>>> program resumes after successful checkpoint creation and completes
>>>> successfully. Has anyone tested the checkpoint/restart functionality
>>>> with mpi4py programs? Are there any best practices that I should
>>>> keep in mind while checkpointing mpi4py programs?
>>>>
>>>> Thanks for your time
>>>> - Ananda
>>>> Please do not print this email unless it is absolutely necessary.
>>>>
>>>> The information contained in this electronic message and any
>>> attachments to this message are intended for the exclusive use of
the
>>> addressee(s) and may contain proprietary, confidential or privileged

>>> information. If you are not the intended recipient, you should not
>>> disseminate, distribute or copy this e-mail. Please notify the
sender
>>> immediately and destroy all copies of this message and any
>> attachments.
>>>>
>>>> WARNING: Computer viruses can be transmitted via email. The
> recipient
>>> should check this email and any attachments for the presence of
>> viruses.
>>> The company accepts no liability for any damage caused by any virus
>>> transmitted by this email.
>>>>
>>>> www.wipro.com
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
