The only thing that comes to mind is that on the second dump, the I/O
may have been heavy enough to overload the OSSs (I/O servers),
resulting in a process or task hang. Can you tell whether your
Lustre environment is getting overwhelmed when the Open MPI / FLASH
combination checkpoints the second time? I know you write files > 2 GB
all the time, but perhaps this particular combination is delivering
that amount of data in a much shorter period of time.
Just a thought :-\
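
If it helps, one rough way to see how fast the checkpoint is being
written is to poll the file's size while the dump is in progress. This
is only a sketch: the function names (growth_rates, poll_file_size) and
the sampling interval are my own assumptions, not anything FLASH,
Open MPI, or Lustre provides, and the size reported on a parallel file
system may lag behind what has actually been written.

```python
import os
import time


def growth_rates(samples):
    """Return MB/s between consecutive (timestamp_s, size_bytes) samples."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((s1 - s0) / dt / 1e6)
    return rates


def poll_file_size(path, interval_s=1.0, count=10):
    """Sample the size of `path` `count` times, `interval_s` seconds apart.

    `path` is whatever checkpoint file is currently being written
    (hypothetical; substitute your own). A missing file counts as 0 bytes.
    """
    samples = []
    for _ in range(count):
        size = os.path.getsize(path) if os.path.exists(path) else 0
        samples.append((time.monotonic(), size))
        time.sleep(interval_s)
    return samples
```

Calling growth_rates(poll_file_size(path)) on the first and second
checkpoints and comparing the numbers would show whether the second
one really arrives in a much shorter burst, and whether the rate drops
to zero near the point where the job dies.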
Jeff F. Pummill
University of Arkansas
Brock Palen wrote:
> I started a new run with some changes.
> Shortening the run won't work well; it takes 3 days just to get
> through the AMR.
> Brock Palen
> Center for Advanced Computing
> On Jan 25, 2008, at 3:01 PM, Daniel Pfenniger wrote:
>> Brock Palen wrote:
>>> Is anyone using FLASH with Open MPI? We are here, but whenever it
>>> tries to write its second checkpoint file it segfaults once it gets
>>> to 2.2 GB, always in the same location.
>>> Debugging is a pain, as it takes 3 days to get to that point. Just
>>> wondering if anyone else has seen this same behavior.
>> Just to make testing faster, you might think about reducing the file
>> output interval (trstrt or nrstrt parameters in flash.par) and
>> decreasing the resolution (lrefine_max) to produce smaller files and
>> see whether the problem is related to the file size.
>> users mailing list