The only thing that came to mind was that the I/O on the second dump may have been substantial enough to overload the OSSs (I/O servers), resulting in a process or task hang. Can you tell whether your Lustre environment is getting overwhelmed when the Open MPI / FLASH combination checkpoints the second time? I know you write files > 2 GB all the time, but perhaps this particular combination is delivering that amount of data in a much shorter period of time...

Just a thought :-\

Jeff F. Pummill
University of Arkansas

Brock Palen wrote:
I started a new run with some changes.

Shortening the run won't work well; it takes 3 days just to get
through the AMR.

Brock Palen
Center for Advanced Computing

On Jan 25, 2008, at 3:01 PM, Daniel Pfenniger wrote:


Brock Palen wrote:
Is anyone using FLASH with Open MPI?  We are here, but whenever it
tries to write its second checkpoint file it segfaults once it gets
to 2.2 GB, always in the same location.

Debugging is a pain as it takes 3 days to get to that point.  Just
wondering if anyone else has seen this same behavior.
Just to make testing faster, you might consider reducing the file output
interval (trstrt or nrstrt parameters in flash.par) and decreasing the
resolution (lrefine_max) to produce smaller files and see whether
the problem is related to the file size.
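
As a rough sketch of what I mean, a flash.par fragment along these lines might do it. The parameter names are the ones mentioned above; the values are placeholders I made up for illustration, not tested settings, so adjust them to your problem:

```
# Hypothetical values for a quick test run; tune to your setup.
# Write restart (checkpoint) files more often so the second one
# is reached sooner.
trstrt      = 1.0e-3    # simulation time between checkpoint files (assumed value)
nrstrt      = 50        # number of steps between checkpoint files (assumed value)

# Lower the maximum refinement level so each checkpoint file is smaller.
lrefine_max = 4         # assumed value; pick something below your production setting
```

If the crash disappears with smaller files and comes back once a file grows past the same size, that would point at the file size rather than the second write as such.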


users mailing list

