On Wed, Jul 27, 2011 at 06:13:05PM +0200, Troels Haugboelle wrote:
> and we get good (+GB/s) performance when writing files from large runs.
> Interestingly, an alternative and conceptually simpler option is to
> use MPI_FILE_WRITE_ORDERED, but the performance of that function on
> Blue-Gene/P sucks - 20 MB/s instead of GB/s. I do not know why.
Ordered mode as implemented in ROMIO is awful. Entirely serialized.
We pass a token from process to process. Each process acquires the
token, updates the shared file pointer, does its I/O, then passes the
token to the next process.
What we should do, and have done in test branches, is use MPI_SCAN
to look at the shared file pointer once, tell all the processes their
offsets, then update the shared file pointer while all processes do
I/O. See:
Robert Latham, Robert Ross, and Rajeev Thakur. "Implementing
MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided
Communication". International Journal of High Performance Computing
Applications, 21(2):132-143, 2007.
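The scan-based scheme described above can be sketched roughly as
follows. This is plain C with no MPI calls, so it only simulates the
arithmetic; the name ordered_offsets and its parameters are made up
for illustration and are not ROMIO code. In a real implementation the
serial loop would be a single prefix-sum collective (MPI_Scan or
MPI_Exscan) across the processes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simulate the offset computation for ordered mode (illustrative only).
 *
 * shared_fp : value of the shared file pointer, read once up front
 * nbytes[i] : number of bytes rank i wants to write
 * offsets[i]: (output) file offset at which rank i should write
 *
 * Returns the new value of the shared file pointer. Each offset is an
 * exclusive prefix sum of the sizes of the lower-numbered ranks, so no
 * rank has to wait for a token before starting its I/O.
 */
int64_t ordered_offsets(int64_t shared_fp, const int64_t *nbytes,
                        int64_t *offsets, size_t nprocs)
{
    assert(nbytes != NULL && offsets != NULL);

    int64_t running = shared_fp;
    for (size_t rank = 0; rank < nprocs; rank++) {
        offsets[rank] = running;   /* where this rank starts writing */
        running += nbytes[rank];   /* the next rank starts after it */
    }
    return running;                /* shared file pointer after the call */
}
```

With a shared file pointer of 1000 and four ranks writing 100, 0, 250,
and 50 bytes, the ranks would write at 1000, 1100, 1100, and 1350, and
the shared file pointer would end up at 1400.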
Since no one uses the shared file pointers, and even fewer people use
ordered mode, we just haven't seen the need to do so.
Do you want to rebuild your MPI library on BlueGene? I can pretty
quickly generate and send a patch that will make ordered mode go whip
> On 6/7/11 15:04, Jeff Squyres wrote:
> >On Jun 7, 2011, at 4:53 AM, Troels Haugboelle wrote:
> >>In principle yes, but the problem is that we have an unequal number of particles on each node, so the length of each array is not guaranteed to be divisible by 2, 4, or any other number. If I have understood the definition of MPI_TYPE_CREATE_SUBARRAY correctly, the offset can be 64-bit but the global array size cannot. So, optimally, what I am looking for is a simple vector type that allows an unequal size for each thread, with 64-bit offsets and a 64-bit global array size.
> >It's a bit awkward, but you can still make datatypes to give the offset that you want. E.g., if you need an offset of 2B+31 bytes, you can make datatype A as a contiguous type of N=(2B/sizeof(int)) ints. Then make datatype B as a struct type containing type A and 31 MPI_BYTEs. Then use 1 instance of datatype B to get the offset that you want.
> >You could write utility functions that, given a specific (64-bit) offset, make an MPI datatype that matches the offset and then free it (and all sub-datatypes) after use.
> >There is a bit of overhead in creating these datatypes, but it should be dwarfed by the amount of data that you're reading/writing, right?
> >It's awkward, but it should work.
> >>Another possible workaround would be to identify subsections that do not exceed 2B elements, make sub-communicators, and then let each of them dump their elements with proper offsets. It may work. The problematic architecture is a BG/P. On other clusters, doing simple I/O (letting all threads open the file, seek to their position, and then write their chunk) works fine, but somehow on BG/P performance drops dramatically. My guess is that there is some file locking, or we are overwhelming the I/O nodes.
> >>>This ticket for the MPI-3 standard is a first step in the right direction, but won't do everything you need (this is more FYI):
> >>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/265
> >>>See the PDF attached to the ticket; it's going up for a "first reading" in a month. It'll hopefully be part of the MPI-3 standard by the end of the year (Fab Tillier, CC'ed, has been the chief proponent of this ticket for the past several months).
> >>>Quincey Koziol from the HDF group is going to propose a follow-on to this ticket, specifically about the case you're referring to -- large counts for file functions and datatype constructors. Quincey -- can you expand on what you'll be proposing, perchance?
> >>Interesting, I think something along the lines of the note would be very useful and needed for large applications.
> >>Thanks a lot for the pointers and your suggestions,
Mathematics and Computer Science Division
Argonne National Lab, IL USA