This is the revised concept for our semi-automatic optimized parallel I/O system. For the previous discussion, please take a look at the following link:
Thanks to Ralph and Jeff for all the advice. The whole system has been reconsidered; please take a look at the attached pictures. Because parallel I/O is extremely complex, we have chosen to start with the most important and influential part, the I/O algorithm. The other parts (listed by Jeff), such as the MPI layer, the OS of the file system, the storage controller, the network and so on, should be easier to take into consideration one by one later (I hope I am not wrong :)).
Description of the picture
I/O System: The system we want to implement.
Other Systems: The systems outside the I/O system, which include the database, the I/O monitor and file systems such as GPFS.
Step 1: The client sends the commands to the I/O nodes and starts the system daemon on each node, which in turn starts the MPI process.
Step 2: After the system daemon has finished its preparation, the MPI process starts running. All the necessary information, such as the URI of the database, the address of the source/target file in the file system, the I/O parameters, the number of processes used and so on, is passed to the MPI process either as MPI hints or as parameters of the mpirun command.
Step 3 & 4: After MPI_Init(), we can define a function named something like MPI_IO_Select() to obtain the best I/O algorithm/pattern from the database. A similar algorithm-selection function has already been implemented in OMPIO under the fcoll module. I think it is possible to add the database access to the source code of this module. In addition, it is also possible to query the file system for its storage properties before selecting the I/O algorithm/pattern, if the file system offers such an API. The selected I/O algorithm/pattern, with suitable I/O parameters, is then applied in the next steps.
Step 5 & 6: The best I/O operation runs on the file system.
Step 7: After the MPI process ends, the system daemon carries on with any further work.
Step 8 & 9: While the file system is being accessed, the monitor keeps watching the status of the file system and the performance of the I/O operation. The results are collected and sent to the database for further analysis. This part has no interaction with the MPI process or even the I/O system; therefore, it does not have to be real time.
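As a rough illustration of how Steps 3 & 4 and Steps 8 & 9 could fit together, here is a minimal sketch. It is written in Python rather than in the C of the OMPIO module so that it runs without an MPI installation, and it uses an in-memory SQLite database as a stand-in for the real database. The table layout, the function names (record_result, select_io_algorithm), the pattern key and the algorithm labels are all assumptions made for this example; none of them are existing Open MPI APIs.

```python
import json
import sqlite3

# Hypothetical schema: one row per observed I/O operation.
# pattern   : key describing the I/O pattern (format assumed for this sketch)
# algorithm : label of the I/O algorithm that was used
# params    : I/O parameters, stored as a JSON string
# bandwidth : measured bandwidth in MB/s
SCHEMA = """
CREATE TABLE IF NOT EXISTS io_history (
    pattern   TEXT NOT NULL,
    algorithm TEXT NOT NULL,
    params    TEXT NOT NULL,
    bandwidth REAL NOT NULL
)
"""


def open_database(uri=":memory:"):
    """Open the stand-in database and make sure the table exists."""
    conn = sqlite3.connect(uri)
    conn.execute(SCHEMA)
    return conn


def record_result(conn, pattern, algorithm, params, bandwidth):
    """Steps 8 & 9: the monitor stores one measurement (not time critical)."""
    conn.execute(
        "INSERT INTO io_history VALUES (?, ?, ?, ?)",
        (pattern, algorithm, json.dumps(params, sort_keys=True), bandwidth),
    )
    conn.commit()


def select_io_algorithm(conn, pattern, default):
    """Steps 3 & 4: return the best recorded (algorithm, params) for a pattern.

    Falls back to `default` when the database has no entry for the pattern.
    """
    row = conn.execute(
        "SELECT algorithm, params FROM io_history "
        "WHERE pattern = ? ORDER BY bandwidth DESC LIMIT 1",
        (pattern,),
    ).fetchone()
    if row is None:
        return default
    return row[0], json.loads(row[1])


if __name__ == "__main__":
    conn = open_database()
    default = ("two_phase", {"aggregators": 4})   # illustrative labels only
    pattern = "write/contiguous/64procs"          # hypothetical pattern key
    print(select_io_algorithm(conn, pattern, default))  # no history yet
    record_result(conn, pattern, "two_phase", {"aggregators": 4}, 850.0)
    record_result(conn, pattern, "dynamic", {"aggregators": 8}, 1200.0)
    print(select_io_algorithm(conn, pattern, default))
```

In this sketch the lookup and the recording share nothing but the database, which matches the idea that the monitor does not need to interact with the MPI process.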
The system chooses the I/O operation according to several conditions in order to ensure that it will not be worse than the last similar I/O operation. With the help of the database, it might gain some self-learning ability. The tunable part is NOT ONLY the I/O algorithm, but also the I/O-related parameters.
We think it will be very useful for applications that usually run similar or long I/O operations.
Any suggestions are welcome.
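One possible reading of the rule in the paragraph above, sketched in Python: keep the best configuration measured so far for a given kind of I/O operation, and replace it only when a new measurement beats it. The class names (IOConfig, BestKnown), the parameter names and the bandwidth figures are hypothetical and only serve the example. A configuration here is the algorithm together with its parameters, so a change of parameters alone can also become the new best.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IOConfig:
    """An I/O algorithm label together with its tunable parameters."""
    algorithm: str
    params: tuple = ()  # e.g. (("aggregators", 8),); names are hypothetical


@dataclass
class BestKnown:
    """Tracks the best configuration seen so far for one kind of I/O operation."""
    config: IOConfig
    bandwidth: float = 0.0  # MB/s; 0.0 means "not measured yet"
    history: list = field(default_factory=list)

    def report(self, config, bandwidth):
        """Record a measurement; adopt `config` only if it beat the current best."""
        self.history.append((config, bandwidth))
        if bandwidth > self.bandwidth:
            self.config = config
            self.bandwidth = bandwidth

    def next_config(self):
        """Configuration to use for the next similar operation."""
        return self.config


if __name__ == "__main__":
    best = BestKnown(IOConfig("two_phase", (("aggregators", 4),)))
    best.report(best.config, 850.0)
    # Same algorithm, different parameter: faster, so it becomes the new best.
    best.report(IOConfig("two_phase", (("aggregators", 8),)), 1100.0)
    # Different algorithm, slower than the current best: recorded but not adopted.
    best.report(IOConfig("dynamic", (("aggregators", 8),)), 900.0)
    print(best.next_config())
```

Note that this only guarantees "not worse" with respect to the recorded measurements; the actual performance of the next run can still vary, for example with the load on the file system.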