Regarding your issue: when you get that 25m39s timing, do you have any
indication whether a grotesque amount of time is being spent in MPI
calls? Or is the slowdown due to the non-MPI portions?
Just to add my two cents: if this job can be run on fewer than 8 processors (ideally, even on just 1), then I'd recommend doing so. That is, run it with OpenMPI and with MPICH2 on 1, 2 and 4 processors as well. If the single-processor jobs still give vastly different timings, then perhaps Eugene is on the right track and it comes down to various computational optimizations and not so much the message-passing that's making the difference. Timings from the 2 and 4 process runs would be interesting as well, to see how this difference changes with process count.
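In case it helps, here is a rough sketch of how those runs could be scripted in Python. Everything specific in it is an assumption: the launcher paths and binary names in BUILDS are placeholders rather than anything from Sangamesh's setup, and it assumes each MPI library has its own build of the application. It only measures total wall-clock time per run, so it won't by itself tell you how much of that time is spent inside MPI calls.

```python
import subprocess
import sys
import time

# Hypothetical layout: one (launcher, application binary) pair per MPI library.
# These paths are placeholders; point them at the real installs and builds.
BUILDS = {
    "openmpi": ("/opt/openmpi/bin/mpirun", "./mdrun_openmpi"),
    "mpich2": ("/opt/mpich2/bin/mpiexec", "./mdrun_mpich2"),
}
PROCESS_COUNTS = [1, 2, 4]


def build_commands(builds, counts, app_args):
    """Return (label, nprocs, argv) for every library/process-count pair."""
    runs = []
    for label, (launcher, binary) in builds.items():
        for n in counts:
            # "-n" is the process-count option accepted by both launchers.
            argv = [launcher, "-n", str(n), binary] + list(app_args)
            runs.append((label, n, argv))
    return runs


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def main(app_args):
    """Time every combination and print one line per run."""
    for label, n, argv in build_commands(BUILDS, PROCESS_COUNTS, app_args):
        print(f"{label:8s} n={n}: {time_command(argv):.1f} s")
```

Calling main() with whatever arguments the benchmark needs (for example main(sys.argv[1:])) gives six timings. If the two n=1 numbers already differ a lot, message passing is probably not the main culprit.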
I've seen differences between various MPI libraries before, but nothing quite this severe either. If I get the time, maybe I'll try to set up Gromacs tonight -- I've got both MPICH2 and OpenMPI installed here and can try to duplicate the runs. Sangamesh, is this a standard benchmark case that anyone can download and run?