On Mar 24, 2010, at 12:49 AM, Anton Starikov wrote:
> Two different OSes: CentOS 5.4 (2.6.18 kernel) and Fedora 12 (2.6.32 kernel).
> Two different CPUs: Opteron 248 and Opteron 8356.
> Same binary for Open MPI. Same binary for the user code (VASP, compiled for the older arch).
Are you sure that the code is binary compatible between the two platforms? Can you repeat the process with native builds of Open MPI and the app for both architectures?
> When I supply a rankfile, the results differ depending on the combination of OS and CPU:
> centos+Opt8356 : works
> centos+Opt248 : works
> fedora+Opt8356 : works
> fedora+Opt248 : fails
> The rankfile is (in the case of Opt248):
> rank 0=node014 slot=1
> rank 1=node014 slot=0
> I tried playing with the format and leaving only one slot (starting one process); it doesn't change the result.
> Without a rankfile it works on all combos.
Nifty (meaning: ick!).
I wonder if the processor affinity code is causing the problem here...? It could be a problem in a heterogeneous environment if the systems are "close" but not "exact" in terms of binary compatibility...?
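One quick way to poke at the affinity side is to compare the slots named in the rankfile against the CPUs the kernel actually lets a process run on. The following is only a rough sketch, not anything from Open MPI itself: the helper names `parse_rankfile` and `slots_outside` are made up for illustration, it only understands the simple `rank N=host slot=M` form quoted above, and it assumes that `slot=M` corresponds to OS CPU id M, which may not hold on every setup. It would need to be run on the node in question (inside the same cpuset) to say anything useful.

```python
import os
import re

# Matches only the simple "rank N=host slot=M" form quoted in this thread.
# Other slot syntaxes (ranges, socket:core, ...) are deliberately not handled.
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\d+)\s*$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = RANK_LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return entries


def slots_outside(entries, allowed_cpus):
    """Return the sorted slot ids that are not in allowed_cpus.

    Assumption: a rankfile slot id is the same number as the OS CPU id.
    """
    return sorted({slot for _, _, slot in entries if slot not in allowed_cpus})


if __name__ == "__main__":
    rankfile = "rank 0=node014 slot=1\nrank 1=node014 slot=0\n"
    entries = parse_rankfile(rankfile)

    # os.sched_getaffinity is Linux-only; it reflects cpuset restrictions.
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        print("CPUs this process may run on:", sorted(allowed))
        print("rankfile slots outside that set:", slots_outside(entries, allowed))
    else:
        print("sched_getaffinity is not available on this platform")
```

If the second line of output is non-empty on the failing fedora+Opt248 node but empty elsewhere, that would at least point toward the cpuset/binding interaction rather than the application.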
> Just in case: all of this happens inside a cpuset that always wraps all the slots given in the rankfile. (I use Torque with cpusets, plus my custom Torque patch that also creates the rankfile for Open MPI; MPI tasks are then bound to particular cores, and multithreaded codes are limited by the given cpuset.)
> AFAIR, it also works without problems on both hardware setups with 1.3.x/1.4.0 and the 2.6.30 kernel from openSUSE 11.1.
> Strangely, when I run the OSU benchmarks (osu_bw, etc.), it works without any problems.
Can you re-run with a trivial test, like MPI hello world and/or ring? See the examples/ directory.