I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret.
I agree, though, that this is likely something peculiar to our particular setup. Of primary concern is that it might be related to the relatively old kernel (2.6.18) on these machines. There has been a lot of change since that kernel was released, and some of those changes may be relevant to this problem.
Unfortunately, upgrading the kernel will take a persuasive argument. We are going to try to run the reproducers on machines with more modern kernels to see if we get different behavior.
Please feel free to follow this further on the ticket.
On Wed, 10 Jun 2009, Ralph Castain wrote:
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?
I wasn't able to reproduce this. I have run with the following setup:
- OS is Scientific Linux 5.1 with a custom compiled kernel based on 18.104.22.168, but (due to circumstances that I can't control):
checking if MCA component maffinity:libnuma can compile... no
- Intel compiler 10.1
- OpenMPI 1.3.2
- nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX IB DDR
I've used the platform file that you have provided, but took out the references to PanFS and fixed the paths. I've also used the MCA file that you have provided.
I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished successfully with m=50 several times. This, together with the earlier post also describing a negative result, points to a problem related to your particular setup...
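When comparing results across machines like this, it may help to record the kernel release and whether the kernel exposes any NUMA nodes on each host. The following is only a minimal sketch, not something taken from the ticket: the helper names `numa_nodes` and `describe_host` are hypothetical, and it assumes a Linux host where NUMA nodes, if any, appear as `nodeN` directories under `/sys/devices/system/node`. It says nothing about whether Open MPI itself was built with libnuma support; that is what the configure output above reports.

```python
import os
import platform
import re


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root.

    Returns an empty list if the directory does not exist, which is
    what one would expect on a kernel that exposes no NUMA topology.
    """
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return []
    ids = []
    for name in entries:
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def describe_host():
    """Collect the kernel release and the NUMA node ids for this host."""
    return {
        "kernel": platform.release(),
        "numa_nodes": numa_nodes(),
    }


if __name__ == "__main__":
    print(describe_host())
```

Running this on each machine used for the reproducers would give a quick side-by-side view of kernel release and NUMA visibility to attach to the ticket.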
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850