On Feb 29, 2012, at 5:39 AM, adrian sabou wrote:
> I am experiencing a rather unpleasant issue with a simple OpenMPI app. I have 4 nodes communicating with a central node. Performance is good and the application behaves as it should (i.e., performance steadily decreases as I increase the work size). My problem is that immediately after messages passed between nodes become larger than 128 KB, performance drops suddenly in an unexpected way. I have done some research and tried to modify various eager limits, without any success. I am a beginner in OpenMPI and I can't seem to figure out this issue. I am hoping that one of you might shed some light on this situation. My OpenMPI version is 1.5.4 on Ubuntu Server 10.04 64-bit. Any help is welcome. Thanks.
Lots of things can be a factor here (I assume you're using TCP over Ethernet?):
- are you using a network switch or hub?
- what kind of switch/hub is it? (switch quality can have a *lot* to do with network performance, and I don't say that just because of my employer :-) )
- is this a point-to-point pattern, or are multiple nodes communicating simultaneously? (I'm asking about network contention)
- how many procs are you running on each node? Are they all communicating simultaneously from each node?
- is the performance degradation only when communicating over TCP? Or does it happen when communicating over shared memory? Or both?
I think you probably want to test what happens with a simple point-to-point benchmark between two peers on different nodes, and observe the performance there. If you have a problem on your network or setup, you'll see it there. Then expand your testing to include multiple procs simultaneously (e.g., running the same 2-proc point-to-point benchmark multiple times simultaneously) and see what happens.
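As a rough illustration of that kind of two-peer test, here is a minimal Python sketch of a ping-pong benchmark. It uses plain TCP sockets rather than MPI, so it only exercises the network path and says nothing about Open MPI itself. All function names (recv_exact, echo_server, pingpong, bandwidth_drops, run_loopback) and the 0.5 drop threshold are made up for this example.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (TCP may deliver data in pieces)."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    got = 0
    while got < nbytes:
        n = sock.recv_into(view[got:], nbytes - got)
        if n == 0:
            raise ConnectionError("peer closed the connection")
        got += n
    return bytes(buf)


def echo_server(listener, sizes, reps):
    """Accept one peer and echo each message back (the 'pong' side)."""
    conn, _ = listener.accept()
    with conn:
        for size in sizes:
            for _ in range(reps):
                conn.sendall(recv_exact(conn, size))


def pingpong(host, port, sizes, reps):
    """Return {message size in bytes: average round-trip time in seconds}."""
    results = {}
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle so small messages are not delayed by coalescing.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for size in sizes:
            payload = b"x" * size
            start = time.perf_counter()
            for _ in range(reps):
                sock.sendall(payload)
                recv_exact(sock, size)
            results[size] = (time.perf_counter() - start) / reps
    return results


def bandwidth_drops(results, factor=0.5):
    """List sizes whose bandwidth is below factor * the previous size's.

    The threshold is an arbitrary assumption; tune it for your network.
    """
    sizes = sorted(results)
    # Each round trip moves `size` bytes in each direction.
    bw = [2 * s / results[s] for s in sizes]
    return [sizes[i] for i in range(1, len(sizes)) if bw[i] < factor * bw[i - 1]]


def run_loopback(sizes, reps=20):
    """Run both ends on 127.0.0.1; a self-check of the script, not a network test."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(
        target=echo_server, args=(listener, sizes, reps), daemon=True
    )
    server.start()
    try:
        return pingpong("127.0.0.1", port, sizes, reps)
    finally:
        server.join(timeout=5)
        listener.close()
```

To test a real link, you would run echo_server on one node (with a listening socket bound to a port of your choosing) and pingpong on another, using sizes on both sides of 128 KB, e.g. 64 KB, 128 KB, 256 KB. If bandwidth_drops flags the same size here that it does in your MPI runs, the network or TCP setup is the more likely suspect; if not, look at the MPI layer or the application.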
If all that looks good, then start looking hard at your application communication pattern. When you hit 128 KB message size, are you exhausting cache sizes, or creating some other kind of algorithmic congestion? Look for things like this.