Truly am sorry about that - we were just talking today about the need to update and improve our FAQ on running on large clusters. Did you by any chance look at it? Would appreciate any thoughts on how it should be improved from a user's perspective.
On Sep 20, 2011, at 3:28 PM, Henderson, Brent wrote:
Nope, but if I had, that would have saved me about an hour of coding time!
I’m still curious if it would be beneficial to inject some barriers at certain locations so that if you had a slow node, not everyone would end up connecting to it all at once. Anyway, if I get access to another large TCP cluster, I’ll give it a try.
Hmmm....perhaps you didn't notice the mpi_preconnect_all option? It does precisely what you described - it pushes zero-byte messages around a ring to force all the connections open at MPI_Init.
On Sep 20, 2011, at 3:06 PM, Henderson, Brent wrote:
I recently had access to a 200+ node Magny Cours (24 ranks/host) 10G Linux cluster. I was able to use OpenMPI v1.5.4 with hello world, IMB and HPCC, but there were a couple of issues along the way. After raising some system tunables a little on all of the nodes, a hello_world program worked just fine. It appears that the TCP connections between most or all of the ranks are deferred until they are actually used, so the easy test ran reasonably quickly. I then moved to IMB.
I typically don’t care about the small rank counts, so I add the -npmin 99999 option to run only the ‘big’ number of ranks. This ended with an abort after MPI_Init(), but before running any tests. Many (possibly all) of the ranks emitted messages that looked like:
‘[n112][[13200,1],1858][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 172.23.4.1 failed: Connection timed out (110)’
Here n112 is one of the nodes in the job, and 172.23.4.1 is the first node in the job. One of the first things that IMB does before running a test is create a communicator for each specific rank count it is testing. Apparently this collective operation causes a large number of connections to be made. The abort messages (one example shown above) all show the connect failure to a single node, so it would appear that a very large number of nodes attempted to connect to that one at the same time and overwhelmed it. (Or it was slow and everyone ganged up on it as they worked their way around the ring. :-) ) Is there a supported/suggested way to work around this? It was very repeatable.
I was able to work around this by supplying my own definitions of MPI_Init() and MPI_Init_thread() that call the ‘P’ (PMPI) version of the routine, and then having each rank send its rank number to the rank one to the right, then two to the right, and so on around the ring. I added an MPI_Barrier(MPI_COMM_WORLD) call every N messages to keep things at a controlled pace. N was 64 by default, but settable via an environment variable in case that number didn’t work well for some reason. This fully connected the mesh (110k socket connections per host!) and allowed the tests to run. Not a great solution, I know, but I’ll throw it out there until I know the right way.
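For readers who want to see the shape of that workaround, here is a rough sketch in plain Python. It only models the schedule; it is not the MPI code that was actually run. The function names are hypothetical, and the socket estimate assumes that ranks on the same host do not use TCP to talk to each other, which the original message does not state.

```python
def ring_preconnect_schedule(num_ranks, barrier_every=64):
    """Steps every rank performs: send to the rank `offset` to the right,
    with a barrier after every `barrier_every` sends (the N above)."""
    steps = []
    for offset in range(1, num_ranks):
        steps.append(("send", offset))
        if offset % barrier_every == 0:
            steps.append(("barrier", None))
    return steps


def connected_pairs(num_ranks, barrier_every=64):
    """Set of rank pairs that have exchanged a message once the
    schedule completes (each pair models one opened connection)."""
    pairs = set()
    for kind, offset in ring_preconnect_schedule(num_ranks, barrier_every):
        if kind != "send":
            continue
        for rank in range(num_ranks):
            peer = (rank + offset) % num_ranks
            pairs.add(frozenset((rank, peer)))
    return pairs


def tcp_sockets_per_host(num_hosts, ranks_per_host):
    """Sockets open on one host in a fully connected mesh.
    Assumption: only ranks on *other* hosts are reached over TCP."""
    remote_ranks = (num_hosts - 1) * ranks_per_host
    return ranks_per_host * remote_ranks
```

At any given offset each rank is the target of exactly one sender, so no rank sees more than N new inbound connections between two barriers. For a hypothetical job of exactly 200 hosts with 24 ranks each, tcp_sockets_per_host(200, 24) gives 114,624, which is in line with the roughly 110k figure quoted above.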
Once I had this in place, I used the workaround with HPCC as well. Without it, it would not get very far at all. With it, I was able to make it through the entire test.
Looking forward to getting the experts' thoughts about the best way to handle big TCP clusters – thanks!
P.S. v1.5.4 worked *much* better than v1.4.3 on this cluster – not sure why, but kudos to those working on changes since then!