Sounds like something is making the TCP connections unstable. Last time I looked at HVM instances, they were running something like 64G of memory? If you have more than one proc per node (as your output would indicate), and you are doing collectives on such large data sizes, it's quite possible you are running out of memory due to the way the collective algorithms work - and perhaps trashing the connection (which would explain the node being unreachable until the OS can reset it).

You might try running with fewer procs/node to see if that helps.
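For example, something along these lines (untested; assumes a hostfile named "hosts" and a Python driver script - adjust the per-node count to whatever fits your cores and memory):

  mpirun -hostfile hosts -npernode 8 python my_script.py

Capping the job at 8 procs per node instead of filling every core leaves each proc roughly twice the memory to work with during the collectives.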


On Apr 4, 2013, at 11:10 AM, Yevgeny Popkov <ypopkov@gmail.com> wrote:

Hi,

I am running some matrix-algebra-based calculations on Amazon EC2 (HVM instances running Ubuntu 11.1 with OpenMPI 1.6.4 and Python bindings via mpi4py 1.3). I am using StarCluster to spin up the instances, so all nodes in a given cluster are in the same placement group (i.e. on the same 10 Gb network).

My calculations are CPU-bound and I use just a few collective operations (namely allgatherv, scatterv, bcast, and reduce) that exchange a non-trivial amount of data (the full distributed dense matrix reaches tens of gigabytes in size - e.g. I use allgatherv on that matrix).
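A simplified sketch of the kind of allgatherv call I mean (the array sizes and variable names are illustrative only, not my actual code):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

# Each rank owns a block of rows; the real blocks are far larger than this.
n_cols = 512
rows_per_rank = 256
local_block = np.random.rand(rows_per_rank, n_cols)   # float64

# Element counts and displacements for each rank's contribution.
counts = np.full(nprocs, rows_per_rank * n_cols, dtype=np.int64)
displs = np.insert(np.cumsum(counts)[:-1], 0, 0)

# Every rank ends up with its own copy of the full matrix.
full_matrix = np.empty((rows_per_rank * nprocs, n_cols))
comm.Allgatherv(local_block, [full_matrix, counts, displs, MPI.DOUBLE])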

For smaller matrix sizes everything works fine, but once I start increasing the number of parameters in my models and, as a result, the number of nodes/workers needed to keep up, I get errors like these:

[node005][[18726,1],125][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],8][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],108][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node008][[18726,1],28][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node007][[18726,1],7][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node001][[18726,1],21][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I've also seen other network-related errors such as "unable to find path to host". Whenever I get these errors, one or more of the EC2 nodes becomes "unreachable" according to the EC2 Web UI (even though I can log in to those nodes using their internal IP aliases). Such nodes typically recover from being "unreachable" after a few minutes, but my MPI job hangs anyway. The culprit is usually allgatherv, but I've seen reduce and bcast cause these errors as well.

I don't get these errors if I run on a single node (but that's way too slow even with 16 workers, so I need to run my jobs on at least 20 nodes).

Any idea how to fix this? Maybe by adjusting some Linux and/or OpenMPI parameters?

Any help would be greatly appreciated!

Thanks,
Yevgeny

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users