
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
From: Terry Frankcombe (terry_at_[hidden])
Date: 2010-02-23 22:01:20


VASP can be temperamental. For example, I have a largish system (384
atoms) for which VASP hangs if I request more than 120 MD steps at a
time. I am not convinced that this is OMPI's problem. However, your
case looks much more diagnosable than my silent spinning hang.
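
If you suspect the TCP side, one thing that may help narrow it down is
pinning Open MPI to a single known-good interface and turning up the
BTL verbosity. A minimal sketch (the interface name eth0 and the
process count are my assumptions, not taken from your mail):

    # restrict the TCP BTL to one interface and show what it negotiates
    mpirun -np 8 \
        --mca btl tcp,sm,self \
        --mca btl_tcp_if_include eth0 \
        --mca btl_base_verbose 30 \
        ./vasp

If the readv errors disappear once the BTL is restricted to a single
interface, the culprit is more likely a second NIC or a firewall than
Open MPI itself.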

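Also, since mpirun reports that rank 6 segfaulted, the connection
resets on node12 are probably just the other ranks noticing that their
peer died. Before digging into the network it may be worth ruling out
the usual Intel Fortran stack-limit problem; a quick check (assuming
you can ssh to the nodes named in the error) would be something like:

    # print the stack limit each node actually gives remote shells
    for n in node11 node12; do
        ssh $n 'hostname; ulimit -s'
    done

and, if that limit is small, raising it (e.g. ulimit -s unlimited in
the job script) before rerunning.
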
On Tue, 2010-02-23 at 16:00 -0500, Thomas Sadowski wrote:
> Hello all,
>
>
> I am currently attempting to use OpenMPI as my MPI for my VASP
> calculations. VASP is an ab initio DFT code. Anyhow, I was able to
> compile and build OpenMPI v. 1.4.1 (I thought) correctly using the
> following command:
>
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort
> --with-tm=/usr/local
>
>
> Note that I am compiling OpenMPI for use with Torque/PBS which was
> compiled using Intel v. 10 Fortran compilers and gcc for C/C++. After
> building OpenMPI, I successfully used it to compile VASP using Intel
> MKL v. 10.2. I am running OpenMPI in a heterogeneous cluster
> configuration, and I used an NFS mount so that all the compute nodes
> could access the executable. Our hardware configuration is as follows:
>
> 7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth
> called old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2GB of RAM (henceforth
> called new nodes)
>
> We are currently running SUSE v. 8.x. Now we have problems when we
> attempt to run VASP on multiple nodes. A small system (~10 atoms) runs
> perfectly well with Torque and OpenMPI in all instances: on a single
> old node, on a single new node, or across multiple old and new nodes.
> Larger systems (>24 atoms) are able to run to completion if they are
> kept within a single old or new node. However, if I try to run a job
> on multiple old or new nodes, I receive a segfault. In particular, the
> error is as follows:
>
>
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
>
>
>
> It seems to me that this is a memory issue; however, I may be
> mistaken. I have searched the archive and have not yet seen an
> adequate treatment of the problem. I have also tried other versions of
> OpenMPI. Does anyone have any insight into our issues?
>
>
> -Tom
>