Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-02-24 21:25:46


*Usually*, I have seen these "readv failed: ..." kinds of error messages as a side effect of an MPI process exiting abnormally. The "readv..." messages come from the remaining peers, whose sockets suddenly closed unexpectedly (because of the dead peer).

Check into the signal 11 message (that's a segv); that might be the real error.

On Feb 23, 2010, at 4:00 PM, Thomas Sadowski wrote:

> Hello all,
>
>
> I am currently attempting to use OpenMPI as my MPI for my VASP calculations. VASP is an ab initio DFT code. Anyhow, I was able to compile and build OpenMPI v. 1.4.1 (I thought) correctly using the following command:
>
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort --with-tm=/usr/local
>
>
> Note that I am compiling OpenMPI for use with Torque/PBS, which was compiled using Intel v 10 Fortran compilers and gcc for C/C++. After building OpenMPI, I successfully used it to compile VASP using Intel MKL v. 10.2. I am running OpenMPI in a heterogeneous cluster configuration, and I used an NFS mount so that all the compute nodes could access the executable. Our hardware configuration is as follows:
>
> 7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth called old nodes)
> 4 nodes: 2 dual-core AMD Opteron processors, 2GB of RAM (henceforth called new nodes)
>
> We are currently running SUSE v. 8.x. Now we have problems when we attempt to run VASP on multiple nodes. A small system (~10 atoms) runs perfectly well with Torque and OpenMPI in all instances: on a single old node, on a single new node, or across multiple old and new nodes. Larger systems (>24 atoms) are able to run to completion if they are kept within a single old or new node. However, if I try to run a job on multiple old or new nodes, I receive a segfault. In particular, the error is as follows:
>
>
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
>
>
>
> It seems to me that this is a memory issue; however, I may be mistaken. I have searched the archives and have not yet seen an adequate treatment of the problem. I have also tried other versions of OpenMPI. Does anyone have any insight into our issues?
>
>
> -Tom
>
>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/