
Subject: Re: [OMPI users] [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
From: Terry Frankcombe (terry_at_[hidden])
Date: 2010-02-23 22:01:20


VASP can be temperamental. For example, I have a largish system (384
atoms) for which VASP hangs if I request more than 120 MD steps at a
time. I am not convinced that this is OMPI's problem. However, your
case looks much more diagnosable than my silent spinning hang.
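
If you want to dig into the segfault itself, one option (a rough
sketch; the binary path is a placeholder and the core file name
depends on your kernel settings) is to let the failing ranks dump
core and then pull a backtrace out of it with gdb:

  # allow core dumps in the shell/job script that launches mpirun
  # (and in the shell init on the compute nodes, so remote ranks
  # inherit it too)
  ulimit -c unlimited

  # after the crash, on the node named in the mpirun error message:
  gdb /path/to/vasp core      # the core file may be named core.<pid>
  (gdb) bt                    # backtrace of the crashed rank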

On Tue, 2010-02-23 at 16:00 -0500, Thomas Sadowski wrote:
> Hello all,
>
>
> I am currently attempting to use OpenMPI as my MPI for my VASP
> calculations. VASP is an ab initio DFT code. Anyhow, I was able to
> compile and build OpenMPI v. 1.4.1 (I thought) correctly using the
> following command:
>
> ./configure --prefix=/home/tes98002 F77=ifort FC=ifort
> --with-tm=/usr/local
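
As a quick sanity check on that build (just the usual approach, not
specific to your install), ompi_info should list the tm components if
the Torque support really got compiled in:

  ompi_info | grep tm
  # expect lines along the lines of:
  #   MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.1)
  #   MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.1)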
>
>
> Note that I am compiling OpenMPI for use with Torque/PBS which was
> compiled using the Intel v. 10 Fortran compilers and gcc for C/C++. After
> building OpenMPI, I successfully used it to compile VASP using Intel
> MKL v. 10.2. I am running OpenMPI in heterogeneous cluster
> configuration, and I used an NFS mount so that all the compute nodes
> could access the executable. Our hardware configuration is as follows:
>
> 7 nodes: 2 single-core AMD Opteron processors, 2GB of RAM (henceforth
> called old nodes)
> 4 nodes: 2 duo-core AMD Opteron processors, 2GB of RAM (henceforth
> called new nodes)
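
For reference, a minimal Torque job script for this kind of setup
might look like the sketch below; the job name, node counts, walltime
and the NFS path to the VASP binary are placeholders, not taken from
your setup:

  #!/bin/sh
  #PBS -N vasp_test
  #PBS -l nodes=2:ppn=2
  #PBS -l walltime=12:00:00

  cd $PBS_O_WORKDIR
  # with --with-tm compiled in, mpirun gets the node list from Torque,
  # so no hostfile is needed; -np is shown only for clarity
  mpirun -np 4 /home/shared/vasp/vasp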
>
> We are currently running SUSE v. 8.x. Now we have problems when we
> attempt to run VASP on multiple nodes. A small system (~10 atoms) runs
> perfectly well with Torque and OpenMPI in all instances: running on a
> single old node, a single new node, or across multiple old and new
> nodes. Larger systems (>24 atoms) are able to run to completion if
> they are kept within a single old or new node. However, if I try to
> run a job on multiple old or new nodes I receive a segfault. In
> particular, the error is as follows:
>
>
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],1][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],3][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node12][[7759,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 6 with PID 11985 on node node11
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
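
The readv "Connection reset by peer" messages are usually just the
other ranks noticing that the rank which segfaulted has gone away, so
the segfault on rank 6 is the thing to chase. That said, since the
failures only show up once traffic has to cross nodes, one cheap thing
to try (a sketch only; eth0 is a guess at your interface name) is to
pin the TCP BTL to the interface the nodes actually share, in case it
is also trying to use an unroutable one:

  mpirun --mca btl_tcp_if_include eth0 -np 8 /home/shared/vasp/vasp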
>
>
>
> It seems to me that this is a memory issue; however, I may be mistaken.
> I have searched the archive and have not yet seen an adequate treatment
> of the problem. I have also tried other versions of OpenMPI. Does
> anyone have any insight into our issues?
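
On the memory-issue guess: one classic cause of segfaults with
ifort-built codes like VASP on larger systems is a small stack limit
on the compute nodes. A quick check (the commands are standard, the
binary path is a placeholder) is to print the limits from inside a
Torque job and, if the stack is small, raise it before mpirun:

  ulimit -a                # check the "stack size" line
  ulimit -s unlimited      # note: under Torque this only affects ranks
                           # on the node running the job script; remote
                           # nodes inherit limits from pbs_mom, so they
                           # may need it set in their shell init instead
  mpirun -np 4 /home/shared/vasp/vasp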
>
>
> -Tom
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users