Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Possible openmpi bug?
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-20 11:24:01


Try adjusting this:
oob_tcp_peer_retries = 10

to be

oob_tcp_peer_retries = 1000

It should have given you an error if this failed, but let's give it a try
anyway.
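
For a quick test without editing the system-wide file, a sketch along these lines may help. It relies on Open MPI reading MCA parameters from environment variables carrying the OMPI_MCA_ prefix, and from the mpirun command line via --mca. The application name ./my_app is a placeholder and does not come from this thread.

```shell
#!/bin/sh
# Per-job override of the retry count. Environment settings take precedence
# over values read from the MCA parameter files.
OMPI_MCA_oob_tcp_peer_retries=1000
export OMPI_MCA_oob_tcp_peer_retries

# Equivalent one-off form on the command line (./my_app is a placeholder):
#   mpirun --mca oob_tcp_peer_retries 1000 ./my_app

echo "oob_tcp_peer_retries override: $OMPI_MCA_oob_tcp_peer_retries"
```

The override only applies to jobs launched from the shell that exported it, so the file on each node stays untouched while experimenting.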

You might also check to see if you are hitting memory limitations. If so, or
if you just want to try anyway, try reducing the value of
coll_sync_barrier_before.
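
A rough sketch of both checks, assuming bash on one of the compute nodes. The value 100 is only an illustrative lower setting (the quoted file currently uses 1000), not a recommendation from this thread.

```shell
#!/bin/bash
# Show the resource limits that apply to processes started from this shell.
# A finite "virtual memory" or "max locked memory" entry deserves a closer look.
ulimit -a

# Capture the virtual memory limit ("unlimited" or a size in kilobytes).
vmem_limit=$(ulimit -v)
echo "virtual memory limit: $vmem_limit"

# Try a smaller barrier interval for this job only; 100 is an arbitrary
# example value chosen for illustration.
OMPI_MCA_coll_sync_barrier_before=100
export OMPI_MCA_coll_sync_barrier_before
```

Limits can differ between an interactive login and a batch job, so it is worth running the same check from inside a job submitted through slurm.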

Ralph

On Mon, Jul 20, 2009 at 9:17 AM, Steven Dale <steven_dale_at_[hidden]> wrote:

>
> Okay, now the plot is just getting weirder.
>
> I implemented most of the changes you recommended below. We are not running
> Panasas, and our network is Gigabit Ethernet only, so I left the openib
> parameters out as well. I also recompiled with the switches suggested in the
> tlcc directory for the non-Panasas file.
>
> Now our test case will run on 10 nodes with 160 permutations, which is a
> step forward. It does however still crash with a routed:binomial error on 10
> nodes with 1600 permutations after about 14 minutes. With 800 permutations,
> it runs quite happily as well.
>
> The current openmpi-mca-params.conf is now:
>
> # If $sysconf is a directory on a local disk, it is likely that changes
> # to this file will need to be propagated to other nodes. If $sysconf
> # is a directory that is shared via a networked filesystem, changes to
> # this file will be visible to all nodes that share this $sysconf.
>
> # The format is straightforward: one per line, mca_param_name =
> # rvalue. Quoting is ignored (so if you use quotes or escape
> # characters, they'll be included as part of the value). For example:
>
> # Disable run-time MPI parameter checking
> # mpi_param_check = 0
>
> # Note that the value "~/" will be expanded to the current user's home
> # directory. For example:
>
> # Change component loading path
> # component_path = /usr/local/lib/openmpi:~/my_openmpi_components
>
> # See "ompi_info --param all all" for a full listing of Open MPI MCA
> # parameters available and their default values.
> orte_abort_timeout = 10
> opal_set_max_sys_limits = 1
> orte_no_session_dirs = /usr,/users,/home,/hcadmin
> orte_tmpdir_base = /tmp
> orte_allocation_required = 1
> coll_sync_priority = 100
> coll_sync_barrier_before = 1000
> coll_hierarch_priority = 90
> oob_tcp_if_include=eth3
> oob_tcp_peer_retries = 10
> oob_tcp_disable_family = IPv6
> oob_tcp_listen_mode = listen_thread
> oob_tcp_sndbuf = 65536
> oob_tcp_rcvbuf = 65536
> btl = sm,tcp,self
> ## Setup MPI options
> mpi_show_handle_leaks = 0
> mpi_warn_on_fork = 1
>
> Current compilation looks like this:
>
> #!/bin/sh
>
> # Takes about 20-25 minutes
>
> # Append /usr/local/bin; a trailing colon would add the current directory
> # to the search path.
> PATH=$PATH:/usr/local/bin
> LDFLAGS="-m64"
> CFLAGS="-m64"
> CXXFLAGS="-m64"
> FCFLAGS="-m64"
> FFLAGS="-m64"
> # Export the flags so that configure and make actually see them.
> export PATH LDFLAGS CFLAGS CXXFLAGS FCFLAGS FFLAGS
>
> # Build and install OpenMPI
>
> cd openmpi/openmpi-1.3.3 || exit 1
>
> # One configure command, continued across lines with backslashes.
> sh ./configure --enable-dlopen=no --enable-binaries=yes --enable-shared=yes \
>   --enable-ipv6=no --enable-ft-thread=no \
>   --enable-mca-no-build=crs,filem,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm \
>   --with-slurm=yes --with-io-romio-flags="--with-file-system=ufs+nfs" \
>   --with-memory-manager=ptmalloc2 --with-wrapper-ldflags="-m64" \
>   --with-wrapper-cxxflags="-m64" --with-wrapper-fcflags="-m64" \
>   --with-wrapper-fflags="-m64" || exit 1
>
> # Install only if the build succeeded.
> make && make install
>
> ____________________
> Steve Dale
> Senior Platform Analyst
> Health Canada
>
>
>
> From: Ralph Castain <rhc_at_[hidden]>
> Sent by: users-bounces_at_[hidden]
> Date: 07/17/2009 10:35 AM
> Please respond to: Open MPI Users <users_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] Possible openmpi bug?
>
>
>
>
> Okay, just checking the obvious. :-)
>
> We regularly run with the exact same configuration here (i.e., slurm +
> 16cpus/node) without problem on jobs that are both short and long, so it
> seems doubtful that it would be an OMPI bug. However, it is possible as the
> difference could be due to configuration and/or parameter settings. We have
> seen some site-specific problems that are easily resolved with parameter
> changes.
>
> You might take a look at our (LANL's) platform files for our slurm-based
> system and see if they help. You will find them in the tarball at
>
> contrib/platform/lanl/tlcc
>
> Specifically, since you probably aren't running panasas (?), look at the
> optimized-nopanasas and optimized-nopanasas.conf (they are a pair) files to
> see how we configure the system for build, and the mca params we use to
> execute applications. If you can, I would suggest giving them a try
> (adjusting as required for your setup - e.g., you may not want the -m64
> flags) and seeing if that resolves the problem.
>
> Ralph
>
> On Jul 17, 2009, at 7:15 AM, Steven Dale wrote:
>
>
> I think it unlikely that it's a time limit thing. Firstly, slurm is set up
> with no time limit on jobs, and we get the same behaviour whether or not
> slurm is in the picture.
> In addition, we've run several other much larger jobs with a greater number
> of permutations and they complete fine.
>
> This job takes about 5-10 minutes to run. We've run jobs that take a week
> or more, where the individual R processes can be seen to run for days at a
> time, and they run fine.
>
> In addition, I'd find it hard to believe (although I concede the
> possibility) that jobs entirely self-contained within the same box run
> slower than jobs which span 2 boxes over the network (14 cpus vs 17 cpus,
> for example).
>
>
> ____________________
> Steve Dale
> Senior Platform Analyst
> Health Canada
> Phone: (613)-948-4910
> E-mail: steven_dale_at_[hidden]
>
> From: Ralph Castain <rhc_at_[hidden]>
> Sent by: users-bounces_at_[hidden]
> Date: 07/17/2009 01:13 AM
> Please respond to: Open MPI Users <users_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] Possible openmpi bug?
>
>
>
>
>
>
> From what I can see, it looks like your job is being terminated -
> something is killing mpirun. Is it possible that the job runs slowly enough
> on 14 or fewer cpus that it simply isn't completing within your specified
> time limit?
>
> The lifeline message simply indicates that a process self-aborted because
> it lost contact with its local daemon - in this case, mpirun (as that is
> always daemon 0) - which means that the daemon was terminated for some
> reason.
>
>
> On Jul 16, 2009, at 11:15 AM, Steven Dale wrote:
>
>
> Here is my situation:
>
> 2 Dell R900's with 16 cpus each and 64 GB RAM
> OS: SuSE SLES 10 SP2 patched up to date
> R version 2.9.1
> Rmpi version 0.5-7
> snow version 0.3-3
> maanova library version 1.14.0
> openmpi version 1.3.3
> slurm version 2.0.3
>
> With a given set of R code, we get abnormal exits when using 14 or fewer
> cpus. When using 15 or more, the job completes normally.
> The error is a variation on:
>
> [pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline
> [[15549,0],0] lost
>
> during the array permutations.
>
> Increasing the number of permutations above 200 also produces similar
> results.
>
> The R code is executed with a typical command line for 14 cpus being:
>
> sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_dale_at_[hidden] /usr/local/bin/R --no-save
>
>
> Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The
> network is Gigabit Ethernet over copper, running TCP/IP.
>
>
> I think this is an openmpi error/bug because of the routed:binomial message.
> We got the same results with openmpi-1.3.2, R 2.9.0, maanova 1.12 and
> slurm 2.0.1.
>
>
> No non-default MCA parameters are set.
>
> LD_LIBRARY_PATH=/usr/local/lib.
>
> Configuration done with defaults.
>
> Any ideas are welcome.
>
>
>
>
> ____________________
> Steve Dale
> <bugrep.tar.bz2>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users