Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun only works when -np <4 (Gus Correa) RESOLVED FOR NOW
From: Matthew MacManes (macmanes_at_[hidden])
Date: 2009-12-10 11:28:03


Mark,

Exciting... SOLVED. There is an open ticket #2043 regarding the Nehalem/OpenMPI hang problem (https://svn.open-mpi.org/trac/ompi/ticket/2043). It seems the problem might be specific to gcc 4.4.x and OMPI <1.3.2. It also seems there is a group of us with dual-socket Nehalems trying to use OMPI without much luck (or at least not without headaches).

Of note, -mca btl_sm_num_fifos 7 seems to work as well.
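In case anyone wants to copy it, the mpirun line with that workaround would be a sketch like this (using my install path, run from the examples directory):

  /home/macmanes/apps/openmpi1.4/bin/mpirun -mca btl_sm_num_fifos 7 -np 8 connectivity_c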

Now off to see if I can get some real code to work...

Thanks, Mark, Gus, and the rest of the OMPI Users Group!

On Dec 10, 2009, at 7:42 AM, Mark Bolstad wrote:

>
> Just a quick interjection: I also have a dual quad-core Nehalem system, HT on, 24GB RAM, hand-compiled 1.3.4 with options: --enable-mpi-threads --enable-mpi-f77=no --with-openib=no
>
> With v1.3.4 I see roughly the same behavior: hello and ring work, connectivity fails randomly with np >= 8. Turning on -v increased the success rate, but it still hangs. np = 16 fails more often, and the hang is random as to which pair of processes is communicating.
>
> However, it seems to be related to the shared memory layer problem. Running with -mca btl ^sm works consistently through np = 128.
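> For clarity, that flag disables the shared memory (sm) BTL entirely (the ^ negates the list), so on-node traffic typically falls back to TCP over loopback. The invocation would look something like:
>
>   mpirun -mca btl ^sm -np 16 connectivity_c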
>
> Hope this helps.
>
> Mark
>
> On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa <gus_at_[hidden]> wrote:
> Hi Matthew
>
> Barring any misinterpretation I may have made of the code:
>
> Hello_c has no real communication, except for a final Barrier
> synchronization.
> Each process prints "hello world" and that's it.
>
> Ring probes a little more, with processes Send(ing) and
> Recv(eiving) messages.
> Ring just passes a message sequentially along all process
> ranks, then back to rank 0, and repeats the game 10 times.
> Rank 0 is in charge of counting turns, decrementing the counter,
> and printing that (nobody else prints).
> With 4 processes:
> 0->1->2->3->0->1... 10 times
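> A minimal sketch of that pattern (not the actual ring_c.c source;
> it assumes only the standard MPI C API):
>
>   /* ring.c: pass a counter around the ring; rank 0 decrements it */
>   #include <stdio.h>
>   #include <mpi.h>
>
>   int main(int argc, char *argv[])
>   {
>       int rank, size, next, prev, message, tag = 201;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       next = (rank + 1) % size;          /* neighbor to send to */
>       prev = (rank + size - 1) % size;   /* neighbor to receive from */
>
>       if (rank == 0) {                   /* rank 0 injects the counter */
>           message = 10;
>           MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
>       }
>
>       while (1) {
>           MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
>                    MPI_STATUS_IGNORE);
>           if (rank == 0) {               /* only rank 0 counts and prints */
>               message--;
>               printf("Process 0 decremented value: %d\n", message);
>           }
>           MPI_Send(&message, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
>           if (message == 0)
>               break;                     /* everyone relays the final 0 */
>       }
>
>       if (rank == 0)                     /* drain the last message */
>           MPI_Recv(&message, 1, MPI_INT, prev, tag, MPI_COMM_WORLD,
>                    MPI_STATUS_IGNORE);
>
>       MPI_Finalize();
>       return 0;
>   }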
>
> In connectivity every pair of processes exchanges a message.
> Therefore it probes all pairwise connections.
> In verbose mode you can see that.
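> In code, the pairwise probing looks roughly like this (again a
> sketch, not the actual connectivity_c source):
>
>   /* connectivity.c: every pair (i, j) exchanges one message */
>   #include <stdio.h>
>   #include <mpi.h>
>
>   int main(int argc, char *argv[])
>   {
>       int rank, size, i, j, token = 1;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>       for (i = 0; i < size; i++)
>           for (j = i + 1; j < size; j++) {
>               if (rank == i) {           /* lower rank pings... */
>                   MPI_Send(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
>                   MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD,
>                            MPI_STATUS_IGNORE);
>               } else if (rank == j) {    /* ...higher rank echoes back */
>                   MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
>                            MPI_STATUS_IGNORE);
>                   MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
>               }
>           }
>
>       MPI_Barrier(MPI_COMM_WORLD);       /* wait for all pairs to finish */
>       if (rank == 0)
>           printf("Connectivity test on %d processes PASSED.\n", size);
>
>       MPI_Finalize();
>       return 0;
>   }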
>
> These programs shouldn't hang at all if the system is sane.
> Actually, they should even run with a significant level of
> oversubscription: say, -np 128 should work easily for all three
> programs on a powerful machine like yours.
>
>
> **
>
> Suggestions
>
> 1) Stick to the OpenMPI you compiled.
>
> **
>
> 2) You can run connectivity_c in verbose mode:
>
> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
>
> (Note the trailing "-v".)
>
> It should tell more about who's talking to whom.
>
> **
>
> 3) I wonder if there are any BIOS settings that may be required
> (and perhaps not in place) to make Nehalem hyperthreading
> work properly on your computer.
>
> You reach the BIOS settings by typing <DEL> or <F2>
> when the computer boots up.
> The key varies by
> BIOS and computer vendor, but shows quickly on the bootup screen.
>
> You may ask the computer vendor about the recommended BIOS settings.
> If you haven't done this before, be careful to change and save only
> what really needs to change (if anything really needs to change),
> or the result may be worse.
> (Overclocking is for gamers, not for genome researchers ... :) )
>
> **
>
> 4) What I read about Nehalem DDR3 memory is that it is optimal
> on configurations that are multiples of 3GB per CPU.
> Common configs in dual-CPU machines like yours are
> 6, 12, 24, and 48GB.
> The sockets where you install the memory modules also matter.
>
> Your computer has 20GB.
> Did you build the computer or upgrade the memory yourself?
> Do you know how the memory is installed, in which memory sockets?
> What does the vendor have to say about it?
>
> See this:
> http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
>
> **
>
> 5) As I said before, typing "f" then "j" on "top" will add
> a column (labeled "P") that shows in which core each process is running.
> This will let you observe how the Linux scheduler is distributing
> the MPI load across the cores.
> Hopefully it is load-balanced, and different processes go to different
> cores.
>
> ***
>
> It is very disconcerting when MPI processes hang.
> You are not alone.
> The reasons are not always obvious.
> At least in your case there is no network involved to troubleshoot.
>
>
> **
>
> I hope it helps,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>
>
>
> Matthew MacManes wrote:
> Hi Gus and List,
>
> First of all, Gus, I want to say thanks: you have been a huge help, and when I get this fixed, I owe you big time!
>
> However, the problems continue...
>
> I formatted the HD and reinstalled the OS to make sure that I was working from scratch. I did your step A, which seemed to go fine:
>
> macmanes_at_macmanes:~$ which mpicc
> /home/macmanes/apps/openmpi1.4/bin/mpicc
> macmanes_at_macmanes:~$ which mpirun
> /home/macmanes/apps/openmpi1.4/bin/mpirun
>
> Good stuff there...
>
> I then compiled the example files:
>
> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> Process 1 exiting
> Process 2 exiting
> Process 3 exiting
> Process 4 exiting
> Process 5 exiting
> Process 6 exiting
> Process 7 exiting
> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
> Connectivity test on 8 processes PASSED.
> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$ /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
> ..HANGS..NO OUTPUT
>
> This is maddening because ring_c works, and connectivity_c worked the first time but not the second... I ran it 10 times, and it worked twice. Here is the top screenshot:
>
> http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394
>
> What is the difference between connectivity_c and ring_c? Under what circumstances should one fail and not the other?
>
> I'm off to the Linux forums to see about the Nehalem kernel issues..
>
> Matt
>
>
>
> On Wed, Dec 9, 2009 at 13:25, Gus Correa <gus_at_[hidden]> wrote:
>
> Hi Matthew
>
> There is no point in trying to troubleshoot MrBayes and ABySS
> if not even the OpenMPI test programs run properly.
> You must straighten them out first.
>
> **
>
> Suggestions:
>
> **
>
> A) While you are at OpenMPI, do yourself a favor,
> and install it from source on a separate directory.
> Who knows if the OpenMPI package distributed with Ubuntu
> works right on Nehalem?
> Better install OpenMPI yourself from source code.
> It is not a big deal, and may save you further trouble.
>
> Recipe:
>
> 1) Install gfortran and g++ if you don't have them using apt-get.
> 2) Put the OpenMPI tarball in, say, /home/matt/downloads/openmpi
> 3) Make another install directory *not in the system directory tree*.
> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z=version)
> will work
> 4) cd /home/matt/downloads/openmpi
> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
> --prefix=/home/matt/apps/openmpi-X.Y.Z
> (Use the prefix flag to install in the directory of item 3.)
> 6) make
> 7) make install
> 8) At the bottom of your /home/matt/.bashrc or .profile file
> put these lines:
>
> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
> export LD_LIBRARY_PATH=/home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>
> (If you use csh/tcsh use instead:
> setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
> etc)
>
> 9) Logout and login again to freshen up the environment variables.
> 10) Do "which mpicc" to check that it is pointing to your newly
> installed OpenMPI.
> 11) Recompile and rerun the OpenMPI test programs
> with 2, 4, 8, 16, ... processes.
> Use full path names to mpicc and to mpirun
> if the change of PATH above doesn't work right.
> (The whole recipe is condensed into a shell sketch below.)
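> Condensed into a single shell session (a sketch; substitute the
> real version number for X.Y.Z, sh/bash syntax assumed):
>
>   cd /home/matt/downloads/openmpi        # the untarred source directory
>   ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>       --prefix=/home/matt/apps/openmpi-X.Y.Z
>   make
>   make install
>   # then add the PATH/MANPATH/LD_LIBRARY_PATH lines to ~/.bashrc,
>   # log out and back in, and check:
>   which mpicc    # should print /home/matt/apps/openmpi-X.Y.Z/bin/mpicc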
>
> ********
>
> B) Nehalem is quite new hardware.
> I don't know if the Ubuntu kernel 2.6.31-16 fully supports all
> of Nehalem's features, particularly hyperthreading and NUMA,
> which are used by MPI programs.
> I am not the right person to give you advice about this.
> I googled but couldn't find clear information about the
> minimal kernel version required to have Nehalem fully supported.
> Some Nehalem owner on the list could come forward and tell us.
>
> **
>
> C) On the top screenshot you sent me: please try it again
> (after you do item A), but type "f" then "j" to show the
> processor that is running each process.
>
> **
>
> D) Also, the screenshot shows 20GB of memory.
> This does not sound like an optimal memory configuration for Nehalem,
> which tends to be 6GB, 12GB, 24GB, or 48GB.
> Did you put together the system or upgrade the memory yourself,
> or did you buy the computer as is?
> However, this should not break MPI anyway.
>
> **
>
> E) Answering your question:
> It is true that mixing different flavors of MPI
> to compile (mpicc) and run (mpiexec) a program would probably
> break right away, regardless of the number of processes.
> However, when it comes to different versions of the
> same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3)
> I am not sure it would break.
> I would guess it may run, but not in a reliable way.
> Problems may appear as you stress the system with more cores, etc.
> But this is just a guess.
>
> **
>
> I hope this helps,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> Matthew MacManes wrote:
>
> Hi Gus,
>
> Interestingly, the results for the connectivity_c test: it works
> fine with -np <8. For -np >8 it works some of the time; other
> times it HANGS. I have got to believe that this is a big clue!
> Also, when it hangs, sometimes I get the message "mpirun was
> unable to cleanly terminate the daemons on the nodes shown
> below". Note that NO nodes are shown below. Once, I got -np 250
> to pass the connectivity test, but I was not able to replicate
> this reliably, so I'm not sure if it was a fluke or what. Here
> is a link to a screenshot of top when connectivity_c is hung
> with -np 14. I see that 2 processes are only at 50% CPU usage...
> Hmmmm: http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink
>
>
> The other tests, ring_c and hello_c, as well as the cxx versions
> of these guys, work with all values of -np.
>
> Using -mca mpi_paffinity_alone 1 I get the same behavior.
> I agree that I should worry about the mismatch between where
> the libraries are installed versus where I am telling my
> programs to look for them. Would this type of mismatch cause
> behavior like what I am seeing, i.e., working with a small
> number of processors but failing with larger? It seems like a
> mismatch would have the same effect regardless of the number of
> processors used. Maybe I am mistaken. Anyway, to address this:
> which mpirun gives me /usr/local/bin/mpirun, so I configure with
> ./configure --with-mpi=/usr/local/bin/mpirun and run with
> /usr/local/bin/mpirun -np X ...
> uname -a gives me: Linux macmanes 2.6.31-16-generic #52-Ubuntu
> SMP Thu Dec 3 22:07:16 UTC 2009 x86_64 GNU/Linux
>
> Matt
>
> On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:
>
> Hi Matthew
>
> Please see comments/answers inline below.
>
> Matthew MacManes wrote:
>
> Hi Gus, Thanks for your ideas.. I have a few questions,
> and will try to answer yours in hopes of solving this!!
>
>
> A simple way to test OpenMPI on your system is to run the
> test programs that come with the OpenMPI source code,
> hello_c.c, connectivity_c.c, and ring_c.c:
> http://www.open-mpi.org/
>
> Get the tarball from the OpenMPI site, gzip and untar it,
> and look for it in the "examples" directory.
> Compile it with /your/path/to/openmpi/bin/mpicc hello_c.c
> Run it with /your/path/to/openmpi/bin/mpiexec -np X a.out
> using X = 2, 4, 8, 16, 32, 64, ...
>
> This will tell if your OpenMPI is functional,
> and if you can run on many Nehalem cores,
> even with oversubscription perhaps.
> It will also set the stage for further investigation of your
> actual programs.
>
>
> Should I worry about setting things like --num-cores or
> --bind-to-cores? This, I think, gets at your questions
> about processor affinity. Am I right? I could not
> exactly figure out the -mca mpi_paffinity_alone stuff...
>
>
> I use the simple-minded -mca mpi_paffinity_alone 1.
> This is probably the easiest way to assign a process to a core.
> There are more complex ways in OpenMPI, but I haven't tried them.
> Indeed, -mca mpi_paffinity_alone 1 does improve performance of
> our programs here.
> There is a chance that without it the 16 virtual cores of
> your Nehalem get confused with more than 3 processes
> (you reported that -np > 3 breaks).
>
> Did you try adding just -mca mpi_paffinity_alone 1 to
> your mpiexec command line?
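> Something like this (a sketch, using one of the example programs):
>
>   mpiexec -mca mpi_paffinity_alone 1 -np 8 connectivity_c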
>
>
> 1. Additional load: nope, nothing else; most of the time
> not even Firefox.
>
>
> Good.
> Turn off firefox, etc, to make it even better.
> Ideally, use runlevel 3, no X, like a computer cluster node,
> but this may not be required.
>
> 2. RAM: no problems apparent when monitoring through
> top. Interestingly, I did wonder about oversubscription,
> so I tried the option --nooversubscription, but this
> gave me an error message.
>
>
> Oversubscription from your program would only happen if
> you asked for more processes than available cores, i.e.,
> -np > 8 (or "virtual" cores, in case of Nehalem hyperthreading,
> -np > 16).
> Since you have -np=4 there is no oversubscription,
> unless you have other external load (e.g. Matlab, etc),
> but you said you don't.
>
> Yet another possibility would be if your program is threaded
> (e.g. using OpenMP along with MPI), but considering what you
> said about OpenMP I would guess the programs don't use it.
> For instance, you launch the program with 4 MPI processes,
> and each process decides to start, say, 8 OpenMP threads.
> You end up with 32 threads and 8 (real) cores (or 16
> hyperthreaded
> ones on Nehalem).
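>
> To make that concrete, a hypothetical hybrid sketch (not from
> either of your programs; it assumes both MPI and OpenMP are
> available and is compiled with, e.g., mpicc -fopenmp):
>
>   /* hybrid.c: total threads = (MPI processes) x (OpenMP threads) */
>   #include <stdio.h>
>   #include <mpi.h>
>   #include <omp.h>
>
>   int main(int argc, char *argv[])
>   {
>       int rank;
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>   #pragma omp parallel                 /* each process forks threads */
>       printf("MPI rank %d, OpenMP thread %d of %d\n", rank,
>              omp_get_thread_num(), omp_get_num_threads());
>
>       MPI_Finalize();
>       return 0;
>   }
>
> Launched with "mpiexec -np 4 ./hybrid" and OMP_NUM_THREADS=8,
> that is 32 threads competing for 8 real cores.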
>
>
> What else does top say?
> Any hog processes (memory- or CPU-wise)
> besides your program processes?
>
> 3. I have not tried other MPI flavors. I've been
> speaking to the authors of the programs, and they are
> both using OpenMPI.
>
> I was not trying to convince you to use another MPI.
> I use MPICH2 also, but OpenMPI reigns here.
> The idea of trying it with MPICH2 was just to check whether
> OpenMPI
> is causing the problem, but I don't think it is.
>
> 4. I don't think that this is a problem, as I'm
> specifying --with-mpi=/usr/bin/... when I compile the
> programs. Is there any other way to be sure that this is
> not a problem?
>
>
> Hmmm ....
> I don't know about your Ubuntu (we have CentOS and Fedora on
> various
> machines).
> However, most Linux distributions come with their MPI flavors,
> and so do compilers, etc.
> Oftentimes they install these goodies in unexpected places,
> and this has caused a lot of frustration.
> There are tons of postings on this list that eventually
> boiled down to mismatched versions of MPI in unexpected places.
>
>
> The easy way is to use full path names to compile and to run.
> Something like this:
> /my/openmpi/bin/mpicc (in your program configuration script),
>
> and something like this:
> /my/openmpi/bin/mpiexec -np ... bla, bla ...
> when you submit the job.
>
> You can check your version with "which mpicc", "which mpiexec",
> and (perhaps using full path names) with
> "ompi_info", "mpicc --showme", "mpiexec --help".
>
>
> 5. I had not been, and you could see some shuffling when
> monitoring the load on specific processors. I have tried
> to use --bind-to-cores to deal with this. I don't
> understand how to use the -mca options you asked about.
> 6. I am using Ubuntu 9.10. gcc 4.4.1 and g++ 4.4.1
>
>
> I am afraid I won't be of help, because I don't have Nehalem.
> However, I read about Nehalem requiring quite recent kernels
> to get all of its features working right.
>
> What is the output of "uname -a"?
> This will tell the kernel version, etc.
> Other list subscribers may give you a suggestion if you post the
> information.
>
> MrBayes is a program for Bayesian phylogenetics:
> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
> ABySS is a program for assembly of DNA sequence data:
> http://www.bcgsc.ca/platform/bioinfo/software/abyss
>
>
> Thanks for the links!
> I had found the MrBayes link.
> I eventually found what your ABySS was about, but no links.
> Amazing that it is about DNA/gene sequencing.
> Our abyss here is the deep ocean ... :)
> Abysmal difference!
>
> Do the programs mix MPI (message passing) with
> OpenMP (threads)?
>
> I'm honestly not sure what this means...
>
>
> Some programs mix the two.
> OpenMP only works in a shared memory environment (e.g. a single
> computer like yours), whereas MPI can use both shared memory
> and work across a network (e.g. in a cluster).
> There are other differences too.
>
> It is unlikely that you have this hybrid type of parallel program;
> otherwise there would be some reference to OpenMP
> in the program configuration files, program
> documentation, etc.
> Also, in general the configuration scripts of these hybrid
> programs can turn on MPI only, or OpenMP only, or both,
> depending on how you configure.
>
> Even to compile with OpenMP you would need a proper compiler
> flag, but that one might be hidden in a Makefile too, making
> it a bit hard to find. "grep -n mp Makefile" may give a clue.
> Anything on the documentation that mentions threads or OpenMP?
>
> FYI, here is OpenMP:
> http://openmp.org/wp/
>
> Thanks for all your help!
>
> Matt
>
> Well, so far it didn't really help. :(
>
> But let's hope to find a clue,
> maybe with a little help of
> our list subscriber friends.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Hi Matthew
>
> More guesses/questions than anything else:
>
> 1) Is there any additional load on this machine?
> We had problems like that (on different machines) when
> users start listening to streaming video, doing
> Matlab calculations,
> etc, while the MPI programs are running.
> This tends to oversubscribe the cores, and may lead
> to crashes.
>
> 2) RAM:
> Can you monitor the RAM usage through "top"?
> (I presume you are on Linux.)
> It may show unexpected memory leaks, if they exist.
>
> On "top", type "1" (one) see all cores, type "f"
> then "j"
> to see the core number associated to each process.
>
> 3) Do the programs work right with other MPI flavors
> (e.g. MPICH2)?
> If they fail there too, then it is not OpenMPI's fault.
>
> 4) Any possibility that the MPI versions/flavors of
> mpicc and
> mpirun that you are using to compile and launch the
> program are not the
> same?
>
> 5) Are you setting processor affinity on mpiexec?
>
> mpiexec -mca mpi_paffinity_alone 1 -np ... bla, bla ...
>
> Context switching across the cores may also cause
> trouble, I suppose.
>
> 6) Which Linux are you using (uname -a)?
>
> On other mailing lists I read reports that only
> quite recent kernels
> support all the Intel Nehalem processor features well.
> I don't have Nehalem, I can't help here,
> but the information may be useful
> for other list subscribers to help you.
>
> ***
>
> As for the programs, some require specific setup
> (and even specific compilation) when the number of MPI processes
> varies.
> It may help if you tell us a link to the program sites.
>
> Bayesian statistics is not totally out of our business,
> but phylogenetic trees are not really my league,
> so please forgive any bad guesses,
> but would it need specific compilation or a different
> set of input parameters to run correctly on a different
> number of processors?
> Do the programs mix MPI (message passing) with
> OpenMP (threads)?
>
> I found this MrBayes, which seems to do the above:
>
> http://mrbayes.csit.fsu.edu/
> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
>
> As for the ABySS, what is it, where can it be found?
> Doesn't look like a deep ocean circulation model, as
> the name suggests.
>
> My $0.02
> Gus Correa
>

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/