
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)
From: Mark Bolstad (the.render.dude_at_[hidden])
Date: 2009-12-10 10:42:49


Just a quick interjection, I also have a dual-quad Nehalem system, HT on,
24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
--enable-mpi-f77=no --with-openib=no

With v1.3.4 I see roughly the same behavior: hello and ring work, but
connectivity fails randomly with np >= 8. Turning on -v increased the success
rate, but it still hangs. np = 16 fails more often, and the pair of processes
that is communicating when it hangs is random.

However, it seems to be related to the shared memory layer problem. Running
with -mca btl ^sm works consistently through np = 128.
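
The comparison can be scripted. This is only a rough sketch: it assumes GNU coreutils "timeout" is installed and that ./connectivity_c has been built from the Open MPI examples directory. The helper names, the 10 trials and the 60-second limit are illustrative choices of mine, not something taken from this thread. A run that exceeds the limit is counted as a hang.

```shell
#!/bin/sh
# Hypothetical helper: build the mpirun command line for one trial.
# $1 = number of processes, $2 = "sm" (default BTLs) or "nosm" (exclude sm).
build_cmd() {
    if [ "$2" = "nosm" ]; then
        # "-mca btl ^sm" is the workaround reported in this message.
        echo "mpirun -np $1 -mca btl ^sm ./connectivity_c"
    else
        echo "mpirun -np $1 ./connectivity_c"
    fi
}

# Run one configuration several times under a time limit and count failures.
# $1 = np, $2 = mode, $3 = number of trials, $4 = time limit in seconds
count_failures() {
    failures=0
    i=0
    while [ "$i" -lt "$3" ]; do
        # timeout exits non-zero (status 124) when the limit is hit, i.e. a hang.
        if ! timeout "$4" $(build_cmd "$1" "$2") >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Only run the trials when mpirun and the test binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./connectivity_c ]; then
    for np in 8 16; do
        for mode in sm nosm; do
            echo "np=$np mode=$mode failures=$(count_failures "$np" "$mode" 10 60)/10"
        done
    done
fi
```

If the "nosm" rows stay at zero while the "sm" rows do not, that points at the shared memory BTL rather than at the hardware.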

Hope this helps.

Mark

On Wed, Dec 9, 2009 at 8:03 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Matthew
>
> Barring any misinterpretation I may have made of the code:
>
> Hello_c has no real communication, except for a final Barrier
> synchronization.
> Each process prints "hello world" and that's it.
>
> Ring probes a little more, with processes Send(ing) and
> Recv(eiving) messages.
> Ring just passes a message sequentially along all process
> ranks, then back to rank 0, and repeats the game 10 times.
> Rank 0 is in charge of counting turns, decrementing the counter,
> and printing that (nobody else prints).
> With 4 processes:
> 0->1->2->3->0->1... 10 times
>
> In connectivity every pair of processes exchanges a message.
> Therefore it probes all pairwise connections.
> In verbose mode you can see that.
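
To see why connectivity can hang where ring does not, it may help to count how many process pairs each test touches. The sketch below is based only on the description above, and the function names are made up for illustration. The ring uses one link per rank, while the all-pairs test uses N(N-1)/2 pairs.

```shell
#!/bin/sh
# Links used by the ring test: each rank sends to its successor,
# and the last rank wraps around to rank 0 (assumes N >= 2).
ring_links() {
    echo "$1"
}

# Pairs probed by the connectivity test: every unordered pair (i, j), i < j.
connectivity_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# List the ring hops for N ranks, one lap around the ring.
ring_hops() {
    n=$1
    r=0
    hops=""
    while [ "$r" -lt "$n" ]; do
        hops="$hops$r->$(( (r + 1) % n )) "
        r=$((r + 1))
    done
    echo "${hops% }"   # drop the trailing space
}

for np in 4 8 16; do
    echo "np=$np ring links=$(ring_links "$np") connectivity pairs=$(connectivity_pairs "$np")"
done
```

With 8 processes the ring exercises 8 links but connectivity exercises 28 pairs, so a fault that affects only some pairs is far more likely to show up in connectivity.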
>
> These programs shouldn't hang at all, if the system were sane.
> Actually, they should even run with a significant level of
> oversubscription, say,
> -np 128 should work easily for all three programs on a powerful
> machine like yours.
>
>
> **
>
> Suggestions
>
> 1) Stick to the OpenMPI you compiled.
>
> **
>
> 2) You can run connectivity_c in verbose mode:
>
> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c -v
>
> (Note the trailing "-v".)
>
> It should tell more about who's talking to who.
>
> **
>
> 3) I wonder if there are any BIOS settings that may be required
> (and perhaps not in place) to make Nehalem hyperthreading
> work properly in your computer.
>
> You reach the BIOS settings by pressing <DEL> or <F2>
> when the computer boots up.
> The key varies by BIOS and computer vendor, but it is shown briefly
> on the bootup screen.
>
> You may ask the computer vendor about the recommended BIOS settings.
> If you haven't done this before, be careful to change and save only
> what really needs to change (if anything really needs to change),
> or the result may be worse.
> (Overclocking is for gamers, not for genome researchers ... :) )
>
> **
>
> 4) What I read about Nehalem DDR3 memory is that it is optimal
> on configurations that are multiples of 3GB per CPU.
> Common configs. in dual CPU machines like yours are
> 6, 12, 24 and 48GB.
> The sockets where you install the memory modules also matter.
>
> Your computer has 20GB.
> Did you build the computer or upgrade the memory yourself?
> Do you know how the memory is installed, in which memory sockets?
> What does the vendor have to say about it?
>
> See this:
>
> http://en.community.dell.com/blogs/dell_tech_center/archive/2009/04/08/nehalem-and-memory-configurations.aspx
>
> **
>
> 5) As I said before, typing "f" then "j" on "top" will add
> a column (labeled "P") that shows in which core each process is running.
> This will let you observe how the Linux scheduler is distributing
> the MPI load across the cores.
> Hopefully it is load-balanced, and different processes go to different
> cores.
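
A non-interactive alternative to the "top" keystrokes, offered only as a sketch for Linux: the kernel reports the processor a process last ran on as field 39 of /proc/PID/stat (see proc(5)). The function names are hypothetical, "pgrep" from procps is assumed to be installed, and connectivity_c is just the example program from this thread.

```shell
#!/bin/sh
# Extract the "processor" field (field 39 of /proc/PID/stat) from one
# stat line. The command name in field 2 may contain spaces, so
# everything up to the last ") " is stripped first.
core_from_stat() {
    rest=${1##*\) }     # now starts at field 3 (process state)
    set -- $rest
    shift 36            # field 3 plus 36 positions is field 39
    echo "$1"
}

# Print "PID core" for every running process whose name is exactly $1.
show_cores() {
    for pid in $(pgrep -x "$1" 2>/dev/null); do
        if [ -r "/proc/$pid/stat" ]; then
            echo "$pid $(core_from_stat "$(cat "/proc/$pid/stat")")"
        fi
    done
}

show_cores connectivity_c
```

Running it a few times while the job is hung shows whether the processes stay put or keep migrating between cores.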
>
> ***
>
> It is very disconcerting when MPI processes hang.
> You are not alone.
> The reasons are not always obvious.
> At least in your case there is no network involved or to troubleshoot.
>
>
> **
>
> I hope it helps,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
>
>
>
> Matthew MacManes wrote:
>
>> Hi Gus and List,
>>
>> 1st of all Gus, I want to say thanks.. you have been a huge help, and when
>> I get this fixed, I owe you big time!
>>
>> However, the problems continue...
>>
>> I formatted the HD, reinstalled OS to make sure that I was working from
>> scratch. I did your step A, which seemed to go fine:
>>
>> macmanes_at_macmanes:~$ which mpicc
>> /home/macmanes/apps/openmpi1.4/bin/mpicc
>> macmanes_at_macmanes:~$ which mpirun
>> /home/macmanes/apps/openmpi1.4/bin/mpirun
>>
>> Good stuff there...
>>
>> I then compiled the example files:
>>
>> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$
>> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 ring_c
>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 1 exiting
>> Process 2 exiting
>> Process 3 exiting
>> Process 4 exiting
>> Process 5 exiting
>> Process 6 exiting
>> Process 7 exiting
>> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$
>> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
>> Connectivity test on 8 processes PASSED.
>> macmanes_at_macmanes:~/Downloads/openmpi-1.4/examples$
>> /home/macmanes/apps/openmpi1.4/bin/mpirun -np 8 connectivity_c
>> ..HANGS..NO OUTPUT
>>
>> this is maddening because ring_c works.. and connectivity_c worked the 1st
>> time, but not the second... I did it 10 times, and it worked twice.. here is
>> the TOP screenshot:
>>
>>
>> http://picasaweb.google.com/macmanes/DropBox?authkey=Gv1sRgCLKokNOVqo7BYw#5413382182027669394
>>
>> What is the difference between connectivity_c and ring_c? Under what
>> circumstances should one fail and not the other...
>>
>> I'm off to the Linux forums to see about the Nehalem kernel issues..
>>
>> Matt
>>
>>
>>
>> On Wed, Dec 9, 2009 at 13:25, Gus Correa <gus_at_[hidden]> wrote:
>>
>> Hi Matthew
>>
>> There is no point in trying to troubleshoot MrBayes and ABySS
>> if not even the OpenMPI test programs run properly.
>> You must straighten them out first.
>>
>> **
>>
>> Suggestions:
>>
>> **
>>
>> A) While you are at OpenMPI, do yourself a favor,
>> and install it from source on a separate directory.
>> Who knows if the OpenMPI package distributed with Ubuntu
>> works right on Nehalem?
>> Better install OpenMPI yourself from source code.
>> It is not a big deal, and may save you further trouble.
>>
>> Recipe:
>>
>> 1) Install gfortran and g++ if you don't have them using apt-get.
>> 2) Put the OpenMPI tarball in, say /home/matt/downloads/openmpi
>> 3) Make another install directory *not in the system directory tree*.
>> Something like "mkdir /home/matt/apps/openmpi-X.Y.Z/" (X.Y.Z=version)
>> will work
>> 4) cd /home/matt/downloads/openmpi
>> 5) ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran \
>> --prefix=/home/matt/apps/openmpi-X.Y.Z
>> (Use the prefix flag to install in the directory of item 3.)
>> 6) make
>> 7) make install
>> 8) At the bottom of your /home/matt/.bashrc or .profile file
>> put these lines:
>>
>> export PATH=/home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
>> export MANPATH=/home/matt/apps/openmpi-X.Y.Z/share/man:`man -w`
>> export LD_LIBRARY_PATH=/home/matt/apps/openmpi-X.Y.Z/lib:${LD_LIBRARY_PATH}
>>
>> (If you use csh/tcsh use instead:
>> setenv PATH /home/matt/apps/openmpi-X.Y.Z/bin:${PATH}
>> etc)
>>
>> 9) Log out and log in again to freshen up the environment variables.
>> 10) Do "which mpicc" to check that it is pointing to your newly
>> installed OpenMPI.
>> 11) Recompile and rerun the OpenMPI test programs
>> with 2, 4, 8, 16, .... processors.
>> Use full path names to mpicc and to mpirun,
>> if the change of PATH above doesn't work right.
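
Steps 3 to 7 of the recipe could be collected in a small script along these lines. Treat it as a sketch: VERSION, SRC_DIR and PREFIX are placeholders to adapt, and the "run" wrapper with its DRY_RUN switch is my own addition so that the commands can be reviewed before anything is built.

```shell
#!/bin/sh
# Placeholders: adjust to the real version and directories.
VERSION=${VERSION:-X.Y.Z}
SRC_DIR=${SRC_DIR:-$HOME/downloads/openmpi}
PREFIX=${PREFIX:-$HOME/apps/openmpi-$VERSION}
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 3-7: create the install directory, configure with the prefix,
# build and install. The chain stops at the first failing step.
build_openmpi() {
    run mkdir -p "$PREFIX" &&
    run cd "$SRC_DIR" &&
    run ./configure CC=gcc CXX=g++ F77=gfortran FC=gfortran --prefix="$PREFIX" &&
    run make &&
    run make install
}

build_openmpi
```

Once the printed commands look right, running the script again with DRY_RUN=0 performs the actual build.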
>>
>> ********
>>
>> B) Nehalem is quite new hardware.
>> I don't know if the Ubuntu kernel 2.6.31-16 fully supports all
>> of Nehalem's features, particularly hyperthreading and NUMA,
>> which are used by MPI programs.
>> I am not the right person to give you advice about this.
>> I googled but couldn't find clear information about the
>> minimum kernel version required to have Nehalem fully supported.
>> Some Nehalem owner on the list could come forward and tell.
>>
>> **
>>
>> C) On the top screenshot you sent me, please try it again
>> (after you do item A) but type "f" and "j" to show the processors
>> that are running each process.
>>
>> **
>>
>> D) Also, the screenshot shows 20GB of memory.
>> This does not sound like an optimal memory size for Nehalem;
>> optimal sizes tend to be 6GB, 12GB, 24GB, 48GB.
>> Did you put together the system or upgrade the memory yourself,
>> or did you buy the computer as is?
>> However, this should not break MPI anyway.
>>
>> **
>>
>> E) Answering your question:
>> It is true that different flavors of MPI
>> used to compile (mpicc) and run (mpiexec) a program would probably
>> break right away, regardless of the number of processes.
>> However, when it comes to different versions of the
>> same MPI flavor (say OpenMPI 1.3.4 and OpenMPI 1.3.3)
>> I am not sure it will break.
>> I would guess it may run but not in a reliable way.
>> Problems may appear as you stress the system with more cores, etc.
>> But this is just a guess.
>>
>> **
>>
>> I hope this helps,
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> Matthew MacManes wrote:
>>
>> Hi Gus,
>>
>> Interestingly the results for the connectivity_c test... works
>> fine with -np <8. For -np >8 it works some of the time, other
>> times it HANGS. I have got to believe that this is a big clue!!
>> Also, when it hangs, sometimes I get the message "mpirun was
>> unable to cleanly terminate the daemons on the nodes shown
>> below" Note that NO nodes are shown below. Once, I got -np 250
>> to pass the connectivity test, but I was not able to replicate
>> this reliably, so I'm not sure if it was a fluke, or what. Here
>> is a link to a screenshot of TOP when connectivity_c is hung
>> with -np 14.. I see that 2 processes are only at 50% CPU usage..
>> Hmmmm
>> http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink
>>
>>
>> The other tests, ring_c, hello_c, as well as the cxx versions of
>> these guys, work with all values of -np.
>>
>> Using -mca mpi_paffinity_alone 1 I get the same behavior.
>> I agree that I should worry about the mismatch between where
>> the libraries are installed versus where I am telling my
>> programs to look for them. Would this type of mismatch cause
>> behavior like what I am seeing, i.e. working with a small
>> number of processors, but failing with larger? It seems like a
>> mismatch would have the same effect regardless of the number of
>> processors used. Maybe I am mistaken. Anyway, to address this,
>> which mpirun gives me /usr/local/bin/mpirun.. so to configure
>> ./configure --with-mpi=/usr/local/bin/mpirun and to run
>> /usr/local/bin/mpirun -np X ... This should
>> uname -a gives me: Linux macmanes 2.6.31-16-generic #52-Ubuntu
>> SMP Thu Dec 3 22:07:16 UTC 2006 x86_64 GNU/Linux
>>
>> Matt
>>
>> On Dec 8, 2009, at 8:50 PM, Gus Correa wrote:
>>
>> Hi Matthew
>>
>> Please see comments/answers inline below.
>>
>> Matthew MacManes wrote:
>>
>> Hi Gus, Thanks for your ideas.. I have a few questions,
>> and will try to answer yours in hopes of solving this!!
>>
>>
>> A simple way to test OpenMPI on your system is to run the
>> test programs that come with the OpenMPI source code,
>> hello_c.c, connectivity_c.c, and ring_c.c:
>> http://www.open-mpi.org/
>>
>> Get the tarball from the OpenMPI site, gunzip and untar it,
>> and look for them in the "examples" directory.
>> Compile it with /your/path/to/openmpi/bin/mpicc hello_c.c
>> Run it with /your/path/to/openmpi/bin/mpiexec -np X a.out
>> using X = 2, 4, 8, 16, 32, 64, ...
>>
>> This will tell if your OpenMPI is functional,
>> and if you can run on many Nehalem cores,
>> even with oversubscription perhaps.
>> It will also set the stage for further investigation of your
>> actual programs.
>>
>>
>> Should I worry about setting things like --num-cores
>> --bind-to-cores? This, I think, gets at your questions
>> about processor affinity.. Am I right? I could not
>> exactly figure out the -mca mpi_paffinity_alone stuff...
>>
>>
>> I use the simple-minded -mca mpi_paffinity_alone 1.
>> This is probably the easiest way to assign a process to a core.
>> There are more complex ways in OpenMPI, but I haven't tried them.
>> Indeed, -mca mpi_paffinity_alone 1 does improve performance of
>> our programs here.
>> There is a chance that without it the 16 virtual cores of
>> your Nehalem get confused with more than 3 processes
>> (you reported that -np > 3 breaks).
>>
>> Did you try adding just -mca mpi_paffinity_alone 1 to
>> your mpiexec command line?
>>
>>
>> 1. Additional load: nope. nothing else, most of the time
>> not even firefox.
>>
>>
>> Good.
>> Turn off firefox, etc, to make it even better.
>> Ideally, use runlevel 3, no X, like a computer cluster node,
>> but this may not be required.
>>
>> 2. RAM: no problems apparent when monitoring through
>> TOP. Interesting, I did wonder about oversubscription,
>> so I tried the option --nooversubscription, but this
>> gave me an error message.
>>
>>
>> Oversubscription from your program would only happen if
>> you asked for more processes than available cores, i.e.,
>> -np > 8 (or "virtual" cores, in case of Nehalem hyperthreading,
>> -np > 16).
>> Since you have -np=4 there is no oversubscription,
>> unless you have other external load (e.g. Matlab, etc),
>> but you said you don't.
>>
>> Yet another possibility would be if your program is threaded
>> (e.g. using OpenMP along with MPI), but considering what you
>> said about OpenMP I would guess the programs don't use it.
>> For instance, you launch the program with 4 MPI processes,
>> and each process decides to start, say, 8 OpenMP threads.
>> You end up with 32 threads and 8 (real) cores (or 16 hyperthreaded
>> ones on Nehalem).
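
The arithmetic in the hybrid example can be checked with a few lines of shell. This is just an illustration with a made-up function name: total threads are MPI processes times threads per process, compared against the core count.

```shell
#!/bin/sh
# $1 = MPI processes, $2 = threads per process, $3 = available cores
check_oversubscription() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "$total threads on $3 cores: oversubscribed"
    else
        echo "$total threads on $3 cores: ok"
    fi
}

check_oversubscription 4 8 8    # hybrid example: 4 processes x 8 threads, real cores
check_oversubscription 4 8 16   # same job counted against hyperthreaded cores
check_oversubscription 4 1 8    # plain MPI with -np 4
```

The first two cases are oversubscribed either way, while plain -np 4 is not, which matches the reasoning above.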
>>
>>
>> What else does top say?
>> Any hog processes (memory- or CPU-wise)
>> besides your program processes?
>>
>> 3. I have not tried other MPI flavors.. I've been
>> speaking to the authors of the programs, and they are
>> both using openMPI.
>>
>> I was not trying to convince you to use another MPI.
>> I use MPICH2 also, but OpenMPI reigns here.
>> The idea of trying it with MPICH2 was just to check whether
>> OpenMPI is causing the problem, but I don't think it is.
>>
>> 4. I don't think that this is a problem, as I'm
>> specifying --with-mpi=/usr/bin/... when I compile the
>> programs. Is there any other way to be sure that this is
>> not a problem?
>>
>>
>> Hmmm ....
>> I don't know about your Ubuntu (we have CentOS and Fedora on
>> various machines).
>> However, most Linux distributions come with their MPI flavors,
>> and so do compilers, etc.
>> Often times they install these goodies in unexpected places,
>> and this has caused a lot of frustration.
>> There are tons of postings on this list that eventually
>> boiled down to mismatched versions of MPI in unexpected places.
>>
>>
>> The easy way is to use full path names to compile and to run.
>> Something like this:
>> /my/openmpi/bin/mpicc (in your program configuration script),
>>
>> and something like this
>> /my/openmpi/bin/mpiexec -np ... bla, bla ...
>> when you submit the job.
>>
>> You can check your version with "which mpicc", "which mpiexec",
>> and (perhaps using full path names) with
>> "ompi_info", "mpicc --showme", "mpiexec --help".
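
A quick way to automate that check, as a sketch only: compare the directories that mpicc and mpirun resolve to in the current PATH. The function names are invented for this example, and a distribution that installs the wrappers through symlinks could make two matching installs look different, so take a warning as a hint rather than proof.

```shell
#!/bin/sh
# Succeeds if both paths live in the same directory.
same_bin_dir() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

# Show where mpicc and mpirun come from and whether the directories match.
report_mpi_paths() {
    cc_path=$(command -v mpicc) || { echo "mpicc not found in PATH"; return 0; }
    run_path=$(command -v mpirun) || { echo "mpirun not found in PATH"; return 0; }
    echo "mpicc:  $cc_path"
    echo "mpirun: $run_path"
    if same_bin_dir "$cc_path" "$run_path"; then
        echo "same installation directory"
    else
        echo "WARNING: mpicc and mpirun come from different directories"
    fi
}

report_mpi_paths
```
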
>>
>>
>> 5. I had not been, and you could see some shuffling when
>> monitoring the load on specific processors. I have tried
>> to use --bind-to-cores to deal with this. I don't
>> understand how to use the -mca options you asked about.
>> 6. I am using Ubuntu 9.10. gcc 4.4.1 and g++ 4.4.1
>>
>>
>> I am afraid I won't be of help, because I don't have Nehalem.
>> However, I read about Nehalem requiring quite recent kernels
>> to get all of its features working right.
>>
>> What is the output of "uname -a"?
>> This will tell the kernel version, etc.
>> Other list subscribers may give you a suggestion if you post
>> the information.
>>
>> MrBayes is a program for Bayesian phylogenetics:
>> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
>> ABySS is a program for assembly of DNA sequence data:
>> http://www.bcgsc.ca/platform/bioinfo/software/abyss
>>
>>
>> Thanks for the links!
>> I had found the MrBayes link.
>> I eventually found what your ABySS was about, but no links.
>> Amazing that it is about DNA/gene sequencing.
>> Our abyss here is the deep ocean ... :)
>> Abysmal difference!
>>
>> Do the programs mix MPI (message passing) with
>> OpenMP (threads)?
>>
>> I'm honestly not sure what this means..
>>
>>
>> Some programs mix the two.
>> OpenMP only works in a shared memory environment (e.g. a single
>> computer like yours), whereas MPI can use both shared memory
>> and work across a network (e.g. in a cluster).
>> There are other differences too.
>>
>> Unlikely that you have this hybrid type of parallel program,
>> otherwise there would be some reference to OpenMP
>> on the very program configuration files, program
>> documentation, etc.
>> Also, in general the configuration scripts of these hybrid
>> programs can turn on MPI only, or OpenMP only, or both,
>> depending on how you configure.
>>
>> Even to compile with OpenMP you would need a proper compiler
>> flag, but that one might be hidden in a Makefile too, making
>> it a bit hard to find. "grep -n mp Makefile" may give a clue.
>> Anything in the documentation that mentions threads or OpenMP?
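
A slightly narrower search than "grep -n mp" is to look for the compiler flag itself. The sketch below assumes GCC, whose OpenMP flag is -fopenmp; other compilers spell it differently, and the function name is only illustrative.

```shell
#!/bin/sh
# Print the Makefile lines that mention GCC's OpenMP flag.
# Returns non-zero when the flag does not appear.
uses_openmp() {
    grep -n -e '-fopenmp' "$1"
}

if [ -f Makefile ]; then
    uses_openmp Makefile || echo "no -fopenmp flag found in Makefile"
fi
```
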
>>
>> FYI, here is OpenMP:
>> http://openmp.org/wp/
>>
>> Thanks for all your help!
>>
>> Matt
>>
>> Well, so far it didn't really help. :(
>>
>> But let's hope to find a clue,
>> maybe with a little help of
>> our list subscriber friends.
>>
>> Gus Correa
>>
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>>
>> ---------------------------------------------------------------------
>>
>> Hi Matthew
>>
>> More guesses/questions than anything else:
>>
>> 1) Is there any additional load on this machine?
>> We had problems like that (on different machines) when
>> users start listening to streaming video, doing Matlab
>> calculations, etc, while the MPI programs are running.
>> This tends to oversubscribe the cores, and may lead to crashes.
>>
>> 2) RAM:
>> Can you monitor the RAM usage through "top"?
>> (I presume you are on Linux.)
>> It may show unexpected memory leaks, if they exist.
>>
>> On "top", type "1" (one) to see all cores, then type "f" and "j"
>> to see the core number associated with each process.
>>
>> 3) Do the programs work right with other MPI flavors
>> (e.g. MPICH2)?
>> If not, then it is not OpenMPI's fault.
>>
>> 4) Any possibility that the MPI versions/flavors of mpicc and
>> mpirun that you are using to compile and launch the program
>> are not the same?
>>
>> 5) Are you setting processor affinity on mpiexec?
>>
>> mpiexec -mca mpi_paffinity_alone 1 -np ... bla, bla ...
>>
>> Context switching across the cores may also cause
>> trouble, I suppose.
>>
>> 6) Which Linux are you using (uname -a)?
>>
>> On other mailing lists I read reports that only quite recent
>> kernels support all the Intel Nehalem processor features well.
>> I don't have Nehalem, so I can't help here,
>> but the information may be useful
>> for other list subscribers to help you.
>>
>> ***
>>
>> As for the programs, some programs require specific setup
>> (and even specific compilation) when the number of
>> MPI processes varies.
>> It may help if you send us a link to the program sites.
>>
>> Bayesian statistics is not totally out of our business,
>> but phylogenetic trees are not really my league,
>> hence forgive me any bad guesses, please,
>> but would it need specific compilation or a different
>> set of input parameters to run correctly on a different
>> number of processors?
>> Do the programs mix MPI (message passing) with
>> OpenMP (threads)?
>>
>> I found this MrBayes, which seems to do the above:
>>
>> http://mrbayes.csit.fsu.edu/
>> http://mrbayes.csit.fsu.edu/wiki/index.php/Main_Page
>>
>> As for ABySS, what is it, and where can it be found?
>> It doesn't look like a deep ocean circulation model, as
>> the name suggests.
>>
>> My $0.02
>> Gus Correa
>>
>>
>> ------------------------------------------------------------------------
>> _______________________________________________
>> users mailing list
>> users_at_[hidden] <mailto:users_at_[hidden]>
>>
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>>
>> _________________________________
>> Matthew MacManes
>> PhD Candidate
>> University of California- Berkeley
>> Museum of Vertebrate Zoology
>> Phone: 510-495-5833
>> Lab Website: http://ib.berkeley.edu/labs/lacey
>> Personal Website: http://macmanes.com/
>>
>>
>
>