Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openMPI shared with NFS, but says different version
From: Gus Correa (gus_at_[hidden])
Date: 2010-08-10 17:57:56


Thank you, Cristobal.
That is good news.

Gus Correa

Cristobal Navarro wrote:
> I have good news.
>
> After updating to a newer kernel on the Ubuntu server nodes, sm is no
> longer a problem for the Nehalem CPUs!!!
> My older kernel was
> Linux 2.6.32-22-server #36-Ubuntu SMP Thu Jun 3 20:38:33 UTC 2010 x86_64
> GNU/Linux
>
> and I upgraded to
> Linux agua 2.6.32-24-server #39-Ubuntu SMP Wed Jul 28 06:21:40 UTC 2010
> x86_64 GNU/Linux
>
> That solved everything.
> Gus, maybe the problem you had with Fedora can be solved in a similar way.
>
> We should keep this for the record.
>
> regards
> Cristobal
>
>
>
>
>
>
> On Wed, Jul 28, 2010 at 6:45 PM, Gus Correa <gus_at_[hidden]> wrote:
>
> Cristobal Navarro wrote:
>
> Gus
> my kernel for all nodes is this one:
> Linux 2.6.32-22-server #36-Ubuntu SMP Thu Jun 3 20:38:33 UTC
> 2010 x86_64 GNU/Linux
>
>
> Kernel is not my league.
>
> However, it would be great if somebody clarified
> for good these issues with Nehalem/Westmere, HT,
> shared memory and what the kernel is doing,
> or how to make the kernel do the right thing.
> Maybe Intel could tell.
>
>
> For the moment I will use this configuration, at least
> for development/testing of the parallel programs.
> The lag is minimal :)
>
> Whenever I get another kernel update, I will test again to check
> if sm works; it would be good to know if another distribution
> suddenly supports sm on Nehalem.
>
> best regards and thanks again
> Cristobal
> ps: guess what are the names of the other 2 nodes lol
>
>
> Acatenango (I said that before), and Pacaya.
>
> Maybe: Santa Maria, Santiaguito, Atitlan, Toliman, San Pedro,
> Cerro de Oro ... too many volcanoes, and some are multithreaded ...
> You need to buy more nodes!
>
> Gus
>
>
>
>
> On Wed, Jul 28, 2010 at 5:50 PM, Gus Correa
> <gus_at_[hidden]> wrote:
>
> Hi Cristobal
>
> Please, read my answer (way down the message) below.
>
> Cristobal Navarro wrote:
>
>
>
> On Wed, Jul 28, 2010 at 3:28 PM, Gus Correa
> <gus_at_[hidden]> wrote:
>
> Hi Cristobal
>
> Cristobal Navarro wrote:
>
>
>
> On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa
> <gus_at_[hidden]> wrote:
>
> Hi Cristobal
>
> In case you are not using full path name for
> mpiexec/mpirun,
> what does "which mpirun" say?
>
>
> --> $which mpirun
> /opt/openmpi-1.4.2
>
>
> Oftentimes this is a source of confusion; old versions may be
> first on the PATH.
>
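> (As a quick sanity check, a sketch of what to run on each node,
> assuming the /opt/openmpi-1.4.2 prefix used elsewhere in this
> thread:
>
> $ which mpirun
> $ ompi_info | grep "Open MPI:"
>
> Both should point at, and report, the same installation on every
> node.)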
> Gus
>
>
> The OpenMPI version problem is now gone; I can confirm that the
> version is consistent now :). Thanks.
>
>
> This is good news.
>
>
> However, I keep getting this kernel crash randomly when I execute
> with -np higher than 5.
> These are Xeons, with Hyperthreading on; is that a problem??
>
>
> The problem may be with Hyperthreading, maybe not.
> Which Xeons?
>
>
> --> they are not so old, not so new either
> fcluster_at_agua:~$ cat /proc/cpuinfo | more
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 26
> model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
> stepping : 5
> cpu MHz : 1596.000
> cache size : 8192 KB
> physical id : 0
> siblings : 8
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 11
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
> xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est
> tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida
> tpr_shadow vnmi flexpriority ept vpid
> bogomips : 4522.21
> clflush size : 64
> cache_alignment : 64
> address sizes : 40 bits physical, 48 bits virtual
> power management:
> ...same for cpu1, 2, 3, ..., 15.
>
>
> AHA! Nehalems!
>
> Here they are E5540, just a different clock speed, I suppose.
>
>
> Information on how the CPU is distributed:
> fcluster_at_agua:~$ lstopo
> System(7992MB)
>   Socket#0 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#0
>       P#8
>     L2(256KB) + L1(32KB) + Core#1
>       P#2
>       P#10
>     L2(256KB) + L1(32KB) + Core#2
>       P#4
>       P#12
>     L2(256KB) + L1(32KB) + Core#3
>       P#6
>       P#14
>   Socket#1 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#1
>       P#9
>     L2(256KB) + L1(32KB) + Core#1
>       P#3
>       P#11
>     L2(256KB) + L1(32KB) + Core#2
>       P#5
>       P#13
>     L2(256KB) + L1(32KB) + Core#3
>       P#7
>       P#15
>
>
>
> If I remember right, the old hyperthreading
> on old Xeons was
> problematic.
>
> OTOH, about 1-2 months ago I had trouble with OpenMPI on a
> relatively new Xeon Nehalem machine with (the new)
> Hyperthreading
> turned on,
> and Fedora Core 13.
> The machine would hang with the OpenMPI connectivity
> example.
> I reported this to the list; you may find it in the archives.
>
>
> --> I found the archives about an hour ago; I was not sure if it
> was the same problem, but I disabled HT for testing by setting
> the online flag to 0 on the extra CPUs shown by lstopo.
> Unfortunately it also crashes, so HT may not be the problem.
>
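> (For reference, the usual Linux mechanism for taking a logical CPU
> offline, for example CPU 8, one of the extra hardware threads shown
> by lstopo, is, as root:
>
> # echo 0 > /sys/devices/system/cpu/cpu8/online
>
> Echoing 1 brings it back online; cpu8 here is just an example.)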
>
> It didn't fix the problem in our Nehalem machine here either,
> although it was FC13, and I don't know what OS and kernel
> you're using.
>
>
> Apparently other people got everything (OpenMPI with HT on
> Nehalem)
> working in more stable distributions (CentOS, RHEL, etc).
>
> That problem was likely to be in the FC13 kernel,
> because even turning off HT I still had the machine
> hanging.
> Nothing worked with shared memory turned on,
> so I had to switch OpenMPI to use tcp instead,
> which is kind of ridiculous in a standalone machine.
>
>
> --> very interesting, sm can be the problem
>
>
>
> I'm trying to locate the kernel error in the logs, but after
> rebooting from a crash the error is not in kern.log (nor in
> kern.log.1).
> All I remember is that it starts with "Kernel BUG..." and at
> some point it mentions a certain CPU X, where that CPU can be
> any from 0 to 15 (I'm testing only on the main node).
> Does someone know where the log of the kernel error could be?
>
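> (A note on where to look: if the machine hard-locks, the oops often
> never makes it to disk. Running
>
> $ dmesg | tail -n 50
>
> right after a non-fatal crash, or capturing the console output over
> a serial line or netconsole for hard hangs, is usually the only way
> to catch it.)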
>
> Have you tried to turn off hyperthreading?
>
>
> --> yes, tried, same crashes.
> In any case, depending on the application, having HT on may not
> help performance much.
>
> A more radical alternative is to try
> -mca btl tcp,self
> in the mpirun command line.
> That is what worked in the case I mentioned above.
>
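> (As a sketch, the full command line might look like:
>
> $ mpirun -np 8 --hostfile myhosts -mca btl tcp,self ./connectivity_c
>
> where "myhosts" and "./connectivity_c" are placeholder names. This
> disables the "sm" shared-memory BTL, so even ranks on the same node
> talk over TCP.)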
>
> Wow, this really worked :). You pointed out the problem: it was
> shared memory.
>
>
> Great news!
> That's exactly the problem we had here.
> Glad that the same solution worked for you.
>
> Over a year ago another fellow reported the same problem on
> Nehalem, in the very early days of Nehalem.
> The thread should be in the archives.
> Somebody back then (Ralph, or Jeff, or other?)
> suggested that turning off "sm" might work.
> So, I take no credit for this.
>
>
> I have 4 nodes, so there will be inter-node communication anyway.
> Do you think I can rely on working with -mca btl tcp,self?? I
> don't mind a small lag.
>
>
> Well, this may be it, short of reinstalling the OS.
>
> Some people reported everything works with OpenMPI+HT+sm in
> CentOS
> and RHEL, see the thread I mentioned in the archives from 1-2
> months
> ago.
> I don't administer that machine, and didn't have the time to
> do OS
> reinstall either.
> So I left it with -mca btl tcp,self, and the user/machine owner
> is happy that he can run his programs right,
> and with a performance that he considers good.
>
>
> I just have one more question: is this a problem with the Ubuntu
> server kernel?? With the Nehalem CPUs?? With OpenMPI (I don't
> think so)??
>
>
> I don't have any idea.
> It may be a problem with some kernels, not sure.
> Which kernel do you have?
>
> Ours was FC-13, maybe FC-12, I don't remember exactly.
> Currently that machine has kernel 2.6.33.6-147.fc13.x86_64 #1
> SMP.
> However, it may have been a slightly older kernel when I
> installed
> OpenMPI there.
> It may have been 2.6.33.5-124.fc13.x86_64 or
> 2.6.32.14-127.fc12.x86_64.
> My colleague here updates the machines with yum,
> so it may have gotten a new kernel since then.
>
> Our workhorse machines in the clusters that I take care
> of are AMD Opteron; we never had this problem there.
> Maybe the kernels have yet to catch up with Nehalem,
> now Westmere, soon another one.
>
>
> And what would it take for sm to become possible in the future on
> the same configuration I have?? A kernel update?
>
>
> You may want to try CentOS or RHEL, but I can't guarantee the
> results.
> Somebody else in the list may have had the direct experience,
> and may speak out.
>
> It may be worth the effort anyway.
> After all, intra-node communication should be
> running on shared memory.
> Having to turn it off is outrageous.
>
> If you try another OS distribution,
> and if it works, please report the results back to the list:
> OS/distro, kernel, OpenMPI version, HT on or off,
> mca btl sm/tcp/self/etc choices, compilers, etc.
> This type of information is a real time saver for everybody.
>
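> (A sketch of commands to collect that information on an Ubuntu
> node; lsb_release is assumed to be installed, which it usually is:
>
> $ uname -a
> $ lsb_release -d
> $ ompi_info | grep "Open MPI:"
> $ ompi_info | grep compiler
> )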
>
>
> Thanks very much Gus, really!
> Cristobal
>
>
>
> My pleasure.
> Glad that there was a solution, even though not the best.
> Enjoy your cluster with volcano-named nodes!
> Have fun with OpenMPI and PETSc!
>
> Gus Correa
>
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
>
> ---------------------------------------------------------------------
>
>
> My $0.02
> Gus Correa
>
>
> Cristobal Navarro wrote:
>
>
> On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa
> <gus_at_[hidden]> wrote:
>
> Hi Cristobal
>
> Does it run on the head node alone?
> (Fuego? Agua? Acatenango?)
> Try to put only the head node in the hostfile
> and execute with mpiexec.
>
> --> I will try only with the head node, and post results back
> This may help sort out what is going on.
> Hopefully it will run on the head node.
>
> Also, do you have Infiniband connecting the nodes?
> The error messages refer to the openib btl (i.e.
> Infiniband),
> and complain of
>
>
> No, we are just using a normal 100 Mbit/s network, since I am
> just testing for now.
>
>
> "perhaps a missing symbol, or compiled for a
> different
> version of Open MPI?".
> It sounds as a mixup of versions/builds.
>
>
> --> I agree, somewhere there must be the remains of the older
> version
>
> Did you configure/build OpenMPI from source, or did you install
> it with apt-get?
> It may be easier/less confusing to install from source.
> If you did, what configure options did you use?
>
>
> --> I installed from source: ./configure
> --prefix=/opt/openmpi-1.4.2 --with-sge --without-xgrid
> --disable-static
>
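> (One way to check that such a build actually picked up SGE support,
> as a sketch, is something like:
>
> $ ompi_info | grep gridengine
>
> which should list the gridengine components if --with-sge took
> effect.)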
> Also, as for the OpenMPI runtime environment, it is not enough to
> set it on the command line, because it will be effective only on
> the head node.
> You need to either add them to the PATH and LD_LIBRARY_PATH in
> your .bashrc/.cshrc files (assuming these files and your home
> directory are *also* shared with the nodes via NFS),
> or use the --prefix option of mpiexec to point to the OpenMPI
> main directory.
>
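> (A sketch of what that would look like, reusing the
> /opt/openmpi-1.4.2 prefix from above; the lines go in each user's
> .bashrc on an NFS-shared home:
>
> export PATH=/opt/openmpi-1.4.2/bin:$PATH
> export LD_LIBRARY_PATH=/opt/openmpi-1.4.2/lib:$LD_LIBRARY_PATH
>
> Or, instead, pass the prefix on the command line, with the hostfile
> and program names as placeholders:
>
> $ mpirun --prefix /opt/openmpi-1.4.2 -np 4 --hostfile myhosts ./a.out
> )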
>
> Yes, all nodes have their PATH and LD_LIBRARY_PATH set up
> properly inside the login scripts (.bashrc in my case).
>
> Needless to say, you need to check and ensure that the OpenMPI
> directory (and maybe your home directory and your work directory)
> is (are) really mounted on the nodes.
>
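> (For example, something along the lines of
>
> $ ssh node01 ls /opt/openmpi-1.4.2/bin/mpirun
> $ ssh node01 df -h /opt /home
>
> run against each compute node, with "node01" a placeholder
> hostname.)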
>
> --> yes, double-checked that they are
>
> I hope this helps,
>
>
> --> thanks really!
>
> Gus Correa
>
> Update: I just reinstalled OpenMPI with the same parameters,
> and it seems that the problem is gone. I couldn't test it
> entirely, but when I get back to the lab I'll confirm.
>
> best regards! Cristobal
>