
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openMPI shared with NFS, but says different version
From: Cristobal Navarro (axischire_at_[hidden])
Date: 2010-07-28 15:59:03


On Wed, Jul 28, 2010 at 3:28 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Cristobal
>
> Cristobal Navarro wrote:
>
>>
>>
>> On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa <gus_at_[hidden]> wrote:
>>
>> Hi Cristobal
>>
>> In case you are not using full path name for mpiexec/mpirun,
>> what does "which mpirun" say?
>>
>>
>> --> $ which mpirun
>> /opt/openmpi-1.4.2
>>
>>
>> Oftentimes this is a source of confusion; old versions may
>> be first on the PATH.
>>
>> Gus
>>
>>
>> The openMPI version problem is now gone; I can confirm that the version is
>> consistent now :), thanks.
>>
>>
> This is good news.
>
>
>> however, I keep getting this kernel crash randomly when I execute with
>> -np higher than 5.
>> These are Xeons, with Hyperthreading on; is that a problem??
>>
>>
> The problem may be with Hyperthreading, maybe not.
> Which Xeons?
>

--> They are not so old, not so new either:
fcluster_at_agua:~$ cat /proc/cpuinfo | more
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping : 5
cpu MHz : 1596.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf
pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2
popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips : 4522.21
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
...same for cpu1, 2, 3, ..., 15.
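(As an aside, the same information can be pulled out without paging through the whole file; these are standard procfs queries, and the exact count printed depends on the machine -- on the box above it would be 16.)

```shell
# Count logical CPUs and check whether the kernel reports the "ht" flag.
grep -c '^processor' /proc/cpuinfo
grep -q ' ht ' /proc/cpuinfo && echo "ht flag present" || echo "ht flag absent"
```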

Information on how the CPUs are distributed:
fcluster_at_agua:~$ lstopo
System(7992MB)
  Socket#0 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#0
      P#8
    L2(256KB) + L1(32KB) + Core#1
      P#2
      P#10
    L2(256KB) + L1(32KB) + Core#2
      P#4
      P#12
    L2(256KB) + L1(32KB) + Core#3
      P#6
      P#14
  Socket#1 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#1
      P#9
    L2(256KB) + L1(32KB) + Core#1
      P#3
      P#11
    L2(256KB) + L1(32KB) + Core#2
      P#5
      P#13
    L2(256KB) + L1(32KB) + Core#3
      P#7
      P#15

> If I remember right, the old hyperthreading on old Xeons was problematic.
>
> OTOH, about 1-2 months ago I had trouble with OpenMPI on a relatively new
> Xeon Nehalem machine with (the new) Hyperthreading turned on,
> and Fedora Core 13.
> The machine would hang with the OpenMPI connectivity example.
> I reported this to the list, you may find in the archives.
>

--> I found the archives about an hour ago; I was not sure if it was
the same problem, but I disabled HT for testing by setting the online flag
to 0 on the extra CPUs shown by lstopo. Unfortunately it also crashes, so
HT may not be the problem.
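For reference, taking the second hardware thread of each core offline can be sketched like this; CPU ids 8-15 are the HT siblings shown in the lstopo output above (adjust for your own topology), and root privileges are needed to actually apply the writes:

```shell
# Print the sysfs writes that take CPUs 8-15 offline.
# Pipe the output through "sudo sh" to actually apply them.
for n in $(seq 8 15); do
  echo "echo 0 > /sys/devices/system/cpu/cpu$n/online"
done
```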

> Apparently other people got everything (OpenMPI with HT on Nehalem)
> working in more stable distributions (CentOS, RHEL, etc).
>
> That problem was likely to be in the FC13 kernel,
> because even turning off HT I still had the machine hanging.
> Nothing worked with shared memory turned on,
> so I had to switch OpenMPI to use tcp instead,
> which is kind of ridiculous in a standalone machine.

--> very interesting, sm could be the problem

>
>
>
>> I'm trying to locate the kernel error in the logs, but after rebooting from
>> a crash, the error is not in kern.log (nor in kern.log.1).
>> All I remember is that it starts with "Kernel BUG..."
>> and at some point it mentions a certain CPU X, where that CPU can be any from 0
>> to 15 (I'm testing only on the main node). Does anyone know where the kernel
>> error could be logged?
>>
>>
> Have you tried to turn off hyperthreading?
>

--> yes, I tried; same crashes.

> In any case, depending on the application, it may not help much performance
> to have HT on.
>
> A more radical alternative is to try
> -mca btl tcp,self
> in the mpirun command line.
> That is what worked in the case I mentioned above.
>

wow!, this really worked :), you pinpointed the problem: it was shared
memory.
I have 4 nodes, so there will be inter-node communication anyway; do you think
I can rely on working with -mca btl tcp,self?? I don't mind a small lag.
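(For reference, the same BTL selection can also be made persistent in an MCA parameter file instead of being passed on every mpirun command line; a sketch, where the system-wide path depends on the --prefix chosen at configure time:)

```
# ~/.openmpi/mca-params.conf (per user), or
# /opt/openmpi-1.4.2/etc/openmpi-mca-params.conf (all users)
btl = tcp,self
```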

I just have one more question: is this a problem with the Ubuntu Server
kernel?? With the Nehalem CPUs?? With openMPI (I don't think so)??

And what determines whether, in the future, sm could work on the same
configuration I have?? A kernel update?

Thanks very much Gus, really!
Cristobal

>
> My $0.02
> Gus Correa
>
>
>> Cristobal Navarro wrote:
>>
>>
>> On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa <gus_at_[hidden]> wrote:
>>
>> Hi Cristobal
>>
>> Does it run only on the head node alone?
>> (Fuego? Agua? Acatenango?)
>> Try to put only the head node on the hostfile and execute
>> with mpiexec.
>>
>> --> I will try only with the head node, and post results back.
>> This may help sort out what is going on.
>> Hopefully it will run on the head node.
>>
>> Also, do you have Infiniband connecting the nodes?
>> The error messages refer to the openib btl (i.e. Infiniband),
>> and complain of
>>
>>
>> no, we are just using a normal 100 Mbit/s network, since I am only
>> testing for now.
>>
>>
>> "perhaps a missing symbol, or compiled for a different
>> version of Open MPI?".
>> It sounds like a mixup of versions/builds.
>>
>>
>> --> I agree; somewhere there must be remains of the older
>> version
>>
>> Did you configure/build OpenMPI from source, or did you install
>> it with apt-get?
>> It may be easier/less confusing to install from source.
>> If you did, what configure options did you use?
>>
>>
>> --> I installed from source: ./configure
>> --prefix=/opt/openmpi-1.4.2 --with-sge --without-xgrid
>> --disable-static
>>
>> Also, as for the OpenMPI runtime environment,
>> it is not enough to set it on
>> the command line, because it will be effective only on the
>> head node.
>> You need to either add them to the PATH and LD_LIBRARY_PATH
>> on your .bashrc/.cshrc files (assuming these files and your home
>> directory are *also* shared with the nodes via NFS),
>> or use the --prefix option of mpiexec to point to the OpenMPI
>> main
>> directory.
>>
>>
>> yes, all nodes have PATH and LD_LIBRARY_PATH set up
>> properly in the login scripts (.bashrc in my case)
>>
>> Needless to say, you need to check and ensure that the OpenMPI
>> directory (and maybe your home directory, and your work
>> directory)
>> is (are)
>> really mounted on the nodes.
>>
>>
>> --> yes, double-checked that they are
>>
>> I hope this helps,
>>
>>
>> --> thanks really!
>>
>> Gus Correa
>>
>> Update: I just reinstalled openMPI with the same parameters, and it
>> seems the problem is gone; I couldn't test thoroughly, but when I
>> get back to the lab I'll confirm.
>>
>> best regards! Cristobal
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users