
Open MPI User's Mailing List Archives


Subject: [OMPI users] Program hangs
From: Jiaye Li (jameslipd_at_[hidden])
Date: 2009-11-23 20:24:10


Dear Eugene,

I am sorry that I may not have explained the problem clearly last time. The
problem is that I tested Open MPI with the PWscf program on one quad-core
node. For the first several hours, the program ran quite well. When the
electronic SCF was about to converge, the program started to hang; for
example, it hangs at the first SCF iteration of BFGS step 23. I waited
another 10 hours for the program to continue, but in vain.

The kernel is 2.6.29.4-167.fc11.i686.PAE.
The following are the compiler packages I used to build Open MPI. I
configured Open MPI with CC=gcc and FC=ifort.

********************************************************************************
intel-icc101018-10.1.018-1.i386
libgcc-4.4.0-4.i586
gcc-4.4.0-4.i586
gcc-gfortran-4.4.0-4.i586
gcc-c++-4.4.0-4.i586
intel-ifort101018-10.1.018-1.i386

and the architecture is:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 10
cpu MHz : 2825.937
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc
arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5651.87
clflush size : 64
power management:

**************************************************************************************************************
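Since trac ticket 2043 is suspected to involve the shared-memory (sm) transport, one way to narrow this down is to rerun the job with the sm BTL excluded and, when a hang does occur, pull stack traces from the stuck processes. A diagnostic sketch; the executable name, input/output files, process count, and <pid> are placeholders, not the actual job:

```shell
# Exclude the sm BTL, forcing TCP even within the single node.
# If the hang disappears, the shared-memory transport is implicated.
mpirun --mca btl self,tcp -np 4 ./pw.x < pw.in > pw.out

# When a run hangs, attach gdb to one of the stuck ranks
# (substitute a real process ID from `top` or `ps`) and dump
# all thread backtraces to see where the process is spinning.
gdb -p <pid> -batch -ex "thread apply all bt"
```

Comparing backtraces from each rank usually shows whether the processes are stuck inside the MPI progress engine or in the application itself.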

On Tue, Nov 24, 2009 at 7:27 AM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:

> I can't tell if these problems are related to trac ticket 2043 or not.
>
> Compiler: In my experience, trac 2043 depends on GCC 4.4.x. It isn't
> necessarily a GCC bug... perhaps it's just exposing an OMPI problem. I'm
> confused what compiler Jiaye is using, and Vasilis is apparently seeing a
> problem when using the PGI compiler. But, maybe other compilers in
> addition to GCC 4.4.x are exposing the problem.
>
> Severity: In my experience, trac 2043 shows up rather dramatically:
> within dozens to hundreds of iterations of simple message patterns. So, a
> problem that shows up only after hours of execution feels to me to be
> something different. But maybe I misunderstand Jiaye's and Vasilis's cases:
> are the programs running well for several hours before the hang occurs?
>
> Shared memory: Trac 2043 appears related to shared memory. Jiaye seems to
> run on a single node. Vasilis talks of running on a "cluster" -- so I don't
> know if that means over an interconnect or still using sm.
>
> Anyhow, it's hard to know which problems are the same or different when we
> don't yet really understand what's going on.
>
> vasilis gkanis wrote:
>
> I also experience a similar problem with the MUMPS solver when I run it
>> on a cluster. After several hours of running, the code does not produce any
>> results, although the command top shows that the program occupies 100% of
>> the CPU.
>>
>> The difference here, however, is that the same program runs fine on my PC.
>> The differences between my PC and the cluster are:
>> 1) 32-bit (PC) vs. 64-bit (cluster)
>> 2) Intel compiler (PC) vs. Portland compiler (cluster)
>>
>> On Friday 20 November 2009 03:50:17 am Jiaye Li wrote:
>>
>>
>>> I installed openmpi-1.3.3 on my single-node Intel 64-bit quad-core
>>> machine. The compiler info is:
>>>
>>>
>>> ***************************************************************************
>>> intel-icc101018-10.1.018-1.i386
>>> libgcc-4.4.0-4.i586
>>> gcc-4.4.0-4.i586
>>> gcc-gfortran-4.4.0-4.i586
>>> gcc-c++-4.4.0-4.i586
>>> intel-ifort101018-10.1.018-1.i386
>>> ***************************************************************************
>>>
>>> I compiled the PWscf program with Open MPI and tested it. At the
>>> beginning, the execution of PW went well, but after about 10 hours, when
>>> the program was about to finish, it hung, while the CPU remained
>>> occupied (100% taken up by the program). There seems to be something
>>> wrong somewhere. Any ideas? Thank you in advance.
>>>
>>>
>>
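For reference, the "simple message patterns" mentioned above for trac 2043 can be exercised with a minimal ping-pong loop like the following. This is a hypothetical reproducer sketch, not the ticket's actual test case; the iteration count and progress interval are arbitrary choices:

```c
/* Minimal two-rank ping-pong over MPI_COMM_WORLD. Over the sm BTL,
 * patterns like this were reported to hang within dozens to hundreds
 * of iterations when trac 2043 is triggered.
 * Build: mpicc pingpong.c -o pingpong
 * Run:   mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, i, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        /* Progress marker so a hang is immediately visible. */
        if (rank == 0 && i % 10000 == 0)
            printf("iteration %d\n", i);
    }

    MPI_Finalize();
    return 0;
}
```

If this loop completes reliably but the application still hangs after hours, the problem is likely something other than the short-iteration failure mode described for ticket 2043.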

-- 
Sincerely yours
Jiaye Li



  • application/octet-stream attachment: Makefile