Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-01-22 10:21:54


Well, this is a little strange. The hanging behavior is gone, but I'm getting a segfault now. The output of "hello_c.c" and "ring_c.c" are attached.

I'm getting a segfault with the Fortran test, also. I'm afraid I may have polluted the experiment by removing the target openmpi-1.6.5 installation directory yesterday. To produce the attached outputs, I just went back and did "make install" in the openmpi-1.6.5 build directory. I've re-set the environment variables as they were a few days ago by sourcing the same bash script. Perhaps I forgot something, or something on the system changed? Regardless, LD_LIBRARY_PATH and PATH are set correctly, and aberrant behavior persists.

The reason for deleting the openmpi-1.6.5 installation was that I went back and installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-1.4.3 can run the simple tests without issue, but on my "real" program, I'm getting symbol lookup errors:

mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

Perhaps that's a separate thread.

>-----Original Message-----
>From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Tuesday, January 21, 2014 3:57 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Just for giggles, can you repeat the same test but with hello_c.c and ring_c.c?
>I.e., let's get the Fortran out of the way and use just the base C bindings, and
>see what happens.
>
>
>On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." <fischega_at_[hidden]>
>wrote:
>
>> I just tried running "hello_f90.f90" and see the same behavior: 100% CPU
>usage, gradually increasing memory consumption, and failure to get past
>mpi_finalize. LD_LIBRARY_PATH is set as:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib
>>
>> The installation target for this version of OpenMPI is:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5
>>
>> 1045
>> fischega_at_lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]>
>> which mpirun
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpir
>> un
>>
>> Perhaps something strange is happening with GCC? I've tried simple hello
>world C and Fortran programs, and they work normally.
>>
>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph
>> Castain
>> Sent: Sunday, January 19, 2014 11:36 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>> and consumes all system resources
>>
>> The OFED warning about registration is something OMPI added at one point
>when we isolated the cause of jobs occasionally hanging, so you won't see
>that warning from other MPIs or earlier versions of OMPI (I forget exactly
>when we added it).
>>
>> The problem you describe doesn't sound like an OMPI issue - it sounds like
>you've got a memory corruption problem in the code. Have you tried running
>the examples in our example directory to confirm that the installation is
>good?
>>
>> Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pickup
>the OMPI libs you installed - most Linux distros come with an older version,
>and that can cause problems if you inadvertently pick them up.
>>
>>
>> On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fischega_at_[hidden]>
>wrote:
>>
>>
>> Hello,
>>
>> I have a simple, 1-process test case that gets stuck on the mpi_finalize call.
>The test case is a dead-simple calculation of pi - 50 lines of Fortran. The
>process gradually consumes more and more memory until the system
>becomes unresponsive and needs to be rebooted, unless the job is killed
>first.
>>
>> In the output, attached, I see the warning message about OpenFabrics
>being configured to only allow registering part of physical memory. I've tried
>to chase this down with my administrator to no avail yet. (I am aware of the
>relevant FAQ entry.) A different installation of MPI on the same system,
>made with a different compiler, does not produce the OpenFabrics memory
>registration warning - which seems strange because I thought it was a system
>configuration issue independent of MPI. Also curious in the output is that LSF
>seems to think there are 7 processes and 11 threads associated with this job.
>>
>> The particulars of my configuration are attached and detailed below. Does
>anyone see anything potentially problematic?
>>
>> Thanks,
>> Greg
>>
>> OpenMPI Version: 1.6.5
>> Compiler: GCC 4.6.1
>> OS: SuSE Linux Enterprise Server 10, Patchlevel 2
>>
>> uname -a : Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02
>> UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
>>
>> LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-
>4.6.1/toolset/openmp
>> i-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/
>> lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
>>
>> PATH=
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tool
>> s/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/ca
>> sl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles1
>> 0/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera
>> _clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/li
>> nux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x
>> 86_64/bin:/usr/bin:.:/bin:/usr/scripts
>>
>> Execution command: (executed via LSF - effectively "mpirun -np 1
>> test_program")
>>
><output.txt><config.log.bz2><ompi_info.bz2>___________________________
>> ____________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>--
>Jeff Squyres
>jsquyres_at_[hidden]
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>_______________________________________________
>users mailing list
>users_at_[hidden]
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>



  • application/octet-stream attachment: hello.out

  • application/octet-stream attachment: ring.out