Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-01-22 10:21:54


Well, this is a little strange. The hanging behavior is gone, but I'm getting a segfault now. The output of "hello_c.c" and "ring_c.c" are attached.

I'm getting a segfault with the Fortran test, too. I'm afraid I may have polluted the experiment by removing the target openmpi-1.6.5 installation directory yesterday. To produce the attached outputs, I just went back and did "make install" in the openmpi-1.6.5 build directory. I've reset the environment variables to what they were a few days ago by sourcing the same bash script. Perhaps I forgot something, or something on the system changed? Regardless, LD_LIBRARY_PATH and PATH are set correctly, and the aberrant behavior persists.

I deleted the openmpi-1.6.5 installation because I had gone back and installed openmpi-1.4.3, and with it the problem (mostly) went away. openmpi-1.4.3 runs the simple tests without issue, but on my "real" program I'm getting symbol lookup errors:

mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

Perhaps that's a separate thread.
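
In case it helps with diagnosis: an undefined-symbol error like that often means the MCA components being loaded come from a different Open MPI installation than the libmpi the program is actually running against. The sketch below is just my own illustration (not anything shipped with Open MPI); it assumes the OMPI_*_VERSION macros that Open MPI's mpi.h defines and glibc's dladdr(), so it has to be built with the mpicc under test and linked with -ldl. It prints the version baked into mpi.h at compile time and the shared object that actually resolves MPI_Init at run time, which makes an installation mismatch easy to spot:

    /* whichmpi.c - build with: mpicc -o whichmpi whichmpi.c -ldl */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <dlfcn.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        Dl_info info;

        MPI_Init(&argc, &argv);

        /* Compile-time version, from Open MPI's mpi.h macros. */
        printf("mpi.h used at compile time: Open MPI %d.%d.%d\n",
               OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);

        /* Ask the dynamic linker which shared object provided MPI_Init. */
        if (dladdr((void *) MPI_Init, &info) && info.dli_fname != NULL)
            printf("MPI_Init resolved from: %s\n", info.dli_fname);

        MPI_Finalize();
        return 0;
    }

If the two lines disagree, or the library path points somewhere other than the intended installation, that would explain the mca_paffinity_linux.so error.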

>-----Original Message-----
>From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Tuesday, January 21, 2014 3:57 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Just for giggles, can you repeat the same test but with hello_c.c and ring_c.c?
>I.e., let's get the Fortran out of the way and use just the base C bindings, and
>see what happens.
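
(hello_c.c in Open MPI's examples directory is essentially a minimal MPI program along these lines - this is a sketch from memory, so the shipped file may differ in detail, but it exercises only the base C bindings:)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello, world, I am %d of %d\n", rank, size);
        MPI_Finalize();

        return 0;
    }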
>
>
>On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." <fischega_at_[hidden]>
>wrote:
>
>> I just tried running "hello_f90.f90" and see the same behavior: 100% CPU
>usage, gradually increasing memory consumption, and failure to get past
>mpi_finalize. LD_LIBRARY_PATH is set as:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib
>>
>> The installation target for this version of OpenMPI is:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5
>>
>> 1045 fischega_at_lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]> which mpirun
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpirun
>>
>> Perhaps something strange is happening with GCC? I've tried simple hello
>world C and Fortran programs, and they work normally.
>>
>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph
>> Castain
>> Sent: Sunday, January 19, 2014 11:36 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>> and consumes all system resources
>>
>> The OFED warning about registration is something OMPI added at one point
>when we isolated the cause of jobs occasionally hanging, so you won't see
>that warning from other MPIs or earlier versions of OMPI (I forget exactly
>when we added it).
>>
>> The problem you describe doesn't sound like an OMPI issue - it sounds like
>you've got a memory corruption problem in the code. Have you tried running
>the examples in our example directory to confirm that the installation is
>good?
>>
>> Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pickup
>the OMPI libs you installed - most Linux distros come with an older version,
>and that can cause problems if you inadvertently pick them up.
>>
>>
>> On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fischega_at_[hidden]>
>wrote:
>>
>>
>> Hello,
>>
>> I have a simple, 1-process test case that gets stuck on the mpi_finalize call.
>The test case is a dead-simple calculation of pi - 50 lines of Fortran. The
>process gradually consumes more and more memory until the system
>becomes unresponsive and needs to be rebooted, unless the job is killed
>first.
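
(Not the actual program from this report - just a hypothetical C analogue with the same shape: MPI_Init, a local partial sum, an MPI_Reduce, and then the MPI_Finalize call where the hang is reported:)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, n = 1000000;
        double h, x, sum = 0.0, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank integrates its slice of 4/(1+x^2) over [0,1]. */
        h = 1.0 / (double) n;
        for (i = rank; i < n; i += size) {
            x = h * ((double) i + 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.16f\n", pi);

        /* The reported hang occurs here. */
        MPI_Finalize();
        return 0;
    }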
>>
>> In the output, attached, I see the warning message about OpenFabrics
>being configured to only allow registering part of physical memory. I've tried
>to chase this down with my administrator to no avail yet. (I am aware of the
>relevant FAQ entry.) A different installation of MPI on the same system,
>made with a different compiler, does not produce the OpenFabrics memory
>registration warning - which seems strange because I thought it was a system
>configuration issue independent of MPI. Also curious in the output is that LSF
>seems to think there are 7 processes and 11 threads associated with this job.
>>
>> The particulars of my configuration are attached and detailed below. Does
>anyone see anything potentially problematic?
>>
>> Thanks,
>> Greg
>>
>> OpenMPI Version: 1.6.5
>> Compiler: GCC 4.6.1
>> OS: SuSE Linux Enterprise Server 10, Patchlevel 2
>>
>> uname -a: Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
>>
>> LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
>>
>> PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts
>>
>> Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")
>>
>> <output.txt> <config.log.bz2> <ompi_info.bz2>
>
>
>--
>Jeff Squyres
>jsquyres_at_[hidden]
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>



  • application/octet-stream attachment: hello.out

  • application/octet-stream attachment: ring.out