
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-18 08:31:49


RC4 has been released with this fix and some others that should help
(http://www.open-mpi.org/software/v1.0/). Please let us know if it
fixes your problem.
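
For background, the failure mode reported below -- MPI_Waitall blocking forever because one peer's completion never arrives -- can be sketched in plain Python, with threads standing in for ranks and events standing in for requests. This is an illustrative analogue only, not Open MPI or Tachyon code; `wait_all` and the timeout are inventions for the sketch (a real MPI_Waitall has no timeout, so the second case would spin forever, exactly as seen in the backtrace):

```python
import threading
import time

def wait_all(events, timeout=1.0):
    # MPI_Waitall analogue: block until every pending completion fires.
    # Returns True if all completed, False if the deadline passes first.
    deadline = time.monotonic() + timeout
    for ev in events:
        remaining = deadline - time.monotonic()
        if remaining <= 0 or not ev.wait(remaining):
            return False
    return True

# Case 1: every "rank" delivers its message -- wait_all returns promptly.
done = [threading.Event() for _ in range(4)]
for ev in done:
    threading.Thread(target=ev.set).start()
all_completed = wait_all(done)

# Case 2: one peer's message is never delivered (e.g. dropped by a buggy
# transport) -- the same wait blocks until the artificial timeout; a real
# MPI_Waitall would poll opal_progress() indefinitely at this point.
partial = [threading.Event() for _ in range(4)]
for ev in partial[:3]:  # the fourth event is never set
    threading.Thread(target=ev.set).start()
stuck = not wait_all(partial, timeout=0.2)
```

The point is that a single lost message anywhere in the exchange is enough to wedge every rank sitting in the collective wait, which is why the symptom shows up "nearly at completion" rather than at a fixed point.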

On Oct 17, 2005, at 3:01 PM, Tim S. Woodall wrote:

> Hello Chris,
>
> Please give the next release candidate a try. There was an issue
> w/ the GM port that was likely causing this.
>
> Thanks,
> Tim
>
>
> Parrott, Chris wrote:
>> Greetings,
>>
>> I have been testing OpenMPI 1.0rc3 on a rack of 8 2-processor (single
>> core) Opteron systems connected via both Gigabit Ethernet and Myrinet.
>> My testing has been mostly successful, although I have run into a
>> recurring issue on a few MPI applications. The symptom is that the
>> computation seems to progress nearly to completion, and then suddenly
>> just hangs without terminating. One code that demonstrates this is the
>> Tachyon parallel raytracer, available at:
>>
>> http://jedi.ks.uiuc.edu/~johns/raytracer/
>>
>> I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the
>> root cause of this particular problem.
>>
>> I have attached the output of config.log to this message. Here is the
>> output from ompi_info:
>>
>> Open MPI: 1.0rc3r7730
>> Open MPI SVN revision: r7730
>> Open RTE: 1.0rc3r7730
>> Open RTE SVN revision: r7730
>> OPAL: 1.0rc3r7730
>> OPAL SVN revision: r7730
>> Prefix: /opt/openmpi-1.0rc3-pgi-6.0
>> Configured architecture: x86_64-unknown-linux-gnu
>> Configured by: root
>> Configured on: Mon Oct 17 10:10:28 PDT 2005
>> Configure host: castor00
>> Built by: root
>> Built on: Mon Oct 17 10:29:20 PDT 2005
>> Built host: castor00
>> C bindings: yes
>> C++ bindings: yes
>> Fortran77 bindings: yes (all)
>> Fortran90 bindings: yes
>> C compiler: pgcc
>> C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
>> C++ compiler: pgCC
>> C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
>> Fortran77 compiler: pgf77
>> Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
>> Fortran90 compiler: pgf90
>> Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
>> C profiling: yes
>> C++ profiling: yes
>> Fortran77 profiling: yes
>> Fortran90 profiling: yes
>> C++ exceptions: no
>> Thread support: posix (mpi: no, progress: no)
>> Internal debug support: no
>> MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>> libltdl support: 1
>> MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
>> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
>> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
>> MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
>> MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
>> MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
>> MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
>> MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
>> MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
>> MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
>> MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
>> MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
>> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
>> MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
>> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
>> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
>> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
>> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
>> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
>> MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
>> MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
>> MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
>> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
>> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
>> MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
>>
>>
>> Here is the command-line I am using to invoke OpenMPI for my build of
>> Tachyon:
>>
>> /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix
>> /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile
>> hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
>>
>> Attaching gdb to one of the hung processes, I get the following stack
>> trace:
>>
>> (gdb) bt
>> #0 0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>> #1 0x0000002a95d83509 in opal_timer_base_get_cycles ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>> #2 0x0000002a95d8370c in opal_progress ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>> #3 0x0000002a95a6d8a5 in opal_condition_wait ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>> #4 0x0000002a95a6de49 in ompi_request_wait_all ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>> #5 0x0000002a95937602 in PMPI_Waitall ()
>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>> #6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60)
>> at parallel.c:229
>> #7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
>> #8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
>> #9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
>> (gdb)
>>
>> So based on this stack trace, it appears that the application is
>> hanging on an MPI_Waitall call for some reason.
>>
>> Does anyone have any ideas as to why this might be happening? If this
>> is covered in the FAQ somewhere, then please accept my apologies in
>> advance.
>>
>> Many thanks,
>>
>> +chris
>>
>> --
>> Chris Parrott 5204 E. Ben White Blvd., M/S 628
>> Product Development Engineer Austin, TX 78741
>> Computational Products Group (512) 602-8710 / (512) 602-7745 (fax)
>> Advanced Micro Devices chris.parrott_at_[hidden]
>>
>>
>> ----------------------------------------------------------------------
>> --
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/