Open MPI User's Mailing List Archives

From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-10-17 14:01:02


Hello Chris,

Please give the next release candidate a try. There was an issue
with the GM port that was likely causing this.
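
In the meantime, if you want to confirm that GM is the culprit, you
can take the gm transport out of the picture and run over TCP and
shared memory only. Something like this (your command line with one
extra option) should do it:

  /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --mca btl tcp,sm,self \
      --prefix /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh \
      --hostfile hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat

If the hang goes away with that setting, the GM port is almost
certainly to blame.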

Thanks,
Tim

Parrott, Chris wrote:
> Greetings,
>
> I have been testing Open MPI 1.0rc3 on a rack of eight 2-processor
> (single-core) Opteron systems connected via both Gigabit Ethernet
> and Myrinet. My testing has been mostly successful, although I have
> run into a recurring issue with a few MPI applications. The symptom
> is that the computation seems to progress nearly to completion and
> then suddenly hangs without terminating. One code that demonstrates
> this is the Tachyon parallel raytracer, available at:
>
> http://jedi.ks.uiuc.edu/~johns/raytracer/
>
> I am using PGI 6.0-5 to compile Open MPI, so the compiler may be
> part of the root cause of this particular problem.
>
> I have attached the output of config.log to this message. Here is the
> output from ompi_info:
>
> Open MPI: 1.0rc3r7730
> Open MPI SVN revision: r7730
> Open RTE: 1.0rc3r7730
> Open RTE SVN revision: r7730
> OPAL: 1.0rc3r7730
> OPAL SVN revision: r7730
> Prefix: /opt/openmpi-1.0rc3-pgi-6.0
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: root
> Configured on: Mon Oct 17 10:10:28 PDT 2005
> Configure host: castor00
> Built by: root
> Built on: Mon Oct 17 10:29:20 PDT 2005
> Built host: castor00
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> C compiler: pgcc
> C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
> C++ compiler: pgCC
> C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
> Fortran77 compiler: pgf77
> Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
> Fortran90 compiler: pgf90
> Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: 1
> MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
> MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
> MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
> MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
> MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
> MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
> MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
> MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
> MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
> MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
> MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
> MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
>
>
> Here is the command line I am using to invoke Open MPI for my build
> of Tachyon:
>
> /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix
> /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile
> hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
>
> Attaching gdb to one of the hung processes, I get the following stack
> trace:
>
> (gdb) bt
> #0 0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> #1 0x0000002a95d83509 in opal_timer_base_get_cycles ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> #2 0x0000002a95d8370c in opal_progress ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> #3 0x0000002a95a6d8a5 in opal_condition_wait ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> #4 0x0000002a95a6de49 in ompi_request_wait_all ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> #5 0x0000002a95937602 in PMPI_Waitall ()
> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> #6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60)
> at parallel.c:229
> #7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
> #8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
> #9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
> (gdb)
>
> Based on this stack trace, it appears that the application is
> hanging in an MPI_Waitall call.
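>
> For what it's worth, rt_waitscanlines() presumably boils down to
> waiting on a set of outstanding receives, so the pattern that is
> stuck is roughly the following sketch (hypothetical names, tags, and
> ranks, not Tachyon's actual code):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> /* Hypothetical sketch: the master posts one nonblocking receive per
>  * worker and then blocks until all of them complete. If the
>  * transport stalls a single message, MPI_Waitall() never returns
>  * and the process spins in opal_progress(), matching the trace. */
> void wait_for_scanlines(int nworkers, char **bufs, int buflen)
> {
>     MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
>     int i;
>
>     for (i = 0; i < nworkers; i++)
>         MPI_Irecv(bufs[i], buflen, MPI_CHAR, i + 1 /* worker rank */,
>                   0 /* scanline tag */, MPI_COMM_WORLD, &reqs[i]);
>
>     /* Hangs here if even one receive is never completed. */
>     MPI_Waitall(nworkers, reqs, MPI_STATUSES_IGNORE);
>     free(reqs);
> }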
>
> Does anyone have any ideas as to why this might be happening? If this
> is covered in the FAQ somewhere, then please accept my apologies in
> advance.
>
> Many thanks,
>
> +chris
>
> --
> Chris Parrott
> Product Development Engineer, Computational Products Group
> Advanced Micro Devices
> 5204 E. Ben White Blvd., M/S 628, Austin, TX 78741
> (512) 602-8710 / (512) 602-7745 (fax)
> chris.parrott_at_[hidden]