Open MPI User's Mailing List Archives

From: Parrott, Chris (chris.parrott_at_[hidden])
Date: 2005-10-18 13:24:55


Tim,

I just tried this same code again with 1.0rc4, and I still see the same
symptom. The gdb stack trace for a hung process looks a bit different
this time, however:

(gdb) bt
#0 0x0000002a98a085e1 in mca_bml_r2_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
#1 0x0000002a986c3080 in mca_pml_ob1_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_pml_ob1.so
#2 0x0000002a95d8378c in opal_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libopal.so.0
#3 0x0000002a95a6d8a5 in opal_condition_wait ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#4 0x0000002a95a6de49 in ompi_request_wait_all ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#5 0x0000002a95937602 in PMPI_Waitall ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10)
    at parallel.c:229
#7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)

It still seems to be stuck in the MPI_Waitall call, for some reason.
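
For readers comparing traces like the one above, here is a minimal sketch (not part of the original report) that pulls the frame numbers and function names out of gdb "bt" output and reports the first MPI call together with the application function that called it. The helper names parse_backtrace and blocking_mpi_call are hypothetical, and the sample is an excerpt of three frames copied from the rc4 trace above. It assumes frames start with "#N" and that MPI entry points begin with "MPI_" or "PMPI_".

```python
import re

# Matches the first line of a gdb frame, for example:
#   "#5 0x0000002a95937602 in PMPI_Waitall ()"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\w+)\s*\(")


def parse_backtrace(text):
    """Return a list of (frame_number, function_name) from gdb 'bt' output.

    Continuation lines ("from /path/lib.so", "at file.c:123") are skipped.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def blocking_mpi_call(frames):
    """Return (mpi_frame, caller_frame) for the innermost MPI_/PMPI_ frame.

    caller_frame is None if the MPI frame is the outermost one.
    Returns None if no MPI frame is present.
    """
    for index, (number, name) in enumerate(frames):
        if name.startswith(("MPI_", "PMPI_")):
            caller = frames[index + 1] if index + 1 < len(frames) else None
            return (number, name), caller
    return None


# Excerpt (frames 0, 5 and 6) of the rc4 trace quoted above.
SAMPLE = """\
#0 0x0000002a98a085e1 in mca_bml_r2_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
#5 0x0000002a95937602 in PMPI_Waitall ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10)
    at parallel.c:229
"""

if __name__ == "__main__":
    print(blocking_mpi_call(parse_backtrace(SAMPLE)))
```

Running the same helper over the rc3 and rc4 traces would show that both block in PMPI_Waitall called from rt_waitscanlines, and that only the frames inside the progress loop differ.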

Any ideas? If you need any additional information from me, please let
me know.

Thanks in advance,

+chris

--
Chris Parrott                    5204 E. Ben White Blvd., M/S 628
Product Development Engineer     Austin, TX 78741
Computational Products Group     (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices           chris.parrott_at_[hidden]
> -----Original Message-----
> From: Tim S. Woodall [mailto:twoodall_at_[hidden]] 
> Sent: Monday, October 17, 2005 2:01 PM
> To: Open MPI Users
> Cc: Parrott, Chris
> Subject: Re: [O-MPI users] OpenMPI hang issue
> 
> 
> Hello Chris,
> 
> Please give the next release candidate a try. There was an 
> issue w/ the GM port that was likely causing this.
> 
> Thanks,
> Tim
> 
> 
> Parrott, Chris wrote:
> > Greetings,
> > 
> > I have been testing OpenMPI 1.0rc3 on a rack of 8 2-processor (single
> > core) Opteron systems connected via both Gigabit Ethernet and Myrinet.
> > My testing has been mostly successful, although I have run into a
> > recurring issue on a few MPI applications.  The symptom is that the
> > computation seems to progress nearly to completion, and then suddenly
> > just hangs without terminating.  One code that demonstrates this is
> > the Tachyon parallel raytracer, available at:
> > 
> >   http://jedi.ks.uiuc.edu/~johns/raytracer/
> > 
> > I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the 
> > root cause of this particular problem.
> > 
> > I have attached the output of config.log to this message.  Here is the
> > output from ompi_info:
> > 
> >                 Open MPI: 1.0rc3r7730
> >    Open MPI SVN revision: r7730
> >                 Open RTE: 1.0rc3r7730
> >    Open RTE SVN revision: r7730
> >                     OPAL: 1.0rc3r7730
> >        OPAL SVN revision: r7730
> >                   Prefix: /opt/openmpi-1.0rc3-pgi-6.0
> >  Configured architecture: x86_64-unknown-linux-gnu
> >            Configured by: root
> >            Configured on: Mon Oct 17 10:10:28 PDT 2005
> >           Configure host: castor00
> >                 Built by: root
> >                 Built on: Mon Oct 17 10:29:20 PDT 2005
> >               Built host: castor00
> >               C bindings: yes
> >             C++ bindings: yes
> >       Fortran77 bindings: yes (all)
> >       Fortran90 bindings: yes
> >               C compiler: pgcc
> >      C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
> >             C++ compiler: pgCC
> >    C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
> >       Fortran77 compiler: pgf77
> >   Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
> >       Fortran90 compiler: pgf90
> >   Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
> >              C profiling: yes
> >            C++ profiling: yes
> >      Fortran77 profiling: yes
> >      Fortran90 profiling: yes
> >           C++ exceptions: no
> >           Thread support: posix (mpi: no, progress: no)
> >   Internal debug support: no
> >      MPI parameter check: runtime
> > Memory profiling support: no
> > Memory debugging support: no
> >          libltdl support: 1
> >               MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
> >            MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
> >            MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
> >            MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
> >                MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
> >            MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> >            MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
> >                   MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
> >                MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
> >                MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
> >                   MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
> >                   MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
> >                MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
> >                 MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
> >                  MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
> > 
> > 
> > Here is the command-line I am using to invoke OpenMPI for my build of
> > Tachyon:
> > 
> > /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix 
> > /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile 
> > hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
> > 
> > Attaching gdb to one of the hung processes, I get the following stack
> > trace:
> > 
> > (gdb) bt
> > #0  0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #1  0x0000002a95d83509 in opal_timer_base_get_cycles ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #2  0x0000002a95d8370c in opal_progress ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #3  0x0000002a95a6d8a5 in opal_condition_wait ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #4  0x0000002a95a6de49 in ompi_request_wait_all ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #5  0x0000002a95937602 in PMPI_Waitall ()
> >    from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60)
> >     at parallel.c:229
> > #7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
> > #8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
> > #9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
> > (gdb)
> > 
> > So based on this stack trace, it appears that the application is 
> > hanging on an MPI_Waitall call for some reason.
> > 
> > Does anyone have any ideas as to why this might be happening?  If this
> > is covered in the FAQ somewhere, then please accept my apologies in
> > advance.
> > 
> > Many thanks,
> > 
> > +chris
> > 
> > --
> > Chris Parrott                    5204 E. Ben White Blvd., M/S 628
> > Product Development Engineer     Austin, TX 78741
> > Computational Products Group     (512) 602-8710 / (512) 602-7745 (fax)
> > Advanced Micro Devices           chris.parrott_at_[hidden]
> > 
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users