Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-28 14:26:05


Sorry to take so long to reply -- as a token of my apology, please
accept this patch to your Make-arch, which fixes up a few of the
LAM/MPI entries and adds entries for Open MPI (yay open source!). :-)

(note that the LAM/MPI and Open MPI entries are identical except for
the ARCH strings)

We have committed a bunch of fixes post-rc4 that seem to have fixed the
problems in your raytracer app -- I know that we still have some bugs
left, but I am able to run tachyon with the 2balls.dat sample file over
Myrinet with 16 processes.
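
(For anyone else digging through the archives: both backtraces in this
thread bottom out in the same place -- a batch of nonblocking receives
completed with a single MPI_Waitall in rt_waitscanlines. The sketch
below is *not* Tachyon's actual parallel.c; the buffer sizes, tags, and
"scanline" framing are invented purely to illustrate the communication
pattern that was getting stuck.)

/* Minimal sketch of the Irecv + MPI_Waitall pattern the traces show.
 * Everything here (sizes, tags, the scanline framing) is hypothetical;
 * it is not code from Tachyon or Open MPI. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SCANLINE_LEN 1024   /* hypothetical payload size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* "Master" posts one nonblocking receive per worker, then blocks
         * in MPI_Waitall -- the spot where the hung processes sit in
         * ompi_request_wait_all / PMPI_Waitall in both backtraces. */
        float *bufs = malloc((size - 1) * SCANLINE_LEN * sizeof(float));
        MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
        for (int i = 1; i < size; ++i) {
            MPI_Irecv(bufs + (i - 1) * SCANLINE_LEN, SCANLINE_LEN, MPI_FLOAT,
                      i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
        }
        MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
        printf("received %d scanlines\n", size - 1);
        free(bufs);
        free(reqs);
    } else {
        /* Each worker sends one "scanline" back to rank 0. */
        float scanline[SCANLINE_LEN] = {0};
        MPI_Send(scanline, SCANLINE_LEN, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If a toy like that runs cleanly over GM with the snapshot but the full
raytracer still hangs, that would at least suggest the remaining
problem is specific to Tachyon's message sizes or timing.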

I just initiated a snapshot tarball creation; it should be up on the
web site under the "nightly snapshots" downloads section in ~30
minutes: http://www.open-mpi.org/nightly/v1.0/. Look for r7924.

Can you give it a whirl again with this tarball (or svn checkout)?
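
(If it does still wedge with the new tarball, one low-tech way to see
which requests are stuck -- instead of sitting inside MPI_Waitall -- is
to poll with MPI_Test and log the stragglers after a timeout. The
helper below is only a hypothetical sketch, not part of Tachyon or Open
MPI; something along these lines could be dropped in at the
rt_waitscanlines call site while debugging.)

/* Hypothetical debugging aid: wait on a set of requests, but after
 * `timeout` seconds report which ones have not completed.  This is
 * just a sketch for diagnosing hangs like the one in this thread. */
#include <mpi.h>
#include <stdio.h>

static int waitall_with_timeout(int count, MPI_Request reqs[], double timeout)
{
    double start = MPI_Wtime();
    int remaining = count;

    while (remaining > 0) {
        remaining = 0;
        for (int i = 0; i < count; ++i) {
            if (reqs[i] == MPI_REQUEST_NULL)
                continue;                     /* already completed */
            int flag = 0;
            MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);   /* nonblocking check */
            if (!flag)
                ++remaining;
        }
        if (remaining > 0 && MPI_Wtime() - start > timeout) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            for (int i = 0; i < count; ++i) {
                if (reqs[i] != MPI_REQUEST_NULL)
                    fprintf(stderr, "[rank %d] request %d still pending after %.0f s\n",
                            rank, i, timeout);
            }
            return remaining;                 /* caller decides what to do */
        }
    }
    return 0;                                 /* all requests completed */
}

In rt_waitscanlines the request index presumably maps back to a
scanline/peer, so this would at least show whether one particular
transfer never completes or whether everything is simply slow.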

Thanks!


On Oct 18, 2005, at 2:24 PM, Parrott, Chris wrote:

>
> Tim,
>
> I just tried this same code again with 1.0rc4, and I still see the same
> symptom. The gdb stack trace for a hung process looks a bit different
> this time, however:
>
> (gdb) bt
> #0 0x0000002a98a085e1 in mca_bml_r2_progress ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
> #1 0x0000002a986c3080 in mca_pml_ob1_progress ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_pml_ob1.so
> #2 0x0000002a95d8378c in opal_progress ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/libopal.so.0
> #3 0x0000002a95a6d8a5 in opal_condition_wait ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
> #4 0x0000002a95a6de49 in ompi_request_wait_all ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
> #5 0x0000002a95937602 in PMPI_Waitall ()
> from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
> #6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10)
> at parallel.c:229
> #7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
> #8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
> #9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
> (gdb)
>
>
> It still seems to be stuck in the MPI_Waitall call, for some reason.
>
> Any ideas? If you need any additional information from me, please let
> me know.
>
> Thanks in advance,
>
> +chris
>
> --
> Chris Parrott 5204 E. Ben White Blvd., M/S 628
> Product Development Engineer Austin, TX 78741
> Computational Products Group (512) 602-8710 / (512) 602-7745 (fax)
> Advanced Micro Devices chris.parrott_at_[hidden]
>
>> -----Original Message-----
>> From: Tim S. Woodall [mailto:twoodall_at_[hidden]]
>> Sent: Monday, October 17, 2005 2:01 PM
>> To: Open MPI Users
>> Cc: Parrott, Chris
>> Subject: Re: [O-MPI users] OpenMPI hang issue
>>
>>
>> Hello Chris,
>>
>> Please give the next release candidate a try. There was an
>> issue w/ the GM port that was likely causing this.
>>
>> Thanks,
>> Tim
>>
>>
>> Parrott, Chris wrote:
>>> Greetings,
>>>
>>> I have been testing OpenMPI 1.0rc3 on a rack of 8 2-processor
>>> (single core) Opteron systems connected via both Gigabit Ethernet
>>> and Myrinet. My testing has been mostly successful, although I have
>>> run into a recurring issue on a few MPI applications. The symptom
>>> is that the computation seems to progress nearly to completion, and
>>> then suddenly just hangs without terminating. One code that
>>> demonstrates this is the Tachyon parallel raytracer, available at:
>>>
>>> http://jedi.ks.uiuc.edu/~johns/raytracer/
>>>
>>> I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the
>>> root cause of this particular problem.
>>>
>>> I have attached the output of config.log to this message. Here is
>>> the output from ompi_info:
>>>
>>> Open MPI: 1.0rc3r7730
>>> Open MPI SVN revision: r7730
>>> Open RTE: 1.0rc3r7730
>>> Open RTE SVN revision: r7730
>>> OPAL: 1.0rc3r7730
>>> OPAL SVN revision: r7730
>>> Prefix: /opt/openmpi-1.0rc3-pgi-6.0
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> Configured by: root
>>> Configured on: Mon Oct 17 10:10:28 PDT 2005
>>> Configure host: castor00
>>> Built by: root
>>> Built on: Mon Oct 17 10:29:20 PDT 2005
>>> Built host: castor00
>>> C bindings: yes
>>> C++ bindings: yes
>>> Fortran77 bindings: yes (all)
>>> Fortran90 bindings: yes
>>> C compiler: pgcc
>>> C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
>>> C++ compiler: pgCC
>>> C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
>>> Fortran77 compiler: pgf77
>>> Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
>>> Fortran90 compiler: pgf90
>>> Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
>>> C profiling: yes
>>> C++ profiling: yes
>>> Fortran77 profiling: yes
>>> Fortran90 profiling: yes
>>> C++ exceptions: no
>>> Thread support: posix (mpi: no, progress: no)
>>> Internal debug support: no
>>> MPI parameter check: runtime
>>> Memory profiling support: no
>>> Memory debugging support: no
>>> libltdl support: 1
>>> MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
>>> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
>>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
>>> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
>>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
>>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
>>> MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
>>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
>>> MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
>>> MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
>>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
>>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
>>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
>>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
>>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
>>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
>>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
>>> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
>>> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
>>> MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
>>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
>>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
>>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
>>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
>>>
>>>
>>> Here is the command-line I am using to invoke OpenMPI for my build
>>> of Tachyon:
>>>
>>> /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix
>>> /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile
>>> hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
>>>
>>> Attaching gdb to one of the hung processes, I get the following
>>> stack trace:
>>>
>>> (gdb) bt
>>> #0 0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>>> #1 0x0000002a95d83509 in opal_timer_base_get_cycles ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>>> #2 0x0000002a95d8370c in opal_progress ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
>>> #3 0x0000002a95a6d8a5 in opal_condition_wait ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>>> #4 0x0000002a95a6de49 in ompi_request_wait_all ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>>> #5 0x0000002a95937602 in PMPI_Waitall ()
>>> from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
>>> #6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60) at parallel.c:229
>>> #7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
>>> #8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
>>> #9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
>>> (gdb)
>>>
>>> So based on this stack trace, it appears that the application is
>>> hanging on an MPI_Waitall call for some reason.
>>>
>>> Does anyone have any ideas as to why this might be happening? If
>>> this is covered in the FAQ somewhere, then please accept my
>>> apologies in advance.
>>>
>>> Many thanks,
>>>
>>> +chris
>>>
>>> --
>>> Chris Parrott 5204 E. Ben White Blvd., M/S 628
>>> Product Development Engineer Austin, TX 78741
>>> Computational Products Group (512) 602-8710 / (512) 602-7745 (fax)
>>> Advanced Micro Devices chris.parrott_at_[hidden]
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/