Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] program stalls in __write_nocancel()
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-11-06 08:53:50


Hi Peter

Given how long it takes to hit the problem, have you checked your file
and disk quotas? Could be that the file is simply getting too big.
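
If it helps, here is a minimal sketch of one way to rule that out from inside the program: it reports the process file-size limit (RLIMIT_FSIZE) and, if stdout happens to be redirected to a regular file, how big that file has grown so far. The function name is just illustrative, and a filesystem quota (EDQUOT) would still need to be checked with your system's quota tools.

/* Minimal sketch: rule out file-size limits on the master's output.
 * Only standard POSIX calls are used; report_output_limits() is a
 * hypothetical helper you would call from the master now and then. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

static void report_output_limits (void)
{
    struct rlimit rl;
    struct stat st;

    if (getrlimit (RLIMIT_FSIZE, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            fprintf (stderr, "file size limit: unlimited\n");
        else
            fprintf (stderr, "file size limit: %llu bytes\n",
                     (unsigned long long) rl.rlim_cur);
    }

    /* If stdout is redirected to a file, report how big it is already. */
    if (fstat (STDOUT_FILENO, &st) == 0 && S_ISREG (st.st_mode))
        fprintf (stderr, "stdout file size so far: %llu bytes\n",
                 (unsigned long long) st.st_size);
}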

I'm also a tad curious how you got valgrind to work on OSX - I was
unaware it supported that environment.

If all that looks okay, then the next thing would be to put some kind
of check in handle_message to see what message you are actually
attempting to output when it hangs. See if there is something that
would cause fputs to have a heart attack - perhaps you have a message
counter that rolls over (e.g., a 16-bit counter that rolls after you
get too many messages).
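
For instance, something along these lines at the top of handle_message() would show which message and which sender you were on when it hangs, and whether fputs ever starts failing outright. The signature is taken from your backtrace (the type of world is guessed), and the counter and stderr logging are purely diagnostic - stderr is a separate, unbuffered stream, so it should still get through even while stdout is stuck.

/* Diagnostic sketch for the top of handle_message(); the signature is
 * guessed from the backtrace and "void *world" stands in for the real
 * type. Rip the stderr logging out again once the culprit is found. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

void handle_message (char *rawmessage, int sender, void *world)
{
    static unsigned long msgcount = 0;   /* survives across calls */

    msgcount++;
    fprintf (stderr, "handle_message #%lu from rank %d, %lu bytes\n",
             msgcount, sender, (unsigned long) strlen (rawmessage));

    /* fputs returns EOF on error; errno tells you why (EFBIG, EDQUOT, ...). */
    if (fputs (rawmessage, stdout) == EOF)
        fprintf (stderr, "fputs failed on message #%lu: %s\n",
                 msgcount, strerror (errno));
    fflush (stdout);

    /* ... rest of the original handle_message body ... */
}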

Ralph

On Nov 5, 2008, at 8:12 PM, Peter Beerli wrote:

> On some of my larger problems (50 or more nodes, 'long' runs of more
> than 5 hours), my program stalls and does not continue. The program is
> set up as a master-worker, and it seems that the master gets stuck in
> a write to stdout; see the gdb backtrace below (it took all day to get
> there on 50 nodes). The function handle_message is simply printing to
> stdout in this case. Of course the workers keep sending stuff to the
> master, but the master is stuck in a write that never finishes. Any
> idea where to look next? Smaller runs look fine, and valgrind did not
> find problems in my code (though it complains a lot about openmpi).
> I also attach the ompi_info output to show versions (OS is macos
> 10.5.5). Any hint is welcome!
>
> thanks
> Peter
>
> (gdb) bt
> #0 0x00000037528c0e50 in __write_nocancel () from /lib64/libc.so.6
> #1 0x00000037528694b3 in _IO_new_file_write () from /lib64/libc.so.6
> #2 0x00000037528693c6 in _IO_new_do_write () from /lib64/libc.so.6
> #3 0x000000375286a822 in _IO_new_file_xsputn () from /lib64/libc.so.6
> #4 0x000000375285f4f8 in fputs () from /lib64/libc.so.6
> #5 0x000000000045e9de in handle_message (rawmessage=0x4bb8830 "M0:[ 12] Swapping between 4 temperatures. \n", ' ' <repeats 11 times>, "Temperature | Accepted | Swaps between temperatures\n", ' ' <repeats 16 times>, "1e+06 | 0.00 | |\n", ' ' <repeats 15 times>, "3.0000 | 0.08 | 1 ||"..., sender=12, world=0x448d8b0) at migrate_mpi.c:3663
> #6 0x000000000045362a in mpi_runloci_master (loci=1, who=0x4541fc0, world=0x448d8b0, options_readsum=0, menu=0) at migrate_mpi.c:228
> #7 0x000000000044ed86 in run_sampler (options=0x448dc20, data=0x4465a10, universe=0x42b90c0, usize=4, outfilepos=0x7fff0ff98ee0, Gmax=0x7fff0ff98ee8) at main.c:885
> #8 0x000000000044dff2 in main (argc=3, argv=0x7fff0ff99008) at main.c:422
>
>
> petal:~>ompi_info
> Open MPI: 1.2.8
> Open MPI SVN revision: r19718
> Open RTE: 1.2.8
> Open RTE SVN revision: r19718
> OPAL: 1.2.8
> OPAL SVN revision: r19718
> Prefix: /home/beerli/openmpi
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: beerli
> Configured on: Mon Nov 3 15:00:02 EST 2008
> Configure host: petal
> Built by: beerli
> Built on: Mon Nov 3 15:08:02 EST 2008
> Built host: petal
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: yes
> mpirun default --prefix: no
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.8)
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.8)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.8)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.8)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.8)
> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.8)
> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.8)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.8)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.8)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.8)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.8)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.8)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.8)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.8)
> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.8)
> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.8)
> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.8)
> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.8)
> MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.8)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.8)
> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.8)
> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.8)
> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.8)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.8)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.8)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.8)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.8)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.8)
> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.8)
> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.8)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.8)
> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.8)
> MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.8)
> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.8)
> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.8)
> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.8)
> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.8)
> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.8)
> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.8)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.8)
> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.8)
> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.8)
> MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.8)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.8)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.8)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.8)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.8)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.8)
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users