Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-11-16 22:10:28


From what you sent, it appears that Open MPI thinks your processes called
MPI_Abort (as opposed to segfaulting or some other failure mode). The system
appears to be operating exactly as it should - it just thinks your program
aborted the job - i.e., that one or more processes actually called MPI_Abort
for some reason.

Have you tried running your code without valgrind? I'm wondering if the
valgrind interaction may be part of the problem.

Do you have a code path in your program that would lead to MPI_Abort? I'm
wondering if you have some logic that might abort if it encounters what it
believes is a problem. If so, you might put some output in that path to see
if you are traversing it. Then we would have some idea as to why the code
thinks it *should* abort.
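
As a rough illustration - a minimal sketch, not code from your program - an
instrumented abort path in Fortran 90 might look like the following (the
routine and variable names here are hypothetical):

  ! Hypothetical sanity check that aborts on what it considers a bad grid.
  ! The write() before MPI_Abort records which rank aborted and why.
  subroutine check_grid(nx, ny, nz)
    use mpi
    implicit none
    integer, intent(in) :: nx, ny, nz
    integer :: rank, ierr

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    if (nx <= 0 .or. ny <= 0 .or. nz <= 0) then
      write (*, *) 'rank ', rank, ': bad grid ', nx, ny, nz, &
                   ' - calling MPI_Abort'
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check_grid

A single line of output like that, printed right before the abort, would at
least tell us which rank hit the check and what values it saw.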

Others may also have suggestions. Most of the team is at the Supercomputing
show this week and won't really be available until next week or after
Thanksgiving.

Ralph

On 11/16/06 2:51 PM, "Victor Prosolin" <victor.prosolin_at_[hidden]> wrote:

> Hi all.
> I have been fighting with this problem for weeks now, and I am getting
> quite desperate about it. I hope I can get help here, because the local
> folks couldn't help me.
>
> There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
> (Debian 1:3.3.4-13) (some more info at http://www.capca.ucalgary.ca).
> They have some MPI libraries installed (LAM, I believe), but since those
> don't support Fortran 90, I compile my own library. I install it in my
> home directory /home/victor/programs. I configure with the following
> options:
>
> F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
> --prefix=/home/victor/programs --enable-pretty-print-stacktrace
> --config-cache --disable-shared --enable-static
>
> It compiles and installs with no errors. But when I run my code by using
> mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable
> (mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec to
> avoid conflict with system-wide mpiexec)
>
> it dies silently with no errors shown - it just stops and prints:
> 2 additional processes aborted (not shown)
>
> The failure depends on the number of grid points: for some small grid
> sizes (40x10x10) it runs fine, but the size at which I start getting
> problems is absurdly small (around 40x20x10), so it can't be an
> insufficient-memory issue - the cluster server has 2 GB of memory and I
> can run my code in serial mode with at least 200x100x100.
>
> Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to
> compile the library, but I have tried different compiler combinations
> (g95-gcc, ifort-gcc4.1) - same result every time. As far as I can tell,
> it is not an error in my code either: I have done numerous checks, and
> it runs fine on my PC, though on my PC I compiled the library with ifort
> and icc.
> And here comes the weirdest part: if I run my code through valgrind in
> MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it
> runs fine with grid sizes it fails on without valgrind! It doesn't exit
> mpiexec, but it does get to the last statement of my code.
>
> I am attaching config.log and ompi_info.log.
> The following is the output of mpiexec -d -np 4 ./model-0.0.9:
>
> [obelix:08876] procdir: (null)
> [obelix:08876] jobdir: (null)
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe
> [obelix:08876] top: openmpi-sessions-victor_at_obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] connect_uni: contact info read
> [obelix:08876] connect_uni: connection not allowed
> [obelix:08876] [0,0,0] setting up session dir with
> [obelix:08876] tmpdir /tmp
> [obelix:08876] universe default-universe-8876
> [obelix:08876] user victor
> [obelix:08876] host obelix
> [obelix:08876] jobid 0
> [obelix:08876] procid 0
> [obelix:08876] procdir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0/0
> [obelix:08876] jobdir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876
> [obelix:08876] top: openmpi-sessions-victor_at_obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] [0,0,0] contact_file
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/universe-setup.txt
> [obelix:08876] [0,0,0] wrote setup file
> [obelix:08876] pls:rsh: local csh: 0, local bash: 1
> [obelix:08876] pls:rsh: assuming same remote shell as local shell
> [obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
> [obelix:08876] pls:rsh: final template argv:
> [obelix:08876] pls:rsh: /usr/bin/ssh <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> <template> --universe victor_at_obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 0
> [obelix:08876] pls:rsh: launching on node localhost
> [obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
> 1 (1 4)
> [obelix:08876] pls:rsh: localhost is a LOCAL node
> [obelix:08876] pls:rsh: changing to directory /home/victor
> [obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
> 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
> victor_at_obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 1
> [obelix:08877] [0,0,1] setting up session dir with
> [obelix:08877] universe default-universe-8876
> [obelix:08877] user victor
> [obelix:08877] host localhost
> [obelix:08877] jobid 0
> [obelix:08877] procid 1
> [obelix:08877] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0/1
> [obelix:08877] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0
> [obelix:08877] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08877] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08877] tmp: /tmp
> [obelix:08878] [0,1,0] setting up session dir with
> [obelix:08878] universe default-universe-8876
> [obelix:08878] user victor
> [obelix:08878] host localhost
> [obelix:08878] jobid 1
> [obelix:08878] procid 0
> [obelix:08878] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/0
> [obelix:08878] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08878] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08878] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08878] tmp: /tmp
> [obelix:08879] [0,1,1] setting up session dir with
> [obelix:08879] universe default-universe-8876
> [obelix:08879] user victor
> [obelix:08879] host localhost
> [obelix:08879] jobid 1
> [obelix:08879] procid 1
> [obelix:08879] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/1
> [obelix:08879] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08879] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08879] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08879] tmp: /tmp
> [obelix:08880] [0,1,2] setting up session dir with
> [obelix:08880] universe default-universe-8876
> [obelix:08880] user victor
> [obelix:08880] host localhost
> [obelix:08880] jobid 1
> [obelix:08880] procid 2
> [obelix:08880] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/2
> [obelix:08880] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08880] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08880] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08880] tmp: /tmp
> [obelix:08881] [0,1,3] setting up session dir with
> [obelix:08881] universe default-universe-8876
> [obelix:08881] user victor
> [obelix:08881] host localhost
> [obelix:08881] jobid 1
> [obelix:08881] procid 3
> [obelix:08881] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/3
> [obelix:08881] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08881] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08881] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08881] tmp: /tmp
> [obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
> [obelix:08876] Info: Setting up debugger process table for applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 4
> MPIR_proctable:
> (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
> (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
> (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
> (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
> [obelix:08878] [0,1,0] ompi_mpi_init completed
> [obelix:08879] [0,1,1] ompi_mpi_init completed
> [obelix:08880] [0,1,2] ompi_mpi_init completed
> [obelix:08881] [0,1,3] ompi_mpi_init completed
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_ABORTED)
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_TERMINATED)
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: found job session dir empty - deleting
> [obelix:08877] sess_dir_finalize: univ session dir not empty - leaving
>
> Thank you,
> Victor Prosolin.
>
> [ompi_info output]
>
> Open MPI: 1.1.2
> Open MPI SVN revision: r12073
> Open RTE: 1.1.2
> Open RTE SVN revision: r12073
> OPAL: 1.1.2
> OPAL SVN revision: r12073
> Prefix: /home/victor/programs
> Configured architecture: i686-pc-linux-gnu
> Configured by: victor
> Configured on: Thu Nov 16 13:06:12 MST 2006
> Configure host: obelix
> Built by: victor
> Built on: Thu Nov 16 13:42:40 MST 2006
> Built host: obelix
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: distcc
> C compiler absolute: /home/victor/programs/bin/distcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: ifort
> Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> Fortran90 compiler: ifort
> Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)