Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2006-11-16 22:10:28


From what you sent, it appears that Open MPI thinks your processes called
MPI_Abort (as opposed to segfaulting or some other failure mode). The system
appears to be operating exactly as it should - it just thinks your program
aborted the job - i.e., that one or more processes actually called MPI_Abort
for some reason.
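
For illustration only - this is the failure mode Open MPI is reporting. The
program and variable names below are placeholders, not taken from your code;
a single rank calling MPI_Abort is enough for mpiexec to report the remaining
ranks as aborted:

  program abort_demo
    use mpi                      ! the f90 bindings you built with --enable-mpi-f90
    implicit none
    integer :: rank, ierr
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) then
       ! one rank aborting tears down the whole job
       call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end if
    call MPI_Finalize(ierr)
  end program abort_demo

Running that with mpiexec -np 4 should give you the same kind of "additional
processes aborted (not shown)" message you are seeing.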

Have you tried running your code without valgrind? I'm wondering if the
valgrind interaction may be part of the problem.

Do you have a code path in your program that would lead to MPI_Abort? I'm
wondering if you have some logic that might abort if it encounters what it
believes is a problem. If so, you might put some output in that path to see
if you are traversing it. Then we would have some idea as to why the code
thinks it *should* abort.
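
Something along these lines, purely as a sketch - the subroutine name, the
reason string, and the error code below are made up for illustration, not
taken from your program:

  ! a small helper you could call from any suspect error branch
  subroutine abort_with_message(reason, code)
    use mpi
    implicit none
    character(len=*), intent(in) :: reason
    integer, intent(in) :: code
    integer :: rank, ierr
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    write (*,*) 'rank ', rank, ': calling MPI_Abort - ', reason
    call MPI_Abort(MPI_COMM_WORLD, code, ierr)
  end subroutine abort_with_message

Then, in the branch you suspect, something like
call abort_with_message('suspicious value detected', 1) would at least tell
us which rank decided to abort and why.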

Others may also have suggestions. Most of the team is at the Supercomputing
show this week and won't really be available until next week or after
Thanksgiving.

Ralph

On 11/16/06 2:51 PM, "Victor Prosolin" <victor.prosolin_at_[hidden]> wrote:

> Hi all.
> I have been fighting with this problem for weeks now, and I am getting
> quite desperate about it. Hope I can get help here, because local folks
> couldn't help me.
>
> There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
> (Debian 1:3.3.4-13) (some more info at http://www.capca.ucalgary.ca).
> They have some MPI libraries installed (LAM, I believe), but since those
> don't support Fortran90, I compile my own library and install it in my home
> directory, /home/victor/programs. I configure with the following options:
>
> F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
> --prefix=/home/victor/programs --enable-pretty-print-stacktrace
> --config-cache --disable-shared --enable-static
>
> It compiles and installs with no errors. But when I run my code by using
> mpiexec1 -np 4 valgrind --tool=memcheck ./my-executable
> (mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec to
> avoid conflict with system-wide mpiexec)
>
> it dies silently with no errors shown - just stops and says
> 2 additional processes aborted (not shown)
>
> Whether it fails depends on the number of grid points: for some small
> grid sizes (40x10x10) it runs fine. But the size at which I start getting
> problems is stupidly small (like 40x20x10), so it can't be an insufficient
> memory issue - the cluster server has 2 GB of memory and I can run my code
> in serial mode with at least 200x100x100.
>
> Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to compile
> the library, but I've tried different compiler combinations (g95-gcc,
> ifort-gcc4.1) - same result every time. As far as I can tell, it's not an
> error in my code either: I've done numerous checks, and it runs fine on my
> PC, though on my PC I compiled the library with ifort and icc.
> And here comes the weirdest part - if I run my code through valgrind in
> MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it runs
> fine with grid sizes it fails on without valgrind!!! mpiexec never exits,
> but the program does get to the last statement of my code.
>
> I am attaching config.log and ompi_info.log.
> The following is the output of mpiexec -d -np 4 ./model-0.0.9:
>
> [obelix:08876] procdir: (null)
> [obelix:08876] jobdir: (null)
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe
> [obelix:08876] top: openmpi-sessions-victor_at_obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] connect_uni: contact info read
> [obelix:08876] connect_uni: connection not allowed
> [obelix:08876] [0,0,0] setting up session dir with
> [obelix:08876] tmpdir /tmp
> [obelix:08876] universe default-universe-8876
> [obelix:08876] user victor
> [obelix:08876] host obelix
> [obelix:08876] jobid 0
> [obelix:08876] procid 0
> [obelix:08876] procdir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0/0
> [obelix:08876] jobdir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0
> [obelix:08876] unidir:
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876
> [obelix:08876] top: openmpi-sessions-victor_at_obelix_0
> [obelix:08876] tmp: /tmp
> [obelix:08876] [0,0,0] contact_file
> /tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/universe-setup.txt
> [obelix:08876] [0,0,0] wrote setup file
> [obelix:08876] pls:rsh: local csh: 0, local bash: 1
> [obelix:08876] pls:rsh: assuming same remote shell as local shell
> [obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
> [obelix:08876] pls:rsh: final template argv:
> [obelix:08876] pls:rsh: /usr/bin/ssh <template> orted --debug
> --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> <template> --universe victor_at_obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 0
> [obelix:08876] pls:rsh: launching on node localhost
> [obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
> 1 (1 4)
> [obelix:08876] pls:rsh: localhost is a LOCAL node
> [obelix:08876] pls:rsh: changing to directory /home/victor
> [obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
> 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
> victor_at_obelix:default-universe-8876 --nsreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
> "0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
> --mpi-call-yield 1
> [obelix:08877] [0,0,1] setting up session dir with
> [obelix:08877] universe default-universe-8876
> [obelix:08877] user victor
> [obelix:08877] host localhost
> [obelix:08877] jobid 0
> [obelix:08877] procid 1
> [obelix:08877] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0/1
> [obelix:08877] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0
> [obelix:08877] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08877] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08877] tmp: /tmp
> [obelix:08878] [0,1,0] setting up session dir with
> [obelix:08878] universe default-universe-8876
> [obelix:08878] user victor
> [obelix:08878] host localhost
> [obelix:08878] jobid 1
> [obelix:08878] procid 0
> [obelix:08878] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/0
> [obelix:08878] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08878] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08878] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08878] tmp: /tmp
> [obelix:08879] [0,1,1] setting up session dir with
> [obelix:08879] universe default-universe-8876
> [obelix:08879] user victor
> [obelix:08879] host localhost
> [obelix:08879] jobid 1
> [obelix:08879] procid 1
> [obelix:08879] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/1
> [obelix:08879] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08879] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08879] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08879] tmp: /tmp
> [obelix:08880] [0,1,2] setting up session dir with
> [obelix:08880] universe default-universe-8876
> [obelix:08880] user victor
> [obelix:08880] host localhost
> [obelix:08880] jobid 1
> [obelix:08880] procid 2
> [obelix:08880] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/2
> [obelix:08880] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08880] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08880] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08880] tmp: /tmp
> [obelix:08881] [0,1,3] setting up session dir with
> [obelix:08881] universe default-universe-8876
> [obelix:08881] user victor
> [obelix:08881] host localhost
> [obelix:08881] jobid 1
> [obelix:08881] procid 3
> [obelix:08881] procdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/3
> [obelix:08881] jobdir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
> [obelix:08881] unidir:
> /tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
> [obelix:08881] top: openmpi-sessions-victor_at_localhost_0
> [obelix:08881] tmp: /tmp
> [obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
> [obelix:08876] Info: Setting up debugger process table for applications
> MPIR_being_debugged = 0
> MPIR_debug_gate = 0
> MPIR_debug_state = 1
> MPIR_acquired_pre_main = 0
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 4
> MPIR_proctable:
> (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
> (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
> (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
> (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
> [obelix:08878] [0,1,0] ompi_mpi_init completed
> [obelix:08879] [0,1,1] ompi_mpi_init completed
> [obelix:08880] [0,1,2] ompi_mpi_init completed
> [obelix:08881] [0,1,3] ompi_mpi_init completed
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_ABORTED)
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] orted: job_state_callback(jobid = 1, state =
> ORTE_PROC_STATE_TERMINATED)
> [obelix:08877] sess_dir_finalize: job session dir not empty - leaving
> [obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
> [obelix:08877] sess_dir_finalize: found job session dir empty - deleting
> [obelix:08877] sess_dir_finalize: univ session dir not empty - leaving
>
> Thank you,
> Victor Prosolin.
>
> Open MPI: 1.1.2
> Open MPI SVN revision: r12073
> Open RTE: 1.1.2
> Open RTE SVN revision: r12073
> OPAL: 1.1.2
> OPAL SVN revision: r12073
> Prefix: /home/victor/programs
> Configured architecture: i686-pc-linux-gnu
> Configured by: victor
> Configured on: Thu Nov 16 13:06:12 MST 2006
> Configure host: obelix
> Built by: victor
> Built on: Thu Nov 16 13:42:40 MST 2006
> Built host: obelix
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: distcc
> C compiler absolute: /home/victor/programs/bin/distcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: ifort
> Fortran77 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> Fortran90 compiler: ifort
> Fortran90 compiler abs: /opt/intel/fc/9.1.037/bin/ifort
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: poe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: fork (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: env (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: seed (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1.2)
> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1.2)
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users