Open MPI User's Mailing List Archives

From: Victor Prosolin (victor.prosolin_at_[hidden])
Date: 2006-11-16 16:51:23


Hi all.
I have been fighting with this problem for weeks now, and I am getting
quite desperate about it. Hope I can get help here, because local folks
couldn't help me.

There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
(Debian 1:3.3.4-13) (some more info at http://www.capca.ucalgary.ca).
They have some MPI libraries (LAM, I believe) installed, but since they
don't support Fortran 90, I compile my own copy of Open MPI and install
it in my home directory, /home/victor/programs. I configure with the
following options:

F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
--prefix=/home/victor/programs --enable-pretty-print-stacktrace
--config-cache --disable-shared --enable-static

It compiles and installs with no errors. But when I run my code using
mpiexec1 -np 4 ./my-executable
(mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec, to
avoid a conflict with the system-wide mpiexec)

it dies silently with no errors shown - just stops and says
2 additional processes aborted (not shown)

Whether it fails depends on the number of grid points: for some
small grid sizes (40x10x10) it runs fine. But the size at which I
start getting problems is stupidly small (like 40x20x10), so it can't be
an insufficient-memory issue - the cluster server has 2 GB of memory and
I can run my code in serial mode with at least 200x100x100.
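
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes 8-byte reals and counts only a single 3-D array; both are
assumptions for illustration (the model holds several arrays), and
bytes_per_array is just a hypothetical helper name:

```shell
# Bytes needed for one nx*ny*nz array of 8-byte reals (assumed precision).
bytes_per_array() {
    echo $(( $1 * $2 * $3 * 8 ))
}

bytes_per_array 40 20 10      # grid size that fails in parallel
bytes_per_array 200 100 100   # grid size that works in serial
```

Under those assumptions, one array at the failing size is several hundred
times smaller than one at the size that already runs in serial mode.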

Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to
compile the library, but I've tried different compilers (g95-gcc,
ifort-gcc4.1) - same result every time. As far as I can tell, it's not
an error in my code either, because I've done numerous checks and it
also runs fine on my PC, though on my PC I compiled the library with
ifort and icc.
And here comes the weirdest part - if I run my code through valgrind in
MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it
runs fine with the grid sizes it fails on without valgrind! It doesn't
exit mpiexec, but it does get to the last statement of my code.

I am attaching config.log and ompi_info.log.
The following is the output of mpiexec -d -np 4 ./model-0.0.9:

[obelix:08876] procdir: (null)
[obelix:08876] jobdir: (null)
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe
[obelix:08876] top: openmpi-sessions-victor_at_obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] connect_uni: contact info read
[obelix:08876] connect_uni: connection not allowed
[obelix:08876] [0,0,0] setting up session dir with
[obelix:08876] tmpdir /tmp
[obelix:08876] universe default-universe-8876
[obelix:08876] user victor
[obelix:08876] host obelix
[obelix:08876] jobid 0
[obelix:08876] procid 0
[obelix:08876] procdir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0/0
[obelix:08876] jobdir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876
[obelix:08876] top: openmpi-sessions-victor_at_obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] [0,0,0] contact_file
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/universe-setup.txt
[obelix:08876] [0,0,0] wrote setup file
[obelix:08876] pls:rsh: local csh: 0, local bash: 1
[obelix:08876] pls:rsh: assuming same remote shell as local shell
[obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
[obelix:08876] pls:rsh: final template argv:
[obelix:08876] pls:rsh: /usr/bin/ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe victor_at_obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 0
[obelix:08876] pls:rsh: launching on node localhost
[obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
1 (1 4)
[obelix:08876] pls:rsh: localhost is a LOCAL node
[obelix:08876] pls:rsh: changing to directory /home/victor
[obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
victor_at_obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 1
[obelix:08877] [0,0,1] setting up session dir with
[obelix:08877] universe default-universe-8876
[obelix:08877] user victor
[obelix:08877] host localhost
[obelix:08877] jobid 0
[obelix:08877] procid 1
[obelix:08877] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0/1
[obelix:08877] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0
[obelix:08877] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08877] top: openmpi-sessions-victor_at_localhost_0
[obelix:08877] tmp: /tmp
[obelix:08878] [0,1,0] setting up session dir with
[obelix:08878] universe default-universe-8876
[obelix:08878] user victor
[obelix:08878] host localhost
[obelix:08878] jobid 1
[obelix:08878] procid 0
[obelix:08878] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/0
[obelix:08878] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08878] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08878] top: openmpi-sessions-victor_at_localhost_0
[obelix:08878] tmp: /tmp
[obelix:08879] [0,1,1] setting up session dir with
[obelix:08879] universe default-universe-8876
[obelix:08879] user victor
[obelix:08879] host localhost
[obelix:08879] jobid 1
[obelix:08879] procid 1
[obelix:08879] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/1
[obelix:08879] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08879] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08879] top: openmpi-sessions-victor_at_localhost_0
[obelix:08879] tmp: /tmp
[obelix:08880] [0,1,2] setting up session dir with
[obelix:08880] universe default-universe-8876
[obelix:08880] user victor
[obelix:08880] host localhost
[obelix:08880] jobid 1
[obelix:08880] procid 2
[obelix:08880] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/2
[obelix:08880] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08880] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08880] top: openmpi-sessions-victor_at_localhost_0
[obelix:08880] tmp: /tmp
[obelix:08881] [0,1,3] setting up session dir with
[obelix:08881] universe default-universe-8876
[obelix:08881] user victor
[obelix:08881] host localhost
[obelix:08881] jobid 1
[obelix:08881] procid 3
[obelix:08881] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/3
[obelix:08881] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08881] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08881] top: openmpi-sessions-victor_at_localhost_0
[obelix:08881] tmp: /tmp
[obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
[obelix:08876] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
    (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
    (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
    (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
[obelix:08878] [0,1,0] ompi_mpi_init completed
[obelix:08879] [0,1,1] ompi_mpi_init completed
[obelix:08880] [0,1,2] ompi_mpi_init completed
[obelix:08881] [0,1,3] ompi_mpi_init completed
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: found job session dir empty - deleting
[obelix:08877] sess_dir_finalize: univ session dir not empty - leaving

Thank you,
Victor Prosolin.