Open MPI User's Mailing List Archives

From: Victor Prosolin (victor.prosolin_at_[hidden])
Date: 2006-11-16 16:51:23


Hi all.
I have been fighting with this problem for weeks now, and I am getting
quite desperate about it. Hope I can get help here, because local folks
couldn't help me.

There is a cluster running Debian Linux - kernel 2.4, gcc version 3.3.4
(Debian 1:3.3.4-13) (some more info at http://www.capca.ucalgary.ca).
They have some MPI libraries (LAM, I believe) installed, but since they
don't support Fortran 90, I compile my own library. I install it in my
home directory, /home/victor/programs. I configure with the following
options:

F77=ifort FFLAGS='-O2' FC=ifort CC=distcc ./configure --enable-mpi-f90
--prefix=/home/victor/programs --enable-pretty-print-stacktrace
--config-cache --disable-shared --enable-static

It compiles and installs with no errors. But when I run my code using
mpiexec1 -np 4 ./my-executable
(mpiexec1 is a link pointing to /home/victor/programs/bin/mpiexec to
avoid a conflict with the system-wide mpiexec)

it dies silently with no errors shown - it just stops and says
2 additional processes aborted (not shown)

Whether it fails depends on the number of grid points, because for some
small grid sizes (40x10x10) it runs fine. But the size at which I
start getting problems is stupidly small (like 40x20x10), so it can't be
an insufficient memory issue - the cluster server has 2 GB of memory and
I can run my code in serial mode with at least 200x100x100.
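
For scale, the memory argument above can be put into rough numbers. This is only a back-of-the-envelope sketch: the 8-byte (double precision) value size and the count of 20 full-size 3D arrays are assumptions for illustration, not figures from the model itself, and grid_bytes is a hypothetical helper name.

```python
# Back-of-the-envelope memory estimate for a 3D grid.
# Assumptions (not stated in the post): 8-byte reals, and a
# hypothetical number of full-size 3D arrays held by the model.

BYTES_PER_VALUE = 8  # assumed double precision


def grid_bytes(nx, ny, nz, n_arrays=1):
    """Return the bytes needed for n_arrays arrays of shape nx x ny x nz."""
    return nx * ny * nz * n_arrays * BYTES_PER_VALUE


if __name__ == "__main__":
    # Grid sizes mentioned in the post: works, fails, works in serial.
    for dims in [(40, 10, 10), (40, 20, 10), (200, 100, 100)]:
        size = grid_bytes(*dims, n_arrays=20)  # 20 arrays is a guess
        print(dims, f"{size / 2**20:.2f} MiB")
```

Under these assumptions the failing 40x20x10 grid needs on the order of 1 MiB, while the 200x100x100 grid that runs in serial mode needs roughly 300 MiB, so total memory is an unlikely culprit.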

Mainly I use Intel Fortran and gcc (or distcc pointing to gcc) to
compile the library, but I've tried different compilers (g95-gcc,
ifort-gcc4.1) - same result every time. As far as I can tell, it's not
an error in my code either, because I've done numerous checks and it
also runs fine on my PC, though on my PC I compiled the library with
ifort and icc.
And here comes the weirdest part - if I run my code through valgrind in
MPI mode (mpiexec -np 4 valgrind --tool=memcheck ./my-executable), it
runs fine with the grid sizes it fails on without valgrind! It doesn't
exit mpiexec, but it does get to the last statement of my code.

I am attaching config.log and ompi_info.log.
The following is the output of mpiexec -d -np 4 ./model-0.0.9:

[obelix:08876] procdir: (null)
[obelix:08876] jobdir: (null)
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe
[obelix:08876] top: openmpi-sessions-victor_at_obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] connect_uni: contact info read
[obelix:08876] connect_uni: connection not allowed
[obelix:08876] [0,0,0] setting up session dir with
[obelix:08876] tmpdir /tmp
[obelix:08876] universe default-universe-8876
[obelix:08876] user victor
[obelix:08876] host obelix
[obelix:08876] jobid 0
[obelix:08876] procid 0
[obelix:08876] procdir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0/0
[obelix:08876] jobdir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/0
[obelix:08876] unidir:
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876
[obelix:08876] top: openmpi-sessions-victor_at_obelix_0
[obelix:08876] tmp: /tmp
[obelix:08876] [0,0,0] contact_file
/tmp/openmpi-sessions-victor_at_obelix_0/default-universe-8876/universe-setup.txt
[obelix:08876] [0,0,0] wrote setup file
[obelix:08876] pls:rsh: local csh: 0, local bash: 1
[obelix:08876] pls:rsh: assuming same remote shell as local shell
[obelix:08876] pls:rsh: remote csh: 0, remote bash: 1
[obelix:08876] pls:rsh: final template argv:
[obelix:08876] pls:rsh: /usr/bin/ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe victor_at_obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 0
[obelix:08876] pls:rsh: launching on node localhost
[obelix:08876] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to
1 (1 4)
[obelix:08876] pls:rsh: localhost is a LOCAL node
[obelix:08876] pls:rsh: changing to directory /home/victor
[obelix:08876] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
victor_at_obelix:default-universe-8876 --nsreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111" --gprreplica
"0.0.0;tcp://136.159.56.131:55111;tcp://192.168.1.1:55111"
--mpi-call-yield 1
[obelix:08877] [0,0,1] setting up session dir with
[obelix:08877] universe default-universe-8876
[obelix:08877] user victor
[obelix:08877] host localhost
[obelix:08877] jobid 0
[obelix:08877] procid 1
[obelix:08877] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0/1
[obelix:08877] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/0
[obelix:08877] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08877] top: openmpi-sessions-victor_at_localhost_0
[obelix:08877] tmp: /tmp
[obelix:08878] [0,1,0] setting up session dir with
[obelix:08878] universe default-universe-8876
[obelix:08878] user victor
[obelix:08878] host localhost
[obelix:08878] jobid 1
[obelix:08878] procid 0
[obelix:08878] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/0
[obelix:08878] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08878] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08878] top: openmpi-sessions-victor_at_localhost_0
[obelix:08878] tmp: /tmp
[obelix:08879] [0,1,1] setting up session dir with
[obelix:08879] universe default-universe-8876
[obelix:08879] user victor
[obelix:08879] host localhost
[obelix:08879] jobid 1
[obelix:08879] procid 1
[obelix:08879] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/1
[obelix:08879] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08879] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08879] top: openmpi-sessions-victor_at_localhost_0
[obelix:08879] tmp: /tmp
[obelix:08880] [0,1,2] setting up session dir with
[obelix:08880] universe default-universe-8876
[obelix:08880] user victor
[obelix:08880] host localhost
[obelix:08880] jobid 1
[obelix:08880] procid 2
[obelix:08880] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/2
[obelix:08880] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08880] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08880] top: openmpi-sessions-victor_at_localhost_0
[obelix:08880] tmp: /tmp
[obelix:08881] [0,1,3] setting up session dir with
[obelix:08881] universe default-universe-8876
[obelix:08881] user victor
[obelix:08881] host localhost
[obelix:08881] jobid 1
[obelix:08881] procid 3
[obelix:08881] procdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1/3
[obelix:08881] jobdir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876/1
[obelix:08881] unidir:
/tmp/openmpi-sessions-victor_at_localhost_0/default-universe-8876
[obelix:08881] top: openmpi-sessions-victor_at_localhost_0
[obelix:08881] tmp: /tmp
[obelix:08876] spawn: in job_state_callback(jobid = 1, state = 0x4)
[obelix:08876] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, ./model-0.0.9, 8878)
    (i, host, exe, pid) = (1, localhost, ./model-0.0.9, 8879)
    (i, host, exe, pid) = (2, localhost, ./model-0.0.9, 8880)
    (i, host, exe, pid) = (3, localhost, ./model-0.0.9, 8881)
[obelix:08878] [0,1,0] ompi_mpi_init completed
[obelix:08879] [0,1,1] ompi_mpi_init completed
[obelix:08880] [0,1,2] ompi_mpi_init completed
[obelix:08881] [0,1,3] ompi_mpi_init completed
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_ABORTED)
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[obelix:08877] sess_dir_finalize: job session dir not empty - leaving
[obelix:08877] sess_dir_finalize: found proc session dir empty - deleting
[obelix:08877] sess_dir_finalize: found job session dir empty - deleting
[obelix:08877] sess_dir_finalize: univ session dir not empty - leaving

Thank you,
Victor Prosolin.