Open MPI User's Mailing List Archives

From: Eric Thibodeau (kyron_at_[hidden])
Date: 2006-06-20 20:35:55


Hello Brian (and all),

        Well, the joy was short-lived. On a 12-CPU Enterprise machine and on a 4-CPU one, I can start up to 4 processes; above 4, I inevitably get SIGBUS with si_code BUS_ADRALN (an unaligned-address bus error). Below are traces of the failing runs, a detailed (mpirun -d) log of one such failure, and the ompi_info output. Don't hesitate to ask if more information is required.
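
For context, BUS_ADRALN is the si_code Solaris/SPARC reports when a load or store uses an address that is not aligned to the operand size; SPARC is a strict-alignment architecture, unlike x86, which silently tolerates misaligned accesses. Below is a minimal, hypothetical C sketch of the kind of access that dies this way (illustrative only, not code from mandelbrot-mpi or Open MPI):

/*
 * Hypothetical illustration only; NOT taken from mandelbrot-mpi or Open MPI.
 * On SPARC, reading a 4-byte int through a pointer that is not 4-byte
 * aligned raises SIGBUS with si_code BUS_ADRALN.  The same code usually
 * runs without complaint on x86.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    volatile int *misaligned;
    int aligned_copy;

    /* Portable way to read unaligned bytes: memcpy into an aligned variable */
    memcpy(&aligned_copy, buf + 1, sizeof aligned_copy);
    printf("memcpy read: %d\n", aligned_copy);

    /* buf + 1 is not 4-byte aligned; volatile keeps the load in the binary */
    misaligned = (volatile int *)(buf + 1);

    /* On SPARC this dereference typically dies with SIGBUS / BUS_ADRALN */
    printf("direct read: %d\n", *misaligned);

    return 0;
}

The same pattern (casting a byte buffer to a wider type) runs unmodified on x86, which is why this class of bug tends to show up only on SPARC.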

Build version: openmpi-1.1b5r10421
Config parameters:
Open MPI config.status 1.1b5
configured by ./configure, generated by GNU Autoconf 2.59,
  with options \"'--cache-file=config.cache' 'CFLAGS=-mcpu=v9' 'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9' '--prefix=/export/lca/home/lca0/etudiants/ac38820/openmp
i_sun4u' --enable-ltdl-convenience\"

The traces:
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2f4f04
*** End of error message ***
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b354c
*** End of error message ***
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 6 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b1ecc
*** End of error message ***
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 4 mandelbrot-mpi 100 400 400
maxiter = 100, width = 400, height = 400
execution time in seconds = 1.48
Press q to quit the program, otherwise it does a refresh
q
sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

I also got the same behaviour on a different machine with the same hardware but limited to 4 CPUs (the code base is identical; $HOME is an NFS mount). The following is a debug run of one such failing execution:

sshd_at_enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -d -v -np 5 mandelbrot-mpi 100 400 400
[enterprise:24786] [0,0,0] setting up session dir with
[enterprise:24786] universe default-universe
[enterprise:24786] user sshd
[enterprise:24786] host enterprise
[enterprise:24786] jobid 0
[enterprise:24786] procid 0
[enterprise:24786] procdir: /tmp/openmpi-sessions-sshd_at_enterprise_0/default-universe/0/0
[enterprise:24786] jobdir: /tmp/openmpi-sessions-sshd_at_enterprise_0/default-universe/0
[enterprise:24786] unidir: /tmp/openmpi-sessions-sshd_at_enterprise_0/default-universe
[enterprise:24786] top: openmpi-sessions-sshd_at_enterprise_0
[enterprise:24786] tmp: /tmp
[enterprise:24786] [0,0,0] contact_file /tmp/openmpi-sessions-sshd_at_enterprise_0/default-universe/universe-setup.txt
[enterprise:24786] [0,0,0] wrote setup file
[enterprise:24786] pls:rsh: local csh: 0, local bash: 0
[enterprise:24786] pls:rsh: assuming same remote shell as local shell
[enterprise:24786] pls:rsh: remote csh: 0, remote bash: 0
[enterprise:24786] pls:rsh: final template argv:
[enterprise:24786] pls:rsh: /usr/local/bin/ssh <template> ( ! [ -e ./.profile ] || . ./.profile; orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe sshd_at_enterprise:default-universe --nsreplica "0.0.0;tcp://10.45.117.37:40236" --gprreplica "0.0.0;tcp://10.45.117.37:40236" --mpi-call-yield 0 )
[enterprise:24786] pls:rsh: launching on node localhost
[enterprise:24786] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 5)
[enterprise:24786] pls:rsh: localhost is a LOCAL node
[enterprise:24786] pls:rsh: reset PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/bin:/bin:/usr/local/bin:/usr/bin:/usr/sbin:/usr/ccs/bin:/usr/dt/bin:/usr/local/lam-mpi/7.1.1/bin:/export/lca/appl/Forte/SUNWspro/WS6U2/bin:/opt/sfw/bin:/usr/bin:/usr/ucb:/etc:/usr/local/bin:.
[enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:/usr/local/lam-mpi/7.1.1/lib:/opt/sfw/lib
[enterprise:24786] pls:rsh: changing to directory /export/lca/home/lca0/etudiants/ac38820
[enterprise:24786] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe sshd_at_enterprise:default-universe --nsreplica "0.0.0;tcp://10.45.117.37:40236" --gprreplica "0.0.0;tcp://10.45.117.37:40236" --mpi-call-yield 1
[enterprise:24787] [0,0,1] setting up session dir with
[enterprise:24787] universe default-universe
[enterprise:24787] user sshd
[enterprise:24787] host localhost
[enterprise:24787] jobid 0
[enterprise:24787] procid 1
[enterprise:24787] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/0/1
[enterprise:24787] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/0
[enterprise:24787] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24787] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24787] tmp: /tmp
[enterprise:24789] [0,1,0] setting up session dir with
[enterprise:24789] universe default-universe
[enterprise:24789] user sshd
[enterprise:24789] host localhost
[enterprise:24789] jobid 1
[enterprise:24789] procid 0
[enterprise:24789] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1/0
[enterprise:24789] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1
[enterprise:24789] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24789] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24789] tmp: /tmp
[enterprise:24791] [0,1,1] setting up session dir with
[enterprise:24791] universe default-universe
[enterprise:24791] user sshd
[enterprise:24791] host localhost
[enterprise:24791] jobid 1
[enterprise:24791] procid 1
[enterprise:24791] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1/1
[enterprise:24791] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1
[enterprise:24791] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24791] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24791] tmp: /tmp
[enterprise:24793] [0,1,2] setting up session dir with
[enterprise:24793] universe default-universe
[enterprise:24793] user sshd
[enterprise:24793] host localhost
[enterprise:24793] jobid 1
[enterprise:24793] procid 2
[enterprise:24793] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1/2
[enterprise:24793] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1
[enterprise:24793] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24793] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24793] tmp: /tmp
[enterprise:24795] [0,1,3] setting up session dir with
[enterprise:24795] universe default-universe
[enterprise:24795] user sshd
[enterprise:24795] host localhost
[enterprise:24795] jobid 1
[enterprise:24795] procid 3
[enterprise:24795] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1/3
[enterprise:24795] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1
[enterprise:24795] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24795] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24795] tmp: /tmp
[enterprise:24797] [0,1,4] setting up session dir with
[enterprise:24797] universe default-universe
[enterprise:24797] user sshd
[enterprise:24797] host localhost
[enterprise:24797] jobid 1
[enterprise:24797] procid 4
[enterprise:24797] procdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1/4
[enterprise:24797] jobdir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe/1
[enterprise:24797] unidir: /tmp/openmpi-sessions-sshd_at_localhost_0/default-universe
[enterprise:24797] top: openmpi-sessions-sshd_at_localhost_0
[enterprise:24797] tmp: /tmp
[enterprise:24786] spawn: in job_state_callback(jobid = 1, state = 0x4)
[enterprise:24786] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 5
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, mandelbrot-mpi, 24789)
    (i, host, exe, pid) = (1, localhost, mandelbrot-mpi, 24791)
    (i, host, exe, pid) = (2, localhost, mandelbrot-mpi, 24793)
    (i, host, exe, pid) = (3, localhost, mandelbrot-mpi, 24795)
    (i, host, exe, pid) = (4, localhost, mandelbrot-mpi, 24797)
[enterprise:24789] [0,1,0] ompi_mpi_init completed
[enterprise:24791] [0,1,1] ompi_mpi_init completed
[enterprise:24793] [0,1,2] ompi_mpi_init completed
[enterprise:24795] [0,1,3] ompi_mpi_init completed
[enterprise:24797] [0,1,4] ompi_mpi_init completed
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***
[enterprise:24787] sess_dir_finalize: found proc session dir empty - deleting
[enterprise:24787] sess_dir_finalize: job session dir not empty - leaving
[enterprise:24787] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[enterprise:24787] sess_dir_finalize: found job session dir empty - deleting
[enterprise:24787] sess_dir_finalize: univ session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24789

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24791

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24793

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24795

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24797

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24789

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24791

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24793

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24795

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: enterprise
PID: 24797

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[enterprise:24787] sess_dir_finalize: found proc session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found job session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found univ session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found top session dir empty - deleting

ompi_info output:
sshd_at_enterprise ~ $ ~/openmpi_sun4u/bin/ompi_info
                Open MPI: 1.1b5r10421
   Open MPI SVN revision: r10421
                Open RTE: 1.1b5r10421
   Open RTE SVN revision: r10421
                    OPAL: 1.1b5r10421
       OPAL SVN revision: r10421
                  Prefix: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u
 Configured architecture: sparc-sun-solaris2.8
           Configured by: sshd
           Configured on: Tue Jun 20 15:25:44 EDT 2006
          Configure host: averoes
                Built by: ac38820
                Built on: Tue Jun 20 15:59:47 EDT 2006
              Built host: averoes
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: no
 Fortran90 bindings size: na
              C compiler: gcc
     C compiler absolute: /usr/local/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/local/bin/g++
      Fortran77 compiler: g77
  Fortran77 compiler abs: /usr/local/bin/g77
      Fortran90 compiler: f90
  Fortran90 compiler abs: /export/lca/appl/Forte/SUNWspro/WS6U2/bin/f90
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: no
          C++ exceptions: no
          Thread support: solaris (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
           MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.1)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
               MCA timer: solaris (MCA v1.0, API v1.0, Component v1.1)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
                MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
                 MCA pml: dr (MCA v1.0, API v1.0, Component v1.1)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
              MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
                  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
                  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
                 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
                 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
                 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
               MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
                MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
                MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
                 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.1)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.1)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1)

On Tuesday, June 20, 2006, at 17:06, Eric Thibodeau wrote:
> Thanks for the pointer, it WORKS!! (yay)
>
> On Tuesday, June 20, 2006, at 12:21, Brian Barrett wrote:
> > On Jun 19, 2006, at 12:15 PM, Eric Thibodeau wrote:
> >
> > > I checked the thread with the same title as this e-mail and tried
> > > compiling openmpi-1.1b4r10418 with:
> > >
> > > ./configure CFLAGS="-mv8plus" CXXFLAGS="-mv8plus" FFLAGS="-mv8plus"
> > > FCFLAGS="-mv8plus" --prefix=$HOME/openmpi-SUN-`uname -r` --enable-
> > > pretty-print-stacktrace
> > I put the incorrect flags in the error message - can you try again with:
> >
> >
> > ./configure CFLAGS=-mcpu=v9 CXXFLAGS=-mcpu=v9 FFLAGS=-mcpu=v9
> > FCFLAGS=-mcpu=v9 --prefix=$HOME/openmpi-SUN-`uname -r` --enable-
> > pretty-print-stacktrace
> >
> >
> > and see if that helps? By the way, I'm not sure if Solaris has the
> > required support for the pretty-print stack trace feature. It likely
> > will print what signal caused the error, but will not actually print
> > the stack trace. It's enabled by default on Solaris, with this
> > limited functionality (the option exists for platforms that have
> > broken half-support for GNU libc's stack trace feature, and for users
> > that don't like us registering a signal handler to do the work).
> >
> > Brian
> >
> >
>

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517