Hello Brian (and all),

Well, the joy was short-lived. On a 12-CPU Enterprise machine and on a 4-CPU one, I can start up to 4 processes. Above 4, I inevitably get BUS_ADRALN (bus errors from misaligned memory accesses). Below are traces of the failing runs, a detailed run (mpirun -d) of one of these situations, and the ompi_info output. Obviously, don't hesitate to ask if more information is required.
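
For context: si_code BUS_ADRALN means the SIGBUS was raised by a misaligned memory access, which SPARC hardware does not tolerate. A minimal sketch of that class of fault (a hypothetical example, independent of Open MPI and of my mandelbrot-mpi code, not taken from either) would be:

/* Hypothetical sketch: a deliberately misaligned store of the kind
 * that raises SIGBUS with si_code BUS_ADRALN on SPARC. The same
 * code usually runs fine on x86, where the hardware fixes up
 * unaligned accesses. */
#include <stdio.h>

int main(void)
{
    char buf[16];
    int *p = (int *)(buf + 1);  /* buf + 1 is almost certainly not
                                   4-byte aligned */
    *p = 42;                    /* misaligned store -> SIGBUS /
                                   BUS_ADRALN on SPARC */
    printf("%d\n", *p);
    return 0;
}

I mention this only to clarify what the signal means; I don't yet know where the misaligned access actually happens in the Open MPI / application stack.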

Build version: openmpi-1.1b5r10421

Config parameters:

Open MPI config.status 1.1b5
configured by ./configure, generated by GNU Autoconf 2.59,
with options "'--cache-file=config.cache' 'CFLAGS=-mcpu=v9' 'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9' '--prefix=/export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u' --enable-ltdl-convenience"

The traces:

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2f4f04
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b354c
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 6 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b1ecc
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 4 mandelbrot-mpi 100 400 400
maxiter = 100, width = 400, height = 400
execution time in seconds = 1.48
Type q to quit the program; otherwise, it refreshes
q

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

I also got the same behaviour on a different machine with the exact same code base ($HOME is an NFS mount) and the same hardware, but limited to 4 CPUs. The following is a debug run of such a failing execution:

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -d -v -np 5 mandelbrot-mpi 100 400 400
[enterprise:24786] [0,0,0] setting up session dir with
[enterprise:24786] universe default-universe
[enterprise:24786] user sshd
[enterprise:24786] host enterprise
[enterprise:24786] jobid 0
[enterprise:24786] procid 0
[enterprise:24786] procdir: /tmp/openmpi-sessions-sshd@enterprise_0/default-universe/0/0
[enterprise:24786] jobdir: /tmp/openmpi-sessions-sshd@enterprise_0/default-universe/0
[enterprise:24786] unidir: /tmp/openmpi-sessions-sshd@enterprise_0/default-universe
[enterprise:24786] top: openmpi-sessions-sshd@enterprise_0
[enterprise:24786] tmp: /tmp
[enterprise:24786] [0,0,0] contact_file /tmp/openmpi-sessions-sshd@enterprise_0/default-universe/universe-setup.txt
[enterprise:24786] [0,0,0] wrote setup file
[enterprise:24786] pls:rsh: local csh: 0, local bash: 0
[enterprise:24786] pls:rsh: assuming same remote shell as local shell
[enterprise:24786] pls:rsh: remote csh: 0, remote bash: 0
[enterprise:24786] pls:rsh: final template argv:
[enterprise:24786] pls:rsh: /usr/local/bin/ssh <template> ( ! [ -e ./.profile ] || . ./.profile; orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe sshd@enterprise:default-universe --nsreplica "0.0.0;tcp://10.45.117.37:40236" --gprreplica "0.0.0;tcp://10.45.117.37:40236" --mpi-call-yield 0 )
[enterprise:24786] pls:rsh: launching on node localhost
[enterprise:24786] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 5)
[enterprise:24786] pls:rsh: localhost is a LOCAL node
[enterprise:24786] pls:rsh: reset PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/bin:/bin:/usr/local/bin:/usr/bin:/usr/sbin:/usr/ccs/bin:/usr/dt/bin:/usr/local/lam-mpi/7.1.1/bin:/export/lca/appl/Forte/SUNWspro/WS6U2/bin:/opt/sfw/bin:/usr/bin:/usr/ucb:/etc:/usr/local/bin:.
[enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:/usr/local/lam-mpi/7.1.1/lib:/opt/sfw/lib
[enterprise:24786] pls:rsh: changing to directory /export/lca/home/lca0/etudiants/ac38820
[enterprise:24786] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe sshd@enterprise:default-universe --nsreplica "0.0.0;tcp://10.45.117.37:40236" --gprreplica "0.0.0;tcp://10.45.117.37:40236" --mpi-call-yield 1
[enterprise:24787] [0,0,1] setting up session dir with
[enterprise:24787] universe default-universe
[enterprise:24787] user sshd
[enterprise:24787] host localhost
[enterprise:24787] jobid 0
[enterprise:24787] procid 1
[enterprise:24787] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/0/1
[enterprise:24787] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/0
[enterprise:24787] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24787] top: openmpi-sessions-sshd@localhost_0
[enterprise:24787] tmp: /tmp
[enterprise:24789] [0,1,0] setting up session dir with
[enterprise:24789] universe default-universe
[enterprise:24789] user sshd
[enterprise:24789] host localhost
[enterprise:24789] jobid 1
[enterprise:24789] procid 0
[enterprise:24789] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1/0
[enterprise:24789] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1
[enterprise:24789] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24789] top: openmpi-sessions-sshd@localhost_0
[enterprise:24789] tmp: /tmp
[enterprise:24791] [0,1,1] setting up session dir with
[enterprise:24791] universe default-universe
[enterprise:24791] user sshd
[enterprise:24791] host localhost
[enterprise:24791] jobid 1
[enterprise:24791] procid 1
[enterprise:24791] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1/1
[enterprise:24791] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1
[enterprise:24791] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24791] top: openmpi-sessions-sshd@localhost_0
[enterprise:24791] tmp: /tmp
[enterprise:24793] [0,1,2] setting up session dir with
[enterprise:24793] universe default-universe
[enterprise:24793] user sshd
[enterprise:24793] host localhost
[enterprise:24793] jobid 1
[enterprise:24793] procid 2
[enterprise:24793] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1/2
[enterprise:24793] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1
[enterprise:24793] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24793] top: openmpi-sessions-sshd@localhost_0
[enterprise:24793] tmp: /tmp
[enterprise:24795] [0,1,3] setting up session dir with
[enterprise:24795] universe default-universe
[enterprise:24795] user sshd
[enterprise:24795] host localhost
[enterprise:24795] jobid 1
[enterprise:24795] procid 3
[enterprise:24795] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1/3
[enterprise:24795] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1
[enterprise:24795] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24795] top: openmpi-sessions-sshd@localhost_0
[enterprise:24795] tmp: /tmp
[enterprise:24797] [0,1,4] setting up session dir with
[enterprise:24797] universe default-universe
[enterprise:24797] user sshd
[enterprise:24797] host localhost
[enterprise:24797] jobid 1
[enterprise:24797] procid 4
[enterprise:24797] procdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1/4
[enterprise:24797] jobdir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe/1
[enterprise:24797] unidir: /tmp/openmpi-sessions-sshd@localhost_0/default-universe
[enterprise:24797] top: openmpi-sessions-sshd@localhost_0
[enterprise:24797] tmp: /tmp
[enterprise:24786] spawn: in job_state_callback(jobid = 1, state = 0x4)
[enterprise:24786] Info: Setting up debugger process table for applications
MPIR_being_debugged = 0
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 5
MPIR_proctable:
(i, host, exe, pid) = (0, localhost, mandelbrot-mpi, 24789)
(i, host, exe, pid) = (1, localhost, mandelbrot-mpi, 24791)
(i, host, exe, pid) = (2, localhost, mandelbrot-mpi, 24793)
(i, host, exe, pid) = (3, localhost, mandelbrot-mpi, 24795)
(i, host, exe, pid) = (4, localhost, mandelbrot-mpi, 24797)
[enterprise:24789] [0,1,0] ompi_mpi_init completed
[enterprise:24791] [0,1,1] ompi_mpi_init completed
[enterprise:24793] [0,1,2] ompi_mpi_init completed
[enterprise:24795] [0,1,3] ompi_mpi_init completed
[enterprise:24797] [0,1,4] ompi_mpi_init completed
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

[enterprise:24787] sess_dir_finalize: found proc session dir empty - deleting
[enterprise:24787] sess_dir_finalize: job session dir not empty - leaving
[enterprise:24787] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[enterprise:24787] sess_dir_finalize: found job session dir empty - deleting
[enterprise:24787] sess_dir_finalize: univ session dir not empty - leaving
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24789
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24791
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24793
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24795
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24797
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24789
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24791
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24793
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24795
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: enterprise
PID: 24797
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] sess_dir_finalize: proc session dir not empty - leaving
[enterprise:24787] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[enterprise:24787] sess_dir_finalize: found proc session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found job session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found univ session dir empty - deleting
[enterprise:24787] sess_dir_finalize: found top session dir empty - deleting

ompi_info output:

sshd@enterprise ~ $ ~/openmpi_sun4u/bin/ompi_info
Open MPI: 1.1b5r10421
Open MPI SVN revision: r10421
Open RTE: 1.1b5r10421
Open RTE SVN revision: r10421
OPAL: 1.1b5r10421
OPAL SVN revision: r10421
Prefix: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u
Configured architecture: sparc-sun-solaris2.8
Configured by: sshd
Configured on: Tue Jun 20 15:25:44 EDT 2006
Configure host: averoes
Built by: ac38820
Built on: Tue Jun 20 15:59:47 EDT 2006
Built host: averoes
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: no
Fortran90 bindings size: na
C compiler: gcc
C compiler absolute: /usr/local/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/local/bin/g++
Fortran77 compiler: g77
Fortran77 compiler abs: /usr/local/bin/g77
Fortran90 compiler: f90
Fortran90 compiler abs: /export/lca/appl/Forte/SUNWspro/WS6U2/bin/f90
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: no
C++ exceptions: no
Thread support: solaris (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.1)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
MCA timer: solaris (MCA v1.0, API v1.0, Component v1.1)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
MCA pml: dr (MCA v1.0, API v1.0, Component v1.1)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)
MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1)
MCA sds: env (MCA v1.0, API v1.0, Component v1.1)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.1)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1)

On Tuesday, June 20, 2006, at 17:06, Eric Thibodeau wrote:

> Thanks for the pointer, it WORKS!! (yay)
>
> On Tuesday, June 20, 2006, at 12:21, Brian Barrett wrote:
> > On Jun 19, 2006, at 12:15 PM, Eric Thibodeau wrote:
> > > I checked the thread with the same title as this e-mail and tried
> > > compiling openmpi-1.1b4r10418 with:
> > >
> > > ./configure CFLAGS="-mv8plus" CXXFLAGS="-mv8plus" FFLAGS="-mv8plus"
> > > FCFLAGS="-mv8plus" --prefix=$HOME/openmpi-SUN-`uname -r` --enable-
> > > pretty-print-stacktrace
> >
> > I put the incorrect flags in the error message - can you try again with:
> >
> > ./configure CFLAGS=-mcpu=v9 CXXFLAGS=-mcpu=v9 FFLAGS=-mcpu=v9
> > FCFLAGS=-mcpu=v9 --prefix=$HOME/openmpi-SUN-`uname -r` --enable-
> > pretty-print-stacktrace
> >
> > and see if that helps? By the way, I'm not sure if Solaris has the
> > required support for the pretty-print stack trace feature. It likely
> > will print what signal caused the error, but will not actually print
> > the stack trace. It's enabled by default on Solaris, with this
> > limited functionality (the option exists for platforms that have
> > broken half-support for GNU libc's stack trace feature, and for users
> > that don't like us registering a signal handler to do the work).
> >
> > Brian

--
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517