Open MPI User's Mailing List Archives

From: Kevin Radican (radicak_at_[hidden])
Date: 2006-11-02 07:52:25


Hi,

I have a SEGV problem with ScaLAPACK. The same configuration works fine with
MPICH, but I seem to get much better performance with Open MPI on this machine.
I have attached the log and the slmake.inc I am using. I have the same problem
with other programs that call the routine that xcdblu uses. It seems to occur
when the number of processors doesn't match the number of diagonals for the
case of bwl = 15. If I choose -np 15 it just seems to hang; however, if I use
mpirun --mca mpi_paffinity_alone 1 -np 15 xcdblu it crashes too.

Any help would be appreciated.

Regards,
Kevin

> mpirun -np 6 xcdblu
SCALAPACK banded linear systems.
'MPI machine'

Tests of the parallel complex single precision band matrix solve
The following scaled residual checks will be computed:
 Solve residual = ||Ax - b|| / (||x|| * ||A|| * eps * N)
 Factorization residual = ||A - LU|| / (||A|| * eps * N)
The matrix A is randomly generated for each test.

An explanation of the input/output parameters follows:
TIME : Indicates whether WALL or CPU time was used.
N : The number of rows and columns in the matrix A.
bwl, bwu : The number of diagonals in the matrix A.
NB : The size of the column panels the matrix A is split into. [-1 for default]
NRHS : The total number of RHS to solve for.
NBRHS : The number of RHS to be put on a column of processes before going
          on to the next column of processes.
P : The number of process rows.
Q : The number of process columns.
THRESH : If a residual value is less than THRESH, CHECK is flagged as PASSED
Fact time: Time in seconds to factor the matrix
Sol Time: Time in seconds to solve the system.
MFLOPS : Rate of execution for factor and solve using sequential operation count.
MFLOP2 : Rough estimate of speed using actual op count (accurate big P,N).

The following parameter values will be used:
  N : 3 5 17
  bwl : 1 3 15
  bwu : 1 1 4
  NB : -1
  NRHS : 4
  NBRHS: 1
  P : 1 1 1 1
  Q : 1 2 3 4

Relative machine precision (eps) is taken to be 0.596046E-07
Routines pass computational tests if scaled residual is less than 3.0000

TIME TR      N BWL BWU   NB  NRHS    P    Q L*U Time Slv Time   MFLOPS   MFLOP2  CHECK
---- -- ------ --- --- ---- ----- ---- ---- -------- -------- -------- -------- ------

WALL  N      3   1   1    3     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N      5   1   1    5     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N      5   3   1    5     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N     17   1   1   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N     17   3   1   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N     17  15   4   17     4    1    1    0.000   0.0000     0.00     0.00 PASSED
WALL  N      3   1   1    2     4    1    2    0.000   0.0000     0.00     0.00 PASSED
WALL  N      5   1   1    3     4    1    2    0.000   0.0000     0.00     0.00 PASSED
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x10
[0] func:/usr/local/lib/libopal.so.0 [0x2b0fdb4ee1c0]
[1] func:/lib64/libpthread.so.0 [0x2b0fdbe0d140]
[2] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_match+0x2ff) [0x2b0fde2a4d9f]
[3] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback+0xaf) [0x2b0fde2a5d8f]
[4] func:/usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x8c9) [0x2b0fde5b9e39]
[5] func:/usr/local/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x21) [0x2b0fde3aeff1]
[6] func:/usr/local/lib/libopal.so.0(opal_progress+0x4a) [0x2b0fdb4d9bfa]
[7] func:/usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0x265) [0x2b0fde2a2c75]
[8] func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_basic_linear+0x10b) [0x2b0fdebe544b]
[9] func:/usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x4d) [0x2b0fdebe25bd]
[10] func:/usr/local/lib/libmpi.so.0(ompi_comm_nextcid+0x209) [0x2b0fdb207c59]
[11] func:/usr/local/lib/libmpi.so.0(ompi_comm_create+0x8c) [0x2b0fdb206bcc]
[12] func:/usr/local/lib/libmpi.so.0(MPI_Comm_create+0x90) [0x2b0fdb22d890]
[13] func:/usr/local/lib/libmpi.so.0(pmpi_comm_create__+0x42) [0x2b0fdb2491b2]
[14] func:xcdblu(BI_TransUserComm+0xef) [0x46797f]
[15] func:xcdblu(Cblacs_gridmap+0x13a) [0x463e3a]
[16] func:xcdblu(Creshape+0x17c) [0x42365c]
[17] func:xcdblu(pcdbtrf_+0x5d9) [0x42df35]
[18] func:xcdblu(MAIN__+0x190c) [0x417a0c]
[19] func:xcdblu(main+0x32) [0x4160ea]
[20] func:/lib64/libc.so.6(__libc_start_main+0xf4) [0x2b0fdbf34154]
[21] func:xcdblu [0x416029]
*** End of error message ***
1 additional process aborted (not shown)
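
For reference, the PASSED/threshold check that the test output describes can be sketched in a few lines. This is only an illustration of the solve-residual formula printed above, not the ScaLAPACK tester's actual code: the function names are made up, the choice of the 1-norm is an assumption (the log does not say which norm is used), and the "FAILED" label is a placeholder. The eps value 2**-24 matches the 0.596046E-07 reported in the log.

```python
# Sketch of the scaled solve-residual check described in the test output:
#   ||Ax - b|| / (||x|| * ||A|| * eps * N) < THRESH
# Assumptions: 1-norms throughout; helper names are hypothetical.

EPS = 2.0 ** -24   # single-precision eps, printed in the log as 0.596046E-07
THRESH = 3.0       # threshold printed in the log


def vec_norm1(v):
    """Sum of absolute values of a (possibly complex) vector."""
    return sum(abs(z) for z in v)


def mat_norm1(a):
    """Maximum absolute column sum of a dense matrix given as a list of rows."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def solve_residual(a, x, b, eps=EPS):
    """Scaled residual ||Ax - b|| / (||x|| * ||A|| * eps * N)."""
    n = len(a)
    r = [axi - bi for axi, bi in zip(matvec(a, x), b)]
    return vec_norm1(r) / (vec_norm1(x) * mat_norm1(a) * eps * n)


def check(a, x, b, thresh=THRESH):
    """Return "PASSED" if the scaled residual is below thresh ("FAILED" is a placeholder label)."""
    return "PASSED" if solve_residual(a, x, b) < thresh else "FAILED"


# Small complex tridiagonal example (bwl = bwu = 1, N = 3).
a = [[4 + 1j, 1, 0],
     [1, 4 + 1j, 1],
     [0, 1, 4 + 1j]]
x = [1, 2, 3]
b = matvec(a, x)            # right-hand side built from the exact solution
x_bad = [1.001, 2, 3]       # deliberately perturbed solution

print(check(a, x, b))       # exact solution: residual is zero
print(check(a, x_bad, b))   # perturbed solution: residual far above THRESH
```

Because the residual is divided by eps * N, an error of one part in a thousand in a single entry of x is already hundreds of times over the threshold in single precision.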