
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Code Seg Faults in Devel Series
From: Doug Roberts (roberpj_at_[hidden])
Date: 2008-06-30 16:06:26


This is resolved ;) On our system, for releases after 1.3a1r18423
up to and including the latest release in the 1.4 trunk, configure
requires the --enable-mpi-threads option to be specified explicitly
for the cpi.c program to run successfully, as shown here:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 \
    --enable-mpi-threads --with-gm=/opt/gm

# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
Process 2 of 4 is on bru25
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.004372

Omitting the option yields the segfault shown before:

# ./configure --prefix=/opt/testing/openmpi/1.4a1r18770 --with-gm=/opt/gm

# mpirun -np 4 -machinefile ~/bruhosts a.out
Process 1 of 4 is on bru27
Process 3 of 4 is on bru27
Process 0 of 4 is on bru25
[bru25:30651] *** Process received signal ***
[bru25:30651] Signal: Segmentation fault (11)
[bru25:30651] Signal code: Address not mapped (1)
[bru25:30651] Failing at address: 0x9
Process 2 of 4 is on bru25
[bru25:30651] [ 0] /lib64/tls/libpthread.so.0 [0x2a95f7e420]
[bru25:30651] [ 1] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_btl_gm.so [0x2a97980fb9]
[bru25:30651] [ 2] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a97672c1d]
[bru25:30651] [ 3] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a97667753]
[bru25:30651] [ 4] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a9857db1c]
[bru25:30651] [ 5] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a9857de27]
[bru25:30651] [ 6] /opt/sharcnet/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a98573eec]
[bru25:30651] [ 7] /opt/sharcnet/testing/openmpi/current/lib/libmpi.so.0(PMPI_Bcast+0x13e) [0x2a956b405e]
[bru25:30651] [ 8] a.out(main+0xd6) [0x400d0f]
[bru25:30651] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a960a34bb]
[bru25:30651] [10] a.out [0x400b7a]
[bru25:30651] *** End of error message ***
[bru34:06039]
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 30651 on node bru25 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

On Fri, 27 Jun 2008, Doug Roberts wrote:

>
> Hi, I am trying to use the latest release of v1.3 to test with BLCR;
> however, I just noticed that sometime after 1.3a1r18423 the standard
> MPICH sample code (cpi.c) stopped working on our rel4-based Myrinet
> GM clusters, which raises some concern.
>
> Please find attached: gm_board_info.out, ompi_info--all.out,
> ompi_info--param-btl-gm.out and config-1.4a1r18743.log bundled
> in mpi-output.tar.gz for your analysis.
>
> Below shows that the sample code runs with 1.3a1r18423, but crashes
> with 1.3a1r18740 and, in fact, with every snapshot newer than
> 1.3a1r18423 that I have tested.