Open MPI User's Mailing List Archives

From: Lydia Heck (lydia.heck_at_[hidden])
Date: 2006-11-23 06:14:18


The same run on 32 CPUs almost completes: it starts to write the 32 restart
files and then fails with the same problem:

Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:33
/opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
/opt/ompi/lib/libopal.so.0.0.0:0x99df5
/lib/amd64/libc.so.1:0xcb276
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:0x102c7 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0x3d
/opt/mx/lib/amd64/libmyriexpress.so:mx__test_common+0x22
/opt/mx/lib/amd64/libmyriexpress.so:mx_test+0x37
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_send+0x288
/opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_send+0x3fc
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_sendrecv_actual_localcompleted+0x85
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_recursivedoubling+0x1a3
/opt/ompi/lib/openmpi/mca_coll_tuned.so:ompi_coll_tuned_barrier_intra_dec_fixed+0x44
/opt/ompi/lib/libmpi.so.0.0.0:MPI_Barrier+0x9d
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:restart+0x9a0
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x219
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
mv: cannot access ./restart.20
31 additional processes aborted (not shown)
m2001(27) >
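
For readers comparing the two backtraces in this thread, a small script can pull out the frame flagged with "[ Signal 11 (SEGV)]" and the MPI call underneath it. What follows is only a rough sketch, not part of the original report: the helper names (FRAME_RE, parse_frames, fault_summary) are made up for illustration, and the line format is assumed from the frames printed in this message ("<path>:<symbol>+0x<offset>" or "<path>:0x<address>"), not taken from any Open MPI documentation.

```python
import re

# Assumed frame format, inferred from the traces in this message only:
#   <object path>:<symbol>+0x<offset>   or   <object path>:0x<address>
# optionally followed by "[ Signal 11 (SEGV)]", optionally quoted with "> ".
FRAME_RE = re.compile(
    r"^(?:>\s*)?(?P<obj>/\S+?):(?:(?P<sym>[A-Za-z_]\w*)\+)?0x(?P<off>[0-9a-fA-F]+)"
    r"(?P<sig>\s*\[\s*Signal\s+\d+\s+\(\w+\)\s*\])?\s*$"
)

def parse_frames(text):
    """Return one dict per backtrace frame; non-frame lines are skipped."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append({
                "object": m.group("obj").rsplit("/", 1)[-1],  # file name only
                "symbol": m.group("sym"),        # None if only an address was printed
                "offset": int(m.group("off"), 16),
                "signal": m.group("sig") is not None,
            })
    return frames

def fault_summary(text):
    """Return (object of the flagged frame, first named symbol at or after it,
    first libmpi symbol after it), or None if no frame carries the signal marker."""
    frames = parse_frames(text)
    flagged = next((i for i, f in enumerate(frames) if f["signal"]), None)
    if flagged is None:
        return None
    rest = frames[flagged:]
    first_named = next((f["symbol"] for f in rest if f["symbol"]), None)
    mpi_entry = next(
        (f["symbol"] for f in rest
         if f["symbol"] and f["object"].startswith("libmpi.")),
        None,
    )
    return frames[flagged]["object"], first_named, mpi_entry

# Frames copied from the 32-CPU trace in this message.
TRACE_32 = """\
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:0x102c7 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0x3d
/opt/mx/lib/amd64/libmyriexpress.so:mx_test+0x37
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_send+0x288
/opt/ompi/lib/libmpi.so.0.0.0:MPI_Barrier+0x9d
"""

# Frames copied from the quoted 64-CPU trace, including the "> " quote prefix.
TRACE_64 = """\
> /lib/amd64/libc.so.1:0xc0642
> /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)]
> /opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174
> /opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae
"""

if __name__ == "__main__":
    print(fault_summary(TRACE_32))
    print(fault_summary(TRACE_64))
```

On these excerpts the summary names libmyriexpress.so and mx__luigi for both runs, reached from MPI_Barrier in the 32-CPU case and from PMPI_Irecv in the 64-CPU case. In the 32-CPU trace the flagged frame itself is an unnamed address one level above mx__luigi. This only restates what the traces already show and says nothing about the underlying cause.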

On Thu, 23 Nov 2006, Lydia Heck wrote:

>
> Gadget2 (which I cannot attach because it is not publicly available)
> runs perfectly well on any number of processes on systems such
> as Solaris 10 - Sun CT6 gigabit, Sun CT5 and Myrinet GM, IBM Regatta ...
>
> Sorry to be so expansive ...
>
> When I run the code on 32 CPUs with Open MPI over MX, using the
> Studio 11 compilers on a Solaris x64 system, it works fine until near
> the end, when it fails to write all the restart files.
>
> When I run the code on 64 CPUs, it fails with the following error message:
>
> Topnodes=218193 costlimit=0.0890015 countlimit=428.229
> Before=44417
> After=46281
> NTopleaves= 40496 NTopnodes=46281 (space for 347252)
> desired memory imbalance=2.83425 (limit=100719, needed=114185)
> Note: the domain decomposition is suboptimum because the ceiling for
> memory-imbalance is reached
> work-load balance=1.28529 memory-balance=1.01948
> exchange of 0002589387 particles
> Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
> Failing at addr:5192cbd0
> /opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
> /opt/ompi/lib/libopal.so.0.0.0:0x99df5
> /lib/amd64/libc.so.1:0xcb276
> /lib/amd64/libc.so.1:0xc0642
> /opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)]
> /opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174
> /opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116
> /opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b
> /opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
> /data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
> *** End of error message ***
> 63 additional processes aborted (not shown)
> m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm
> ./Gadget2 param.txt
>
> As this is one of our main production codes, I need to make sure
> that it runs on every system that I install. Any ideas would be welcome.
>
> Lydia
>
>
>
> ------------------------------------------
> Dr E L Heck
>
> University of Durham
> Institute for Computational Cosmology
> Ogden Centre
> Department of Physics
> South Road
>
> DURHAM, DH1 3LE
> United Kingdom
>
> e-mail: lydia.heck_at_[hidden]
>
> Tel.: + 44 191 - 334 3628
> Fax.: + 44 191 - 334 3645
> ___________________________________________
>

------------------------------------------
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.heck_at_[hidden]

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________