Open MPI User's Mailing List Archives

From: Lydia Heck (lydia.heck_at_[hidden])
Date: 2006-11-23 05:42:49


Gadget2 (which I cannot attach because it is not publicly available)
runs perfectly well on any number of processes on systems such
as Solaris 10 - Sun CT6 gigabit, Sun CT5 and Myrinet GM, IBM Regatta ...

Sorry to be so expansive ...

When I run the code on 32 CPUs with Open MPI over MX, using the Studio 11
compilers on a Solaris x64 system, the code works fine until near the end,
when it fails to write all the restart files.

When I run the code on 64 CPUs it fails with the following error message:

Topnodes=218193 costlimit=0.0890015 countlimit=428.229
Before=44417
After=46281
NTopleaves= 40496 NTopnodes=46281 (space for 347252)
desired memory imbalance=2.83425 (limit=100719, needed=114185)
Note: the domain decomposition is suboptimum because the ceiling for
memory-imbalance is reached
work-load balance=1.28529 memory-balance=1.01948
exchange of 0002589387 particles
Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:5192cbd0
/opt/ompi/lib/libopal.so.0.0.0:opal_backtrace_print+0x10
/opt/ompi/lib/libopal.so.0.0.0:0x99df5
/lib/amd64/libc.so.1:0xcb276
/lib/amd64/libc.so.1:0xc0642
/opt/mx/lib/amd64/libmyriexpress.so:mx__luigi+0xd5 [ Signal 11 (SEGV)]
/opt/mx/lib/amd64/libmyriexpress.so:mx_irecv+0x174
/opt/ompi/lib/openmpi/mca_mtl_mx.so:ompi_mtl_mx_irecv+0x116
/opt/ompi/lib/openmpi/mca_pml_cm.so:mca_pml_cm_irecv+0x27b
/opt/ompi/lib/libmpi.so.0.0.0:PMPI_Irecv+0x1ae
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_exchange+0x11b7
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_decompose+0x4da
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:domain_Decomposition+0x467
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:run+0x9f
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:main+0x191
/data/rw9/arj/unpack/bench_test_myri2/Gadget2-multidomain/Gadget2:0x69fc
*** End of error message ***
63 additional processes aborted (not shown)
m2001(26) > /opt/ompi/bin/mpirun -np 32 -machinefile ./myh-all -mca pml cm ./Gadget2 param.txt

As this is one of our main production codes, I need to make sure
that it runs on every system I install. Any ideas would be welcome.

Lydia

------------------------------------------
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.heck_at_[hidden]

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___________________________________________