With both leave_pinned and use_mem_hook enabled on a linpack run, we hit the assertion in the memory callback. That is to say, a free is occurring in the middle of a registration.
| Existing registrations: | |
| Base | Bound | Length |
| 241615360 | 244841607 | 3226248 |
| 244841608 | 246428807 | 1587200 |
| 246428808 | 248016007 | 1587200 |
| 248019648 | 251245895 | 3226248 |
| | |
| Trying to free | | |
| 247917216 | | |
| From | | |
| Base | Bound | |
| 246428808 | 248016007 | |
When we hit the assert, we are trying to free 247917216, which falls in the middle of that registration. Note that we have NOT resized any registrations, so I am confident the problem is not in the tree or the resize code, at least as far as linpack is concerned.
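To make the overlap concrete, here is a small Python sketch of the check. It is not Open MPI code, and the names (REGISTRATIONS, find_registration) are made up for illustration. It assumes the bounds in the table are inclusive, which matches Length = Bound - Base + 1 for every row, and it takes the freed address and length from the buf and length arguments of the release hook in the callstack.

```python
# Illustrative only: not Open MPI code. All names here are hypothetical.
# Registrations as (base, bound) pairs with inclusive bounds, copied from the table.
REGISTRATIONS = [
    (241615360, 244841607),
    (244841608, 246428807),
    (246428808, 248016007),
    (248019648, 251245895),
]


def find_registration(addr, registrations=REGISTRATIONS):
    """Return the (base, bound) pair that contains addr, or None."""
    for base, bound in registrations:
        if base <= addr <= bound:
            return base, bound
    return None


free_base = 0xEC6EAA0   # buf passed to the release hook (247917216 in decimal)
free_size = 31624       # length passed to the release hook
free_last = free_base + free_size - 1  # last byte of the freed block

hit = find_registration(free_base)
if hit is not None:
    base, bound = hit
    print("freed address:   ", free_base)
    print("registration:    ", base, bound)
    print("offset from base:", free_base - base)
    print("entirely inside: ", free_last <= bound)
```

With the numbers above, the freed block starts 1488408 bytes past the base of the third registration and ends before its bound, so the whole freed region sits inside a single registration rather than straddling two.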
Here is the callstack:
#0 0x0000002a95f079c9 in raise () from /lib/libc.so.6
#1 0x0000002a95f08e6e in abort () from /lib/libc.so.6
#2 0x0000002a95f01690 in __assert_fail () from /lib/libc.so.6
#3 0x0000002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0, size=31624,
cbdata=0x0) at mpool_base_mem_cb.c:53
#4 0x0000002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0,
length=31624) at memory.c:121
#5 0x0000002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0,
caller=0x42b052) at memory_malloc_hooks.c:66
#6 0x000000000042b052 in ATL_dmmIJK ()
#7 0x000000000064f9b1 in ATL_dgemmNN ()
#8 0x000000000057722b in ATL_dgemmNN_RB ()
#9 0x0000000000577fc3 in ATL_rtrsmRUN ()
#10 0x000000000042c63c in ATL_dtrsm ()
#11 0x0000000000423c1e in atl_f77wrap_dtrsm__ ()
#12 0x0000000000423a94 in dtrsm_ ()
#13 0x0000000000411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8,
TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1, A=0x7fbfffefa0, LDA=0,
B=0x202, LDB=0) at HPL_dtrsm.c:949
#14 0x000000000040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0,
PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
#15 0x000000000041936f in HPL_pdgesvK2 (GRID=0x7fbffff4a0, ALGO=0x7fbffff460,
A=0x7fbffff260) at HPL_pdgesvK2.c:178
#16 0x000000000040d6f7 in HPL_pdgesv (GRID=0x7fbffff4a0, ALGO=0x460d,
A=0x7fbffff260) at HPL_pdgesv.c:107
#17 0x0000000000405b10 in HPL_pdtest (TEST=0x7fbffff430, GRID=0x7fbffff4a0,
ALGO=0x7fbffff460, N=10000, NB=80) at HPL_pdtest.c:193
#18 0x0000000000401840 in main (ARGC=1, ARGV=0x7fbffff928)
at HPL_pddriver.c:223
Note that the free occurs inside the ATLAS library. I will look into rebuilding linpack with another BLAS library to see what happens. Any other suggestions?
Thanks,
Galen