On Mon, Aug 13, 2007 at 05:00:37PM +0300, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
> > FWIW: we fixed this recently in the openib BTL by ensuring that all
> > registered memory is freed during the BTL finalize (vs. the mpool
> > finalize).
> > This is a new issue because the mpool finalize was just recently
> > expanded to un-register all of its memory as part of the NIC-restart
> > effort (and will likely also be needed for checkpoint/restart...?).
> The mpool rdma finalize was an empty function. I changed it to do the
> "finalize" job: go over all registered segments in the mpool and release
> them one by one.
> The mpool uses a reference counter for each memory region, which protects
> us from a double-free bug. In the openib BTL, all memory that was registered
> with the mpool is also unregistered through the mpool at the finalize stage.
> So maybe in gm the memory that was registered with the mpool is released
> directly (not via the mpool), and that causes the segfault.
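To illustrate the reference-counting point above, here is a minimal sketch of how a per-region counter can make a repeated release harmless. The names `sketch_reg`, `sketch_retain`, and `sketch_release` are hypothetical and are not taken from the Open MPI mpool code; the real registration structure and release path may differ.

```c
#include <assert.h>

/* Hypothetical registration record; not the real mpool registration type. */
struct sketch_reg {
    int ref_count;   /* number of outstanding users of this region */
    int registered;  /* 1 while the region is still registered with the NIC */
};

/* Take one more reference on a region. */
static void sketch_retain(struct sketch_reg *r)
{
    r->ref_count++;
}

/* Drop one reference. Returns 1 only for the call that actually
 * unregisters the region, and 0 otherwise. A release on a region whose
 * count is already zero is a no-op, so it cannot unregister twice. */
static int sketch_release(struct sketch_reg *r)
{
    if (r->ref_count > 0 && --r->ref_count == 0 && r->registered) {
        r->registered = 0;  /* stand-in for the real unregister call */
        return 1;
    }
    return 0;
}
```

With two references held, only the second `sketch_release` call unregisters the region, and any later call does nothing.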
As far as I understand, the problem Tim sees is much more serious. During
finalize the gm BTL is unloaded, and only after that is mpool finalize
called. The mpool uses callbacks into the gm BTL to register/unregister
memory, but by then the BTL is no longer there.
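The ordering problem can be sketched roughly as follows. This is only an illustration under assumptions: all `sketch_*` names are hypothetical stand-ins, not the real gm BTL or mpool rdma interfaces, and a `loaded` flag simulates component unloading (after a real unload the stale function pointer cannot be checked this way, which is why the call crashes). The sketch shows the approach Jeff describes for openib: release all registrations during BTL finalize, so mpool finalize has nothing left to call back for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: the deregister function lives in the BTL. */
typedef int (*sketch_dereg_fn)(void *btl_ctx, void *region);

struct sketch_btl {
    int loaded;        /* simulates whether the component is still loaded */
    int deregistered;  /* counts deregister calls, for illustration only */
};

struct sketch_mpool {
    sketch_dereg_fn deregister;  /* points into the BTL's code */
    struct sketch_btl *btl;      /* NULL once the BTL has gone away */
    void *regions[4];            /* registered regions still outstanding */
    size_t nregions;
};

/* The BTL-side callback the mpool uses to unregister a region. */
static int sketch_btl_deregister(void *btl_ctx, void *region)
{
    struct sketch_btl *btl = btl_ctx;
    (void)region;
    btl->deregistered++;
    return 0;
}

/* Release every outstanding region through the BTL callback.
 * Returns the number released, or -1 if the BTL is no longer usable. */
static int sketch_mpool_release_all(struct sketch_mpool *mp)
{
    int released = 0;
    if (mp->btl == NULL || !mp->btl->loaded || mp->deregister == NULL)
        return -1;
    while (mp->nregions > 0) {
        mp->nregions--;
        mp->deregister(mp->btl, mp->regions[mp->nregions]);
        released++;
    }
    return released;
}

/* BTL finalize: drain the registrations first, then detach and "unload". */
static int sketch_btl_finalize(struct sketch_btl *btl, struct sketch_mpool *mp)
{
    int released = sketch_mpool_release_all(mp);
    mp->deregister = NULL;
    mp->btl = NULL;
    btl->loaded = 0;
    return released;
}

/* mpool finalize, which runs later: nothing to do if already drained. */
static int sketch_mpool_finalize(struct sketch_mpool *mp)
{
    if (mp->nregions == 0)
        return 0;
    return sketch_mpool_release_all(mp);
}
```

If `sketch_mpool_finalize` ran with regions still outstanding after the BTL was gone, the sketch returns -1 at the point where the real code would jump through a pointer into unloaded code.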
> > On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:
> >> Hi folks,
> >> I have run into a problem with mca_mpool_rdma_finalize as implemented in
> >> r15557. With the t_win onesided test, running over gm, it segfaults. What
> >> appears to be happening is that some memory is registered with gm, and then
> >> gets freed by mca_mpool_rdma_finalize. But the free function that it is
> >> using is in the gm btl, and the btls are unloaded before the mpool is shut
> >> down. So the function call segfaults.
> >> If I change the code so we never unload the btls (and we don't free the gm
> >> port), it works fine.
> >> Note that the openib btl works just fine.
> >> Forgive me if this is a known problem, I am trying to catch up from my
> >> vacation...
> >> Tim
> >> ---
> >> If anyone cares, here is the callstack:
> >> (gdb) bt
> >> #0 0x404de825 in ?? () from /lib/libgcc_s.so.1
> >> #1 0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
> >> #2 0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> >> #3 0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> >> #4 0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> >> #5 0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> >> (gdb)
> >> gdb shows that at this point the gm btl is no longer loaded.
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel