Open MPI Development Mailing List Archives

Subject: [OMPI devel] Some questions about checkpoint/restart (12)
From: Takayuki Seki (seki_at_[hidden])
Date: 2010-05-28 03:39:59


Hi, Josh

>https://svn.open-mpi.org/trac/ompi/ticket/2397

Thank you very much for filing my questions in the ticket system.
I now have 3 new questions, and I will post them.

Regards,
Takayuki Seki

The 12th question is as follows:

(12) Checkpointing an MPI job that uses two (or more?) openib btl modules fails.
     To reproduce, please build Open MPI with the "--enable-debug" configure option.
     An assertion fails in mca_btl_openib_ft_event.

Framework : bml
Component : r2
The source file : ompi/mca/bml/r2/bml_r2_ft.c
The function name : mca_bml_r2_ft_event

Framework : btl
Component : openib
The source file : ompi/mca/btl/openib/btl_openib.c
The function name : mca_btl_openib_ft_event

* The following message is printed in mca_btl_openib_ft_event.
  a.out: ../../../../../ompi/mca/btl/openib/btl_openib.c:1603: mca_btl_openib_ft_event: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&mca_btl_openib_component.ib_procs))->obj_magic_id' failed.

* Hardware/system requirements.
  The node has two active openib ports.

  Here's the output of ifconfig.
      ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
      ib2 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00

  Here's the output of ibv_devinfo.
      hca_id: mlx4_0
                port: 1
                        state: PORT_ACTIVE (4)
                port: 2
                        state: PORT_DOWN (1)
      hca_id: mlx4_1
                port: 1
                        state: PORT_ACTIVE (4)
                port: 2
                        state: PORT_DOWN (1)

* Debugging output.
   mpiexec -n 2 -mca btl self,openib -am ft-enable-cr ...

   DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=33
   DEBUG: r2 call btl ft 2aaaade70213 0 self
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib
   DEBUG: r2 call btl ft 2aaaad9fa2ac 0 openib

   The number of processes is 2.
   The specified btl list is self,openib.
   The total btl module count is 33, of which 32 are openib modules.

* The r2 ft_event function calls the btl ft_event function of each module.
  Therefore, it calls openib's ft_event function (mca_btl_openib_ft_event) 32 times.

            /*
             * Call ft_event in:
             * - BTL modules
             * - MPool modules
             *
             * These should be cleaning out stale state, and memory references in
             * preparation for being shut down.
             */
            for(btl_idx = 0; btl_idx < mca_bml_r2.num_btl_modules; btl_idx++) {

* mca_btl_openib_ft_event seems to release the resources of all openib modules at once.

        for (i = 0; i < mca_btl_openib_component.ib_num_btls; ++i ) {
            mca_btl_openib_finalize_resources( &(mca_btl_openib_component.openib_btls[i])->super);
        }
           /* This closes all openib modules at once. */

        mca_btl_openib_component.devices_count = 0;
        mca_btl_openib_component.ib_num_btls = 0;
        OBJ_DESTRUCT(&mca_btl_openib_component.ib_procs);
           /* When mca_btl_openib_ft_event is called for the second time,
              the assertion fails at this point, because ib_procs was
              already destructed by the first call. */

        ompi_btl_openib_connect_base_finalize();
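
  For illustration only, the failure and one possible guard can be sketched
  in standalone C. This is not Open MPI code: every toy_* name, the
  torn_down flag, and the magic value are assumptions made for the sketch.
  It models a component whose shared state may be destructed only once,
  while the per-module hook is invoked once per module, as in the loop above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical magic id marking a constructed object (toy value). */
#define TOY_MAGIC_ID UINT64_C(0xdeafbeeddeafbeed)

typedef struct {
    uint64_t magic_id;      /* TOY_MAGIC_ID while the object is constructed */
} toy_object_t;

typedef struct {
    int          num_btls;        /* modules owned by the component */
    toy_object_t procs;           /* component-wide state, in the role of ib_procs */
    int          torn_down;       /* guard: nonzero once teardown has run */
    int          teardown_count;  /* how many times teardown actually ran */
} toy_component_t;

static void toy_construct(toy_object_t *obj)
{
    obj->magic_id = TOY_MAGIC_ID;
}

static void toy_destruct(toy_object_t *obj)
{
    /* Without the guard below, a second destruct would trip this assertion. */
    assert(obj->magic_id == TOY_MAGIC_ID);
    obj->magic_id = 0;
}

/* Per-module checkpoint hook that releases component-wide state only once. */
static int toy_ft_event_checkpoint(toy_component_t *comp)
{
    if (comp->torn_down) {
        return 0;               /* calls for sibling modules become no-ops */
    }
    comp->num_btls = 0;
    toy_destruct(&comp->procs);
    comp->torn_down = 1;
    comp->teardown_count++;
    return 0;
}

/* Call the hook once per module, as a per-module loop would;
 * return how many times the teardown actually ran. */
int toy_simulate_checkpoint(int num_modules)
{
    toy_component_t comp = { num_modules, { 0 }, 0, 0 };

    toy_construct(&comp.procs);
    for (int i = 0; i < num_modules; ++i) {
        toy_ft_event_checkpoint(&comp);
    }
    return comp.teardown_count;
}
```

  Calling toy_simulate_checkpoint(32) runs the teardown once and skips the
  remaining 31 calls. Removing the torn_down check makes the second call
  fail in toy_destruct, which mirrors the assertion reported above. A real
  fix would also have to reset the guard on restart, which this sketch
  does not model.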

* Case using tcp instead of openib (for reference).
   mpiexec -n 2 -mca btl self,tcp -am ft-enable-cr ...

    DEBUG: mca_bml_r2_ft_event 0 num_btl_modules=4
    DEBUG: r2 call btl ft 2aaaad89d213 0 self
    DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
    DEBUG: r2 call btl ft 2aaaadaad590 0 tcp
    DEBUG: r2 call btl ft 2aaaadaad590 0 tcp

   The tcp module count is 3.
   The r2 ft_event function calls tcp's ft_event function (mca_btl_tcp_ft_event) 3 times.
   However, mca_btl_tcp_ft_event takes no action.
   (This amounts to 3 no-op calls.)

* Should the r2 ft_event function call the btl ft_event function only once for each btl component?
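
  To make the question concrete, here is a minimal sketch of a loop that
  visits every module but invokes the hook only once per distinct component.
  It is not the r2 implementation: the toy_* types, the ft_event_calls
  counter, and the linear "already seen" scan are assumptions chosen to
  keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for BTL component and module descriptors. */
typedef struct {
    const char *name;
    int         ft_event_calls;   /* counts simulated ft_event invocations */
} toy_btl_component_t;

typedef struct {
    toy_btl_component_t *component;   /* component that owns this module */
} toy_btl_module_t;

/* Return 1 if the component of modules[idx] already appears at an earlier index. */
static int toy_component_seen(toy_btl_module_t *const *modules, size_t idx)
{
    for (size_t j = 0; j < idx; ++j) {
        if (modules[j]->component == modules[idx]->component) {
            return 1;
        }
    }
    return 0;
}

/* Simulate ft_event once per distinct component; return the number of calls made. */
size_t toy_ft_event_once_per_component(toy_btl_module_t *const *modules,
                                       size_t num_modules)
{
    size_t calls = 0;

    assert(modules != NULL || num_modules == 0);
    for (size_t i = 0; i < num_modules; ++i) {
        if (toy_component_seen(modules, i)) {
            continue;   /* this component was already notified */
        }
        modules[i]->component->ft_event_calls++;
        calls++;
    }
    return calls;
}
```

  With one self module and 32 openib modules, as in the debugging output
  above, this loop would make two calls instead of 33. The quadratic scan
  is acceptable for a few dozen modules; a larger module list would call
  for a per-component "visited" flag instead.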