TERMINAL 1: [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery --hostfile ../hostfile --display-map --display-allocation --bynode ./whoami 10 10 [clus9:25998] mca: base: components_open: Looking for errmgr components [clus9:25998] mca: base: components_open: including only errmgr components that are checkpoint enabled [clus9:25998] mca: base: components_open: (errmgr) Component app is Checkpointable [clus9:25998] mca: base: components_open: (errmgr) Component hnp is Checkpointable [clus9:25998] mca: base: components_open: (errmgr) Component orted is Checkpointable [clus9:25998] mca: base: components_open: opening errmgr components [clus9:25998] mca: base: components_open: found loaded component app [clus9:25998] mca: base: components_open: component app has no register function [clus9:25998] mca: base: components_open: component app open function successful [clus9:25998] mca: base: components_open: found loaded component hnp [clus9:25998] mca: base: components_open: component hnp has no register function [clus9:25998] errmgr:hnp: open() [clus9:25998] errmgr:hnp: open: priority = 50 [clus9:25998] errmgr:hnp: open: verbosity = 0 [clus9:25998] errmgr:hnp: open: --- CR Migration Options --- [clus9:25998] errmgr:hnp: open: Process Migration = Enabled [clus9:25998] errmgr:hnp: open: timing = Disabled [clus9:25998] errmgr:hnp: open: --- Auto. Recovery Options --- [clus9:25998] errmgr:hnp: open: Auto. 
Recover = Enabled [clus9:25998] errmgr:hnp: open: timing = Disabled [clus9:25998] errmgr:hnp: open: recover_delay = 1 [clus9:25998] mca: base: components_open: component hnp open function successful [clus9:25998] mca: base: components_open: found loaded component orted [clus9:25998] mca: base: components_open: component orted has no register function [clus9:25998] mca: base: components_open: component orted open function successful [clus9:25998] mca:base:select: Auto-selecting errmgr components [clus9:25998] mca:base:select:(errmgr) Querying component [app] [clus9:25998] mca:base:select:(errmgr) Skipping component [app]. Query failed to return a module [clus9:25998] mca:base:select:(errmgr) Querying component [hnp] [clus9:25998] errmgr:hnp:component_query() [clus9:25998] mca:base:select:(errmgr) Query of component [hnp] set priority to 50 [clus9:25998] mca:base:select:(errmgr) Querying component [orted] [clus9:25998] mca:base:select:(errmgr) Skipping component [orted]. Query failed to return a module [clus9:25998] mca:base:select:(errmgr) Selected component [hnp] [clus9:25998] mca: base: close: component app closed [clus9:25998] mca: base: close: unloading component app [clus9:25998] mca: base: close: component orted closed [clus9:25998] mca: base: close: unloading component orted [clus9:25998] errmgr:hnp(crmig): init() [clus9:25998] errmgr:base:tool: Startup Command Line Channel [clus9:25998] errmgr:hnp(autor):init() [clus9:25998] snapc:full: open() [clus9:25998] snapc:full: open: priority = 20 [clus9:25998] snapc:full: open: verbosity = 20 [clus9:25998] snapc:full: open: max_wait_time = 20 [clus9:25998] snapc:full: open: progress_meter = 0 [clus9:25998] snapc:full: component_query() [clus9:25998] snapc:full: module_init(1, 1) [clus9:25998] snapc:full: module_init: Global Snapshot Coordinator [clus9:25998] [[54840,0],0] hostfile: checking hostfile ../hostfile for nodes ====================== ALLOCATED NODES ====================== Data for node: clus9 Num slots: 8 
Max slots: 8 Data for node: node1 Num slots: 8 Max slots: 8 ================================================================= [clus9:25998] [[54840,0],0] hostfile: filtering nodes through hostfile ../hostfile ======================== JOB MAP ======================== Data for node: clus9 Num procs: 1 Process OMPI jobid: [54840,1] Process rank: 0 Data for node: node1 Num procs: 1 Process OMPI jobid: [54840,1] Process rank: 1 ============================================================= [clus9:25998] Global) Setup job [54840,1] as the Global Coordinator [clus9:25998] Global) [0] Found Daemon [[54840,0],0] with 1 procs [clus9:25998] Global) [0] Found Process [[54840,1],0] on Daemon [[54840,0],0] [clus9:25998] Global) [1] Found Daemon [[54840,0],1] with 1 procs [clus9:25998] Global) [0] Found Process [[54840,1],1] on Daemon [[54840,0],1] [clus9:25998] Global) Startup Coordinator Channel [clus9:25998] Global) Startup Command Line Channel [clus9:25998] Global) Finished setup of job [54840,1] [clus9:25998] progressed_wait: ../../../../../orte/mca/plm/rsh/plm_rsh_module.c 1378 hmeyer@node1's password: Daemon was launched on clus1 - beginning to initialize [clus1:11536] mca: base: components_open: Looking for errmgr components [clus1:11536] mca: base: components_open: including only errmgr components that are checkpoint enabled [clus1:11536] mca: base: components_open: (errmgr) Component app is Checkpointable [clus1:11536] mca: base: components_open: (errmgr) Component hnp is Checkpointable [clus1:11536] mca: base: components_open: (errmgr) Component orted is Checkpointable [clus1:11536] mca: base: components_open: opening errmgr components [clus1:11536] mca: base: components_open: found loaded component app [clus1:11536] mca: base: components_open: component app has no register function [clus1:11536] mca: base: components_open: component app open function successful [clus1:11536] mca: base: components_open: found loaded component hnp [clus1:11536] mca: base: 
components_open: component hnp has no register function [clus1:11536] errmgr:hnp: open() [clus1:11536] errmgr:hnp: open: priority = 50 [clus1:11536] errmgr:hnp: open: verbosity = 0 [clus1:11536] errmgr:hnp: open: --- CR Migration Options --- [clus1:11536] errmgr:hnp: open: Process Migration = Enabled [clus1:11536] errmgr:hnp: open: timing = Disabled [clus1:11536] errmgr:hnp: open: --- Auto. Recovery Options --- [clus1:11536] errmgr:hnp: open: Auto. Recover = Enabled [clus1:11536] errmgr:hnp: open: timing = Disabled [clus1:11536] errmgr:hnp: open: recover_delay = 1 [clus1:11536] mca: base: components_open: component hnp open function successful [clus1:11536] mca: base: components_open: found loaded component orted [clus1:11536] mca: base: components_open: component orted has no register function [clus1:11536] mca: base: components_open: component orted open function successful [clus1:11536] mca:base:select: Auto-selecting errmgr components [clus1:11536] mca:base:select:(errmgr) Querying component [app] [clus1:11536] mca:base:select:(errmgr) Skipping component [app]. Query failed to return a module [clus1:11536] mca:base:select:(errmgr) Querying component [hnp] [clus1:11536] errmgr:hnp:component_query() [clus1:11536] mca:base:select:(errmgr) Skipping component [hnp]. 
Query failed to return a module [clus1:11536] mca:base:select:(errmgr) Querying component [orted] [clus1:11536] mca:base:select:(errmgr) Query of component [orted] set priority to 10 [clus1:11536] mca:base:select:(errmgr) Selected component [orted] [clus1:11536] mca: base: close: component app closed [clus1:11536] mca: base: close: unloading component app [clus9:25998] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164 [clus9:25998] progressed_wait: ../../../../orte/mca/plm/base/plm_base_launch_support.c 357 [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] [[54840,0],0] orte:daemon:send_relay [clus9:25998] [[54840,0],0] orte:daemon:send_relay sending relay msg to 1 [clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus9:25998] [[54840,0],0] orted_cmd: received add_local_procs [clus9:25998] Global) Setup job [54840,1] as the Local Coordinator [clus9:25998] Local) Setting up jobid [54840,1] [clus9:25998] Local) Startup Application State Channel [clus9:25998] Local) Finished setup of job [54840,1] [clus1:11536] errmgr:hnp: close() [clus1:11536] mca: base: close: component hnp closed [clus1:11536] mca: base: close: unloading component hnp [clus1:11536] snapc:full: open() [clus1:11536] snapc:full: open: priority = 20 [clus1:11536] snapc:full: open: verbosity = 20 [clus1:11536] snapc:full: open: max_wait_time = 20 [clus1:11536] snapc:full: open: progress_meter = 0 [clus1:11536] snapc:full: component_query() [clus1:11536] snapc:full: module_init(0, 0) [clus1:11536] snapc:full: module_init: Local Snapshot Coordinator Daemon [[54840,0],1] checking in as pid 11536 on host clus1 [clus1:11536] [[54840,0],1] orted: up and running - waiting for commands! 
[clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,0],0] [clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162 [clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. Process state updated for process [[54840,1],0] [clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [54840,1] reported state LAUNCHED for proc [[54840,1],0] state LAUNCHED exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [54840,1] reported state LAUNCHED for proc [[54840,1],0] state LAUNCHED exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp: job [54840,1] reported state LAUNCHED for proc [[54840,1],0] state LAUNCHED pid 26001 exit_code 0 [clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. Process state updated for process NULL [clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [54840,1] reported state RUNNING for proc NULL state UNDEFINED exit_code 1 [clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [54840,1] reported state RUNNING for proc NULL state UNDEFINED exit_code 1 [clus9:25998] [[54840,0],0] errmgr:hnp: job [54840,1] reported state RUNNING for proc NULL state UNDEFINED pid 0 exit_code 1 [clus9:25998] [[54840,0],0] errmgr:hnp: job [54840,1] reported state RUNNING [clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. 
Process state updated for process [[54840,1],1] [clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [54840,1] reported state UNDEFINED for proc [[54840,1],1] state RUNNING exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [54840,1] reported state UNDEFINED for proc [[54840,1],1] state RUNNING exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp: job [54840,1] reported state UNDEFINED for proc [[54840,1],1] state RUNNING pid 11537 exit_code 0 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 Antes de MPI_Init Antes de MPI_Init [clus1:11536] [[54840,0],1] node[0].name clus9 daemon 0 [clus1:11536] [[54840,0],1] node[1].name node1 daemon 1 [clus1:11536] [[54840,0],1] orte:daemon:send_relay [clus1:11536] [[54840,0],1] orte:daemon:send_relay - recipient list is empty! [clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus1:11536] [[54840,0],1] orted_cmd: received add_local_procs [clus1:11536] Local) Setting up jobid [54840,1] [clus1:11536] Local) Startup Coordinator Channel [clus1:11536] Local) Startup Application State Channel [clus1:11536] Local) Finished setup of job [54840,1] [clus1:11536] [[54840,0],1] errmgr:orted got state LAUNCHED for proc [[54840,1],1] pid 11537 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] mca: base: components_open: Looking for errmgr components [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] mca: base: components_open: including only errmgr components that are checkpoint enabled [clus1:11537] mca: base: components_open: (errmgr) Component app is Checkpointable [clus1:11537] mca: base: components_open: (errmgr) Component hnp is Checkpointable [clus1:11537] mca: base: components_open: (errmgr) Component orted is Checkpointable [clus1:11537] mca: base: components_open: opening errmgr components [clus1:11537] mca: base: 
components_open: found loaded component app [clus1:11537] mca: base: components_open: component app has no register function [clus1:11537] mca: base: components_open: component app open function successful [clus1:11537] mca: base: components_open: found loaded component hnp [clus1:11537] mca: base: components_open: component hnp has no register function [clus1:11537] errmgr:hnp: open() [clus1:11537] errmgr:hnp: open: priority = 50 [clus1:11537] errmgr:hnp: open: verbosity = 0 [clus1:11537] errmgr:hnp: open: --- CR Migration Options --- [clus1:11537] errmgr:hnp: open: Process Migration = Enabled [clus1:11537] errmgr:hnp: open: timing = Disabled [clus1:11537] errmgr:hnp: open: --- Auto. Recovery Options --- [clus1:11537] errmgr:hnp: open: Auto. Recover = Enabled [clus1:11537] errmgr:hnp: open: timing = Disabled [clus1:11537] errmgr:hnp: open: recover_delay = 1 [clus1:11537] mca: base: components_open: component hnp open function successful [clus1:11537] mca: base: components_open: found loaded component orted [clus1:11537] mca: base: components_open: component orted has no register function [clus1:11537] mca: base: components_open: component orted open function successful [clus1:11537] mca:base:select: Auto-selecting errmgr components [clus1:11537] mca:base:select:(errmgr) Querying component [app] [clus1:11537] mca:base:select:(errmgr) Query of component [app] set priority to 10 [clus1:11537] mca:base:select:(errmgr) Querying component [hnp] [clus1:11537] errmgr:hnp:component_query() [clus1:11537] mca:base:select:(errmgr) Skipping component [hnp]. Query failed to return a module [clus1:11537] mca:base:select:(errmgr) Querying component [orted] [clus1:11537] mca:base:select:(errmgr) Skipping component [orted]. 
Query failed to return a module [clus1:11537] mca:base:select:(errmgr) Selected component [app] [clus1:11537] errmgr:hnp: close() [clus1:11537] mca: base: close: component hnp closed [clus1:11537] mca: base: close: unloading component hnp [clus1:11537] mca: base: close: component orted closed [clus1:11537] mca: base: close: unloading component orted [clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,1],1] [clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162 [clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,1],1] for tag 1 [clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP [clus1:11536] [[54840,0],1] orted_recv: received sync+nidmap from local proc [[54840,1],1] [clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. Process state updated for process [[54840,1],1] [clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [54840,1] reported state SYNC REGISTERED for proc [[54840,1],1] state SYNC REGISTERED exit_code 1 [clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [54840,1] reported state SYNC REGISTERED for proc [[54840,1],1] state SYNC REGISTERED exit_code 1 [clus9:25998] [[54840,0],0] errmgr:hnp: job [54840,1] reported state SYNC REGISTERED for proc [[54840,1],1] state SYNC REGISTERED pid 0 exit_code 1 [clus1:11536] [[54840,0],1] errmgr:orted got state SYNC REGISTERED for proc [[54840,1],1] pid 0 [clus1:11536] [[54840,0],1] errmgr:orted: sending contact info to HNP [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor: processing commands completed [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: open() [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:25998] defining 
message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: open: priority = 20 [clus1:11537] snapc:full: open: verbosity = 20 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: open: max_wait_time = 20 [clus1:11537] snapc:full: open: progress_meter = 0 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: component_query() [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: module_init(0, 1) [clus9:25998] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus1:11537] snapc:full: module_init: Application Snapshot Coordinator [clus1:11537] App) Initalized for Application [[54840,1],1] [clus1:11537] app) Named Pipes (/tmp/opal_cr_prog_read.11537_0) (/tmp/opal_cr_prog_write.11537_0), Signal (10) [clus1:11536] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:26001] mca: base: components_open: Looking for errmgr components [clus9:26001] mca: base: components_open: including only errmgr components that are checkpoint enabled [clus9:26001] mca: base: components_open: (errmgr) Component app is Checkpointable [clus9:26001] mca: base: components_open: (errmgr) Component hnp is Checkpointable [clus9:26001] mca: base: components_open: (errmgr) Component orted is Checkpointable [clus9:26001] mca: base: components_open: opening errmgr components [clus9:26001] mca: base: components_open: found loaded component app [clus9:26001] mca: base: components_open: component app has no register function [clus9:26001] 
mca: base: components_open: component app open function successful [clus9:26001] mca: base: components_open: found loaded component hnp [clus9:26001] mca: base: components_open: component hnp has no register function [clus9:26001] errmgr:hnp: open() [clus9:26001] errmgr:hnp: open: priority = 50 [clus9:26001] errmgr:hnp: open: verbosity = 0 [clus9:26001] errmgr:hnp: open: --- CR Migration Options --- [clus9:26001] errmgr:hnp: open: Process Migration = Enabled [clus9:26001] errmgr:hnp: open: timing = Disabled [clus9:26001] errmgr:hnp: open: --- Auto. Recovery Options --- [clus9:26001] errmgr:hnp: open: Auto. Recover = Enabled [clus9:26001] errmgr:hnp: open: timing = Disabled [clus9:26001] errmgr:hnp: open: recover_delay = 1 [clus9:26001] mca: base: components_open: component hnp open function successful [clus9:26001] mca: base: components_open: found loaded component orted [clus9:26001] mca: base: components_open: component orted has no register function [clus9:26001] mca: base: components_open: component orted open function successful [clus9:26001] mca:base:select: Auto-selecting errmgr components [clus9:26001] mca:base:select:(errmgr) Querying component [app] [clus9:26001] mca:base:select:(errmgr) Query of component [app] set priority to 10 [clus9:26001] mca:base:select:(errmgr) Querying component [hnp] [clus9:26001] errmgr:hnp:component_query() [clus9:26001] mca:base:select:(errmgr) Skipping component [hnp]. Query failed to return a module [clus9:26001] mca:base:select:(errmgr) Querying component [orted] [clus9:26001] mca:base:select:(errmgr) Skipping component [orted]. 
Query failed to return a module [clus9:26001] mca:base:select:(errmgr) Selected component [app] [clus9:26001] errmgr:hnp: close() [clus9:26001] mca: base: close: component hnp closed [clus9:26001] mca: base: close: unloading component hnp [clus9:26001] mca: base: close: component orted closed [clus9:26001] mca: base: close: unloading component orted [clus9:25998] [[54840,0],0] orted_recv_cmd: received message from [[54840,1],0] [clus9:25998] defining message event: ../../orte/orted/orted_comm.c 162 [clus9:25998] [[54840,0],0] orted_recv_cmd: reissued recv [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,1],0] for tag 1 [clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP [clus9:25998] [[54840,0],0] orted_recv: received sync+nidmap from local proc [[54840,1],0] [clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. Process state updated for process [[54840,1],0] [clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [INVALID] reported state UNDEFINED for proc [[54840,1],0] state SYNC REGISTERED exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [INVALID] reported state UNDEFINED for proc [[54840,1],0] state SYNC REGISTERED exit_code 0 [clus9:25998] [[54840,0],0] errmgr:hnp: job [INVALID] reported state UNDEFINED for proc [[54840,1],0] state SYNC REGISTERED pid 0 exit_code 0 [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor: processing commands completed [clus9:26001] snapc:full: open() [clus9:26001] snapc:full: open: priority = 20 [clus9:26001] snapc:full: open: verbosity = 20 [clus9:26001] snapc:full: open: max_wait_time = 20 [clus9:26001] snapc:full: open: progress_meter = 0 [clus9:26001] snapc:full: component_query() [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:26001] snapc:full: module_init(0, 1) [clus9:26001] snapc:full: module_init: Application Snapshot Coordinator [clus9:26001] App) Initalized for 
Application [[54840,1],0] [clus9:26001] app) Named Pipes (/tmp/opal_cr_prog_read.26001_0) (/tmp/opal_cr_prog_write.26001_0), Signal (10) [clus9:26001] app) Startup Barrier... [clus9:25998] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164 [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] [[54840,0],0] orte:daemon:send_relay [clus9:25998] [[54840,0],0] orte:daemon:send_relay sending relay msg to 1 [clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS [clus9:25998] [[54840,0],0] orted_cmd: received message_local_procs [clus9:25998] [[54840,0],0] orted:comm:message_local_procs delivering message to job [54840,1] tag 17 [clus9:26001] app) Startup Barrier: Send INIT to HNP...! [clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,0],0] [clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162 [clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] Global) Receive a command message from [[54840,1],0]. [clus9:25998] defining message event: ../../../../../orte/mca/snapc/full/snapc_full_global.c 1021 [clus9:26001] app) Startup Barrier: Done! [clus9:25998] Global) Command: Request Op [clus9:25998] Global) process_request_op(): Op Code 1 [clus9:25998] Global) process_request_op(): Checkpointing Enabled ( 1) [clus1:11536] [[54840,0],1] orte:daemon:send_relay [clus1:11536] [[54840,0],1] orte:daemon:send_relay - recipient list is empty! 
[clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS [clus1:11536] [[54840,0],1] orted_cmd: received message_local_procs [clus1:11536] [[54840,0],1] orted:comm:message_local_procs delivering message to job [54840,1] tag 17 [clus1:11536] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164 [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] [[54840,0],0] orte:daemon:send_relay [clus9:25998] [[54840,0],0] orte:daemon:send_relay sending relay msg to 1 [clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS [clus9:25998] [[54840,0],0] orted_cmd: received message_local_procs [clus9:25998] [[54840,0],0] orted:comm:message_local_procs delivering message to job [54840,1] tag 15 [clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,0],0] [clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162 [clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus1:11536] [[54840,0],1] orte:daemon:send_relay [clus1:11536] [[54840,0],1] orte:daemon:send_relay - recipient list is empty! 
[clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS [clus1:11536] [[54840,0],1] orted_cmd: received message_local_procs [clus1:11536] [[54840,0],1] orted:comm:message_local_procs delivering message to job [54840,1] tag 15 [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus1:11536] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../orte/mca/grpcomm/base/grpcomm_base_coll.c 899 [clus9:25998] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164 [clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus9:25998] [[54840,0],0] orte:daemon:send_relay [clus9:25998] [[54840,0],0] orte:daemon:send_relay sending relay msg to 1 [clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS [clus9:25998] [[54840,0],0] orted_cmd: received message_local_procs [clus9:25998] [[54840,0],0] orted:comm:message_local_procs delivering message to job [54840,1] tag 17 [clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,0],0] [clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162 [clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv [clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1 [clus1:11536] [[54840,0],1] orte:daemon:send_relay [clus1:11536] [[54840,0],1] orte:daemon:send_relay - recipient list is empty! 
[clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
[clus1:11536] [[54840,0],1] orted_cmd: received message_local_procs
[clus1:11536] [[54840,0],1] orted:comm:message_local_procs delivering message to job [54840,1] tag 17
[clus9:25998] errmgr:base:tool:recv() Command Line: Start a migration operation [Sender = [[54819,0],0]]
[clus9:25998] defining message event: ../../../../orte/mca/errmgr/base/errmgr_base_tool.c 342
[clus9:25998] errmgr:base:tool:recv() Command line requested process migration [command 1]
[clus9:25998] errmgr:base:tool:update() Sending update command
[clus9:25998] errmgr:hnp(crmig):migrate() ------- Migrating ( 0, 1, 1) -------
[clus9:25998] errmgr:base:tool:update() Sending update command
[clus9:25998] errmgr:hnp(crmig):migrate() Requested Processes to migrate: (0 procs)
[clus9:25998] errmgr:hnp(crmig):migrate() Requested Nodes to migration: (1 nodes)
[clus9:25998] "node1" 1
[clus9:25998] "[[54840,1],1]" [0x20000]
[clus9:25998] errmgr:hnp(crmig):migrate() Suggested nodes to migration onto: (1 nodes)
[clus9:25998] "clus9"
[clus9:25998] errmgr:hnp(crmig):migrate() Suggested nodes to migration onto (exclusive): (0 nodes)
[clus9:25998] errmgr:hnp(crmig):migrate() All Migrating Processes: (1 procs)
[clus9:25998] "[[54840,1],1]" [0x20000] [node1]
--------------------------------------------------------------------------
Notice: A migration of this job has been requested.
The processes below will be migrated.
Please standby.
[[54840,1],1] Rank 1 on Node node1
--------------------------------------------------------------------------
[clus9:25998] errmgr:base:tool:update() Sending update command
[clus9:25998] errmgr:hnp(crmig):migrate() ------- Starting the checkpoint of job [54840,1] -------
[clus9:25998] errmgr:hnp(crmig):migrate() ------- Terminate old processes in job [54840,1] -------
[clus9:25998] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164
[clus9:25998] errmgr:hnp(crmig):migrate() ------- Waiting for termination -------
[clus9:25998] [[54840,0],0] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1
[clus9:25998] [[54840,0],0] orte:daemon:send_relay
[clus9:25998] [[54840,0],0] orte:daemon:send_relay sending relay msg to 1
[clus9:25998] [[54840,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_KILL_LOCAL_PROCS
[clus9:25998] Still waiting for termination: "[[54840,1],1]" [0x20000] != [0x100]
[clus1:11536] [[54840,0],1] orted_recv_cmd: received message from [[54840,0],0]
[clus1:11536] defining message event: ../../orte/orted/orted_comm.c 162
[clus1:11536] [[54840,0],1] orted_recv_cmd: reissued recv
[clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],0] for tag 1
[clus1:11536] [[54840,0],1] orte:daemon:send_relay
[clus1:11536] [[54840,0],1] orte:daemon:send_relay - recipient list is empty!
[clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_KILL_LOCAL_PROCS
[clus1:11536] defining message event: ../../../../../orte/mca/iof/orted/iof_orted_read.c 218
[clus1:11536] [[54840,0],1] orte:daemon:cmd:processor called by [[54840,0],1] for tag 1
[clus1:11536] [[54840,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_IOF_COMPLETE
[clus1:11536] [[54840,0],1] orted_cmd: received iof_complete cmd
[clus1:11536] [[54840,0],1] errmgr:orted got state KILLED BY INTERNAL COMMAND for proc [[54840,1],1] pid 11537
[clus9:25998] Still waiting for termination: "[[54840,1],1]" [0x20000] != [0x100]
[clus9:25998] errmgr:hnp:update_state() [[54840,0],0]) ------- App. Process state updated for process [[54840,1],1]
[clus9:25998] [[54840,0],0] errmgr:hnp(crmig): job [54840,1] reported state UNDEFINED for proc [[54840,1],1] state KILLED BY INTERNAL COMMAND exit_code 0
[clus9:25998] [[54840,0],0] errmgr:hnp(autor): job [54840,1] reported state UNDEFINED for proc [[54840,1],1] state KILLED BY INTERNAL COMMAND exit_code 0
[clus9:25998] errmgr:hnp(crmig):migrate() ------- Checkpoint finished, setting up job [54840,1] -------
[clus9:25998] errmgr:base:tool:update() Sending update command
[clus9:25998] [[54840,0],0]:../../../../orte/mca/plm/base/plm_base_rsh_support.c(1438) reseting exit status
[clus9:25998] *** Process received signal ***
[clus9:25998] Signal: Segmentation fault (11)
[clus9:25998] Signal code: Address not mapped (1)
[clus9:25998] Failing at address: 0x98
[clus1:11536] [[54840,0],1] errmgr:orted reporting proc [[54840,1],1] aborted to HNP (local procs = 0)
[clus1:11536] [[54840,0],1] orte:daemon:cmd:processor: processing commands completed
[clus9:25998] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
[clus9:25998] [ 1] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0 [0x2aaaaad2780b]
[clus9:25998] [ 2] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0(orte_errmgr_base_update_app_context_for_cr_recovery+0x90) [0x2aaaaad270a5]
[clus9:25998] [ 3] /home/hmeyer/desarrollo/ompi-code/binarios/lib/openmpi/mca_errmgr_hnp.so [0x2aaaaf18e0ab]
[clus9:25998] [ 4] /home/hmeyer/desarrollo/ompi-code/binarios/lib/openmpi/mca_errmgr_hnp.so [0x2aaaaf18b586]
[clus9:25998] [ 5] /home/hmeyer/desarrollo/ompi-code/binarios/lib/openmpi/mca_errmgr_hnp.so [0x2aaaaf183b98]
[clus9:25998] [ 6] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0 [0x2aaaaad28fe0]
[clus9:25998] [ 7] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0 [0x2aaaaadc7263]
[clus9:25998] [ 8] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0 [0x2aaaaadc76ae]
[clus9:25998] [ 9] /home/hmeyer/desarrollo/ompi-code/binarios/lib/libopen-rte.so.0(event_base_loop+0x1cf) [0x2aaaaadc7a7d]
[clus9:25998] [10] /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun [0x4039b5]
[clus9:25998] [11] /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun [0x402c43]
[clus9:25998] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaac2e48a4]
[clus9:25998] [13] /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun [0x402b99]
[clus9:25998] *** End of error message ***
Segmentation fault
[clus1:11536] [[54840,0],1] routed:cm: Connection to lifeline [[54840,0],0] lost
[hmeyer@clus9 whoami]$ [clus1:11536] snapc:full: module_finalize()
[clus1:11536] Local) Shutdown Application State Channel
[clus1:11536] Local) Shutdown Coordinator Channel
[clus1:11536] snapc:full: close()

TERMINAL 2
[hmeyer@clus9 codes]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node1 -t clus9 25998
[clus9:26005] snapc:full: open()
[clus9:26005] snapc:full: open: priority = 20
[clus9:26005] snapc:full: open: verbosity = 20
[clus9:26005] snapc:full: open: max_wait_time = 20
[clus9:26005] snapc:full: open: progress_meter = 0
[clus9:26005] snapc:full: close()
[clus9:26005] *** Process received signal ***
[clus9:26005] Signal: Segmentation fault (11)
[clus9:26005] Signal code: Address not mapped (1)
[clus9:26005] Failing at address: (nil)
[clus9:26005] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
[clus9:26005] *** End of error message ***
Segmentation fault