##################### [clus5:19424] [[65478,0],1] errmgr:orted got state ABORTED BY SIGNAL for proc [[65478,1],1] pid 19425 [clus9:19568] [[65478,0],0] orted_recv_cmd: received message from [[65478,0],1] [clus9:19568] defining message event: ../../orte/orted/orted_comm.c 174 [clus9:19568] [[65478,0],0] orted_recv_cmd: reissued recv [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],1] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: Unknown Command! [clus9:19568] [[65478,0],0] orted_recv: (RADIC)update state request from [[65478,0],1] [clus9:19568] UNPACK NAME correcto:[[65478,1],1] [clus9:19568] UNPACK STATE correcto:17 [clus9:19568] UNPACK PID correcto:19425 [clus9:19568] UNPACK EXIT CODE correcto:137 [clus5:19424] [[65478,0],1] errmgr:orted RADIC enabled, ignorando abort del proc [[65478,1],1] (OK, let's restart it) [clus5:19424] CHILD a restaurar [[65478,1],1] [clus5:19424] ENVIANDO AL ERRMGR HNP CON JOBSTATE = UNDEFINED y state del proc 17 [clus5:19424] ENVIANDO AL HNP EL PROCESO A RESTAURAR [clus5:19424] [[65478,0],1] orte:daemon:cmd:processor: processing commands completed [clus9:19568] Intentando recuperar al proc [[65478,1],1], soy el HNP [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. Process state updated for process [[65478,1],1] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],1] state UNKNOWN STATE! 
pid 19425 exit_code 137 [clus9:19568] errmgr:hnp: job reported estado 17 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] errmgr:hnp: ESTADO = ORTE_PROC_STATE_FAULT [clus9:19568] [[65478,0],0] RELOCANDO PROC [[65478,1],1] of node node5 and jdata jobid [65478,1] and pdata [[65478,1],1] and 0 [clus9:19568] Pasé el getitem [clus9:19568] Pasé el max-restart [clus9:19568] [[65478,0],0] LUEGO DEL APP RELOCANDO PROC [[65478,1],1] [clus9:19568] [[65478,0],0]:../../../../orte/mca/plm/base/plm_base_rsh_support.c(1438) reseting exit status [clus9:19568] [[65478,0],0] RELOCATING APP [[65478,1],1] [clus9:19568] [[65478,0],0] ANTES DEL SPAWN [clus9:19568] defining message event: ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 164 [clus9:19568] progressed_wait: ../../../../orte/mca/plm/base/plm_base_launch_support.c 357 [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus9:19568] [[65478,0],0] orte:daemon:send_relay [clus9:19568] [[65478,0],0] orte:daemon:send_relay sending relay msg to 1 [clus9:19568] [[65478,0],0] orte:daemon:send_relay sending relay msg to 2 [clus9:19568] [[65478,0],0] orte:daemon:send_relay sending relay msg to 3 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus9:19568] [[65478,0],0] orted_cmd: received add_local_procs [clus5:19424] [[65478,0],1] orted_recv_cmd: received message from [[65478,0],0] [clus3:21615] [[65478,0],2] orted_recv_cmd: received message from [[65478,0],0] [clus3:21615] defining message event: ../../orte/orted/orted_comm.c 174 [clus5:19424] [[65478,0],1] orte:daemon:send_relay [clus5:19424] [[65478,0],1] orte:daemon:send_relay - recipient list is empty! 
[clus5:19424] [[65478,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus3:21615] [[65478,0],2] orted_recv_cmd: reissued recv [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus3:21615] [[65478,0],2] node[0].name clus9 daemon 0 [clus3:21615] [[65478,0],2] node[1].name node5 daemon 1 [clus3:21615] [[65478,0],2] node[2].name node3 daemon 2 [clus3:21615] [[65478,0],2] node[3].name node4 daemon 3 [clus3:21615] [[65478,0],2] orte:daemon:send_relay [clus3:21615] [[65478,0],2] orte:daemon:send_relay - recipient list is empty! [clus3:21615] [[65478,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus3:21615] [[65478,0],2] orted_cmd: received add_local_procs [clus5:19424] [[65478,0],1] orted_cmd: received add_local_procs [clus4:18112] [[65478,0],3] orted_recv_cmd: received message from [[65478,0],0] [clus4:18112] defining message event: ../../orte/orted/orted_comm.c 174 [clus4:18112] [[65478,0],3] orted_recv_cmd: reissued recv [clus4:18112] [[65478,0],3] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus4:18112] [[65478,0],3] node[0].name clus9 daemon 0 [clus4:18112] [[65478,0],3] node[1].name node5 daemon 1 [clus4:18112] [[65478,0],3] node[2].name node3 daemon 2 [clus4:18112] [[65478,0],3] node[3].name node4 daemon 3 [clus4:18112] [[65478,0],3] orte:daemon:send_relay [clus4:18112] [[65478,0],3] orte:daemon:send_relay - recipient list is empty! [clus4:18112] [[65478,0],3] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS [clus4:18112] [[65478,0],3] orted_cmd: received add_local_procs [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. 
Process state updated for process [[65478,1],3] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],3] state RUNNING pid 18113 exit_code 0 [clus9:19568] errmgr:hnp: job reported estado 16 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. Process state updated for process [[65478,1],1] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],1] state RUNNING pid 19425 exit_code 137 [clus9:19568] errmgr:hnp: job reported estado 16 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] [[65478,0],0] orted_recv_cmd: received message from [[65478,1],1] [clus9:19568] defining message event: ../../orte/orted/orted_comm.c 174 [clus9:19568] [[65478,0],0] 
orted_recv_cmd: reissued recv [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,1],1] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: Unknown Command! [clus9:19568] [[65478,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/orted_comm.c at line 1896 [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor failed on error Bad parameter [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor: processing commands completed [clus3:21615] [[65478,0],2] orted_recv_cmd: received message from [[65478,1],1] [clus3:21615] defining message event: ../../orte/orted/orted_comm.c 174 [clus3:21615] [[65478,0],2] orted_recv_cmd: reissued recv [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor called by [[65478,1],1] for tag 1 [clus3:21615] [[65478,0],2] orted:comm:process_commands() Processing Command: Unknown Command! [clus3:21615] [[65478,0],2] orted_recv: transfer dir request from [[65478,1],1] [clus3:21615] orted_transfer_dir: getting remote directory: [clus3:21615] orted_transfer_dir: Remote location: (1:/tmp/radic/1) [clus3:21615] orted_transfer_dir: Local location: (2:/tmp/radic/) -------------------------------------------------------------------------- WARNING: Could not preload specified file: File already exists. Fileset: /tmp/radic/ Host: clus3 Will continue attempting to launch the process. 
-------------------------------------------------------------------------- [clus3:21615] filem:rsh: get(): Failed to wait on the request (-1) [clus3:21615] orted_transfer_dir: removing logfile (/tmp/radic/1/radic.log) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start SEND to=1 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=18 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start SEND to=2 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=19 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start SEND to=3 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=20 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start PRE [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start POS [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start SEND to=1 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=21 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start SEND to=2 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=22 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [1,0]:[clus9:19572] pml_v: 
vprotocol:receiver:start SEND to=3 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=23 [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event log [clus9:19568] defining message event: ../../../../orte/mca/odls/base/odls_base_default_fns.c 2770 [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. Process state updated for process [[65478,1],0] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state COMMUNICATION FAILURE for proc [[65478,1],0] state COMMUNICATION FAILURE pid 0 exit_code 1 [clus9:19568] errmgr:hnp: job reported estado 8192 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [1,0]:[clus9:19572] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,0]:[clus9:19572] pml_v: vprotocol:receiver:start PRE [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_read.c 292 [1,3]:[clus4:18113] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=6 (PRE) [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_WAITPID_FIRED [clus9:19568] [[65478,0],0] orted_cmd: received waitpid_fired cmd [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor: processing commands completed [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_IOF_COMPLETE [clus9:19568] [[65478,0],0] orted_cmd: received iof_complete cmd [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. 
Process state updated for process [[65478,1],0] [clus9:19568] [[65478,0],0] errmgr:hnp: job [INVALID] reported state UNDEFINED for proc [[65478,1],0] state ABORTED BY SIGNAL pid 19572 exit_code 141 [clus9:19568] errmgr:hnp: job reported estado 2048 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] errmgr:hnp:check_job_completed proc [[65478,1],0] aborted by signal [clus9:19568] [[65478,0],0]:../../../../../orte/mca/errmgr/hnp/errmgr_hnp.c(1136) updating exit status to 141 [clus9:19568] [[65478,0],0] errmgr:hnp:check_job_completed job [65478,1] is not terminated (1:4) [clus9:19568] [[65478,0],0] errmgr:hnp:check_job_completed at least one job is not terminated [clus9:19568] [[65478,0],0] errmgr:hnp: abort called on job [65478,1] with status 141 [clus9:19568] defining timeout: 0 sec 3000 usec at ../../../../orte/mca/plm/base/plm_base_orted_cmds.c:186 [clus9:19568] progressed_wait: ../../../../orte/mca/plm/base/plm_base_orted_cmds.c 189 [clus9:19568] defining message event: ../../../../orte/mca/plm/base/plm_base_orted_cmds.c 198 [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor: processing commands completed [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD [clus9:19568] [[65478,0],0] orted_cmd: received exit cmd [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor: processing commands completed [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [1,3]:[clus4:18113] pml_v: vprotocol:receiver:eventlog_write_log writing event log [1,3]:[clus4:18113] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) 
[1,3]:[clus4:18113] pml_v: vprotocol:receiver:eventlog_write_log writing event BUFFER (1024 elements of 1 bytes) [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [1,2]:[clus3:21616] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=6 (PRE) [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [1,3]:[clus4:18113] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=7 (POS) [1,3]:[clus4:18113] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=7 (PRE) [1,2]:[clus3:21616] pml_v: vprotocol:receiver:eventlog_write_log writing event log [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [1,2]:[clus3:21616] pml_v: vprotocol:receiver:eventlog_write_log writing event HEADER (1 elements of 152 bytes) [1,2]:[clus3:21616] pml_v: vprotocol:receiver:eventlog_write_log writing event BUFFER (1024 elements of 1 bytes) [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. 
Process state updated for process [[65478,1],1] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],1] state NORMALLY TERMINATED pid 19425 exit_code 137 [clus9:19568] errmgr:hnp: job reported estado 128 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] errmgr:hnp:check_job_completed job [65478,1] is not terminated (2:4) [clus9:19568] [[65478,0],0] errmgr:hnp:check_job_completed at least one job is not terminated [1,2]:[clus3:21616] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=7 (POS) [1,2]:[clus3:21616] pml_v: vprotocol:receiver:recv from=0 tag=-17 type=MPI_CHAR count=1024 size=1024 clock=7 (PRE) [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- Daemon state updated for process [[65478,0],1] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION FAILURE pid 0 exit_code 1 [clus9:19568] errmgr:hnp: job reported estado 8192 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] Daemons terminating - recording daemon [[65478,0],1] as gone [clus5:19424] [[65478,0],1] orted_recv_cmd: received message from [[65478,0],0] [clus5:19424] defining message event: ../../orte/orted/orted_comm.c 174 [clus5:19424] [[65478,0],1] orted_recv_cmd: reissued recv [clus5:19424] [[65478,0],1] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus5:19424] [[65478,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD [clus5:19424] [[65478,0],1] orted_cmd: received exit cmd [clus5:19424] [[65478,0],1] errmgr:orted got state NORMALLY TERMINATED for proc [[65478,1],1] pid 19425 [clus5:19424] [[65478,0],1] errmgr:orted reporting all procs in [65478,1] terminated [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- Daemon state updated for process [[65478,0],3] 
[clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],3] state COMMUNICATION FAILURE pid 0 exit_code 1 [clus9:19568] errmgr:hnp: job reported estado 8192 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] Daemons terminating - recording daemon [[65478,0],3] as gone [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus4:18112] [[65478,0],3] orted_recv_cmd: received message from [[65478,0],0] [clus4:18112] defining message event: ../../orte/orted/orted_comm.c 174 [clus4:18112] [[65478,0],3] orted_recv_cmd: reissued recv [clus4:18112] [[65478,0],3] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus4:18112] [[65478,0],3] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD [clus4:18112] [[65478,0],3] orted_cmd: received exit cmd [clus9:19568] defining message event: ../../../../../orte/mca/iof/hnp/iof_hnp_receive.c 228 [clus9:19568] [[65478,0],0] orted_recv_cmd: received message from [[65478,0],2] [clus9:19568] defining message event: ../../orte/orted/orted_comm.c 174 [clus9:19568] [[65478,0],0] orted_recv_cmd: reissued recv [clus9:19568] [[65478,0],0] orted_recv_cmd: received message from [[65478,0],2] [clus9:19568] defining message event: ../../orte/orted/orted_comm.c 174 [clus9:19568] [[65478,0],0] orted_recv_cmd: reissued recv [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],2] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: Unknown Command! 
[clus9:19568] [[65478,0],0] orted_recv: (RADIC)update state request from [[65478,0],2] [clus9:19568] UNPACK NAME correcto:[[65478,1],1] [clus9:19568] UNPACK STATE correcto:17 [clus9:19568] UNPACK PID correcto:21705 [clus9:19568] UNPACK EXIT CODE correcto:149 [clus9:19568] Intentando recuperar al proc [[65478,1],1], soy el HNP [clus3:21615] defining message event: ../../../../../orte/mca/iof/orted/iof_orted_read.c 218 [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor called by [[65478,0],2] for tag 1 [clus3:21615] [[65478,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_WAITPID_FIRED [clus3:21615] [[65478,0],2] orted_cmd: received waitpid_fired cmd [clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705 [clus3:21615] [[65478,0],2] errmgr:orted RADIC enabled, ignorando abort del proc [[65478,1],1] (OK, let's restart it) [clus3:21615] CHILD a restaurar [[65478,1],1] [clus3:21615] ENVIANDO AL ERRMGR HNP CON JOBSTATE = UNDEFINED y state del proc 17 [clus3:21615] ENVIANDO AL HNP EL PROCESO A RESTAURAR [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor: processing commands completed [clus3:21615] [[65478,0],2] orted_recv_cmd: received message from [[65478,0],0] [clus3:21615] defining message event: ../../orte/orted/orted_comm.c 174 [clus3:21615] [[65478,0],2] orted_recv_cmd: reissued recv [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor called by [[65478,0],2] for tag 1 [clus3:21615] [[65478,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_IOF_COMPLETE [clus3:21615] [[65478,0],2] orted_cmd: received iof_complete cmd [clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705 [clus3:21615] [[65478,0],2] errmgr:orted RADIC enabled, ignorando abort del proc [[65478,1],1] (OK, let's restart it) [clus3:21615] CHILD a restaurar [[65478,1],1] [clus3:21615] ENVIANDO AL ERRMGR HNP CON JOBSTATE = UNDEFINED y state del proc 17 
[clus3:21615] ENVIANDO AL HNP EL PROCESO A RESTAURAR [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor: processing commands completed [clus3:21615] [[65478,0],2] orte:daemon:cmd:processor called by [[65478,0],0] for tag 1 [clus3:21615] [[65478,0],2] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD [clus3:21615] [[65478,0],2] orted_cmd: received exit cmd [clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705 [clus3:21615] [[65478,0],2] errmgr:orted RADIC enabled, ignorando abort del proc [[65478,1],1] (OK, let's restart it) [clus3:21615] CHILD a restaurar [[65478,1],1] [clus3:21615] ENVIANDO AL ERRMGR HNP CON JOBSTATE = UNDEFINED y state del proc 17 [clus3:21615] ENVIANDO AL HNP EL PROCESO A RESTAURAR [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],1] state UNKNOWN STATE! pid 21705 exit_code 149 [clus9:19568] errmgr:hnp: job reported estado 17 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] errmgr:hnp: ESTADO = ORTE_PROC_STATE_FAULT [clus9:19568] [[65478,0],0] RELOCANDO PROC [[65478,1],1] of node node3 and jdata jobid [65478,1] and pdata [[65478,1],1] and 0 [clus9:19568] Pasé el getitem [clus9:19568] Pasé el max-restart [clus9:19568] [[65478,0],0] LUEGO DEL APP RELOCANDO PROC [[65478,1],1] [clus9:19568] [[65478,0],0]:../../../../orte/mca/plm/base/plm_base_rsh_support.c(1438) reseting exit status [clus9:19568] [[65478,0],0] RELOCATING APP [[65478,1],1] [clus9:19568] [[65478,0],0] ANTES DEL SPAWN [clus9:19568] [[65478,0],0] orted_recv_cmd: received message from [[65478,0],2] [clus9:19568] defining message event: ../../orte/orted/orted_comm.c 174 [clus9:19568] [[65478,0],0] orted_recv_cmd: reissued recv [clus9:19568] [[65478,0],0] orte:daemon:cmd:processor called by [[65478,0],2] for tag 1 [clus9:19568] [[65478,0],0] orted:comm:process_commands() Processing Command: Unknown 
Command! [clus9:19568] [[65478,0],0] orted_recv: (RADIC)update state request from [[65478,0],2] [clus9:19568] UNPACK NAME correcto:[[65478,1],1] [clus9:19568] UNPACK STATE correcto:17 [clus9:19568] UNPACK PID correcto:21705 [clus9:19568] UNPACK EXIT CODE correcto:149 [clus9:19568] Intentando recuperar al proc [[65478,1],1], soy el HNP [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- App. Process state updated for process [[65478,1],1] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,1] reported state UNDEFINED for proc [[65478,1],1] state UNKNOWN STATE! pid 21705 exit_code 149 [clus9:19568] errmgr:hnp: job reported estado 17 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] errmgr:hnp: ESTADO = ORTE_PROC_STATE_FAULT [clus9:19568] [[65478,0],0] RELOCANDO PROC [[65478,1],1] of node node3 and jdata jobid [65478,1] and pdata [[65478,1],1] and 0 [clus9:19568] Pasé el getitem [clus9:19568] Pasé el max-restart [clus9:19568] [[65478,0],0] LUEGO DEL APP RELOCANDO PROC [[65478,1],1] [clus9:19568] [[65478,0],0]:../../../../orte/mca/plm/base/plm_base_rsh_support.c(1438) reseting exit status [clus9:19568] [[65478,0],0] RELOCATING APP [[65478,1],1] [clus9:19568] [[65478,0],0] ANTES DEL SPAWN [clus9:19568] errmgr:hnp:update_state() [[65478,0],0]) ------- Daemon state updated for process [[65478,0],2] [clus9:19568] [[65478,0],0] errmgr:hnp: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],2] state COMMUNICATION FAILURE pid 0 exit_code 1 [clus9:19568] errmgr:hnp: job reported estado 8192 [clus9:19568] [[65478,0],0] errmgr:hnp: Antes de del switch por estado de proceso [clus9:19568] [[65478,0],0] Daemons terminating - recording daemon [[65478,0],2] as gone [clus9:19568] [[65478,0],0] orteds complete - exiting [clus9:19568] errmgr:hnp: close() [clus9:19568] mca: base: close: component hnp closed [clus9:19568] mca: base: close: unloading component hnp rm: cannot remove `/tmp/radic/3': No 
such file or directory
rm: cannot remove `/tmp/radic/2': No such file or directory