bones> cat ~/.openmpi/mca-params.conf btl_base_verbose=1 btl_openib_enable_apm_over_ports=1 bones> cat machines bones sulu bones> mpirun -np 2 -machinefile machines simple_sendrecv [bones][[37211,1],0][btl_openib_ip.c:365:add_rdma_addr] Adding addr 192.168.1.230 (0xe601a8c0) subnet 0xc0a80100 as mlx4_0:1 [sulu][[37211,1],1][btl_openib_ip.c:365:add_rdma_addr] Adding addr 192.168.1.222 (0xde01a8c0) subnet 0xc0a80100 as mlx4_0:1 [bones][[37211,1],0][btl_openib_component.c:622:init_one_port] looking for mlx4_0:1 GID index 0 [bones][[37211,1],0][btl_openib_component.c:653:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000 [bones][[37211,1],0][btl_openib_component.c:1322:init_apm_port] APM-PORT: Setting alternative port - 2, lid - 3 [bones][[37211,1],0][btl_openib_component.c:1426:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4 [bones][[37211,1],0][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [sulu][[37211,1],1][btl_openib_component.c:622:init_one_port] looking for mlx4_0:1 GID index 0 [bones][[37211,1],0][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [bones][[37211,1],0][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [bones][[37211,1],0][connect/btl_openib_connect_rdmacm.c:1804:rdmacm_component_query] rdmacm_component_query [bones][[37211,1],0][btl_openib_ip.c:133:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list [bones][[37211,1],0][btl_openib_ip.c:142:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 192.168.1.230 (0xe601a8c0) [bones][[37211,1],0][connect/btl_openib_connect_rdmacm.c:1714:ipaddrcheck] Found device mlx4_0:1 = IP address 192.168.1.230 (0xe601a8c0):45985 [bones][[37211,1],0][connect/btl_openib_connect_rdmacm.c:1740:ipaddrcheck] creating new server to listen on 192.168.1.230 (0xe601a8c0):45985 [sulu][[37211,1],1][btl_openib_component.c:653:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000 [sulu][[37211,1],1][btl_openib_component.c:1322:init_apm_port] APM-PORT: Setting alternative port - 2, lid - 5 [bones][[37211,1],0][btl_openib_async.c:163:btl_openib_async_commandh] GOT event from -> 17 [sulu][[37211,1],1][btl_openib_component.c:1426:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4 [sulu][[37211,1],1][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [sulu][[37211,1],1][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [sulu][[37211,1],1][btl_openib_component.c:1471:setup_qps] srq: rd_num is 256 rd_low is 128 sd_max is 32 rd_max is 64 srq_limit is 12 [bones][[37211,1],0][btl_openib_async.c:166:btl_openib_async_commandh] Adding device [17] to async event poll[1] [sulu][[37211,1],1][connect/btl_openib_connect_rdmacm.c:1804:rdmacm_component_query] rdmacm_component_query [sulu][[37211,1],1][btl_openib_ip.c:133:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list [sulu][[37211,1],1][btl_openib_ip.c:142:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 192.168.1.222 (0xde01a8c0) [sulu][[37211,1],1][connect/btl_openib_connect_rdmacm.c:1714:ipaddrcheck] Found device mlx4_0:1 = IP address 192.168.1.222 (0xde01a8c0):55744 [sulu][[37211,1],1][connect/btl_openib_connect_rdmacm.c:1740:ipaddrcheck] creating new server to listen on 192.168.1.222 (0xde01a8c0):55744 [sulu][[37211,1],1][btl_openib_async.c:163:btl_openib_async_commandh] GOT event from -> 17 [sulu][[37211,1],1][btl_openib_async.c:166:btl_openib_async_commandh] Adding device [17] to async event poll[1] [bones][[37211,1],0][btl_openib_proc.c:176:mca_btl_openib_proc_create] unpack: 1 btls [bones][[37211,1],0][btl_openib_proc.c:196:mca_btl_openib_proc_create] unpacked btl 0: modex message, offset now 26 [bones][[37211,1],0][btl_openib_proc.c:202:mca_btl_openib_proc_create] unpacked btl 0: number of cpcs to follow 2 (offset now 27) [bones][[37211,1],0][btl_openib_proc.c:217:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: index 0 (offset now 28) [bones][[37211,1],0][btl_openib_proc.c:221:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: component oob [bones][[37211,1],0][btl_openib_proc.c:228:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: priority 50, msg len 0 (offset now 30) [bones][[37211,1],0][btl_openib_proc.c:217:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: index 2 (offset now 31) [bones][[37211,1],0][btl_openib_proc.c:221:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: component rdmacm [bones][[37211,1],0][btl_openib_proc.c:228:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: priority 30, msg len 14 (offset now 33) [bones][[37211,1],0][btl_openib_proc.c:242:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: blob unpacked 16 80 (offset now 47) [bones][[37211,1],0][btl_openib_proc.c:259:mca_btl_openib_proc_create] unpacking done! [bones][[37211,1],0][btl_openib.c:649:mca_btl_openib_add_procs] got 1 port_infos [bones][[37211,1],0][btl_openib.c:652:mca_btl_openib_add_procs] got a subnet fe80000000000000 [bones][[37211,1],0][btl_openib.c:655:mca_btl_openib_add_procs] Got a matching subnet! [sulu][[37211,1],1][btl_openib_proc.c:176:mca_btl_openib_proc_create] unpack: 1 btls [sulu][[37211,1],1][btl_openib_proc.c:196:mca_btl_openib_proc_create] unpacked btl 0: modex message, offset now 26 [sulu][[37211,1],1][btl_openib_proc.c:202:mca_btl_openib_proc_create] unpacked btl 0: number of cpcs to follow 2 (offset now 27) [sulu][[37211,1],1][btl_openib_proc.c:217:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: index 0 (offset now 28) [sulu][[37211,1],1][btl_openib_proc.c:221:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: component oob [sulu][[37211,1],1][btl_openib_proc.c:228:mca_btl_openib_proc_create] unpacked btl 0: cpc 0: priority 50, msg len 0 (offset now 30) [sulu][[37211,1],1][btl_openib_proc.c:217:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: index 2 (offset now 31) [sulu][[37211,1],1][btl_openib_proc.c:221:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: component rdmacm [sulu][[37211,1],1][btl_openib_proc.c:228:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: priority 30, msg len 14 (offset now 33) [sulu][[37211,1],1][btl_openib_proc.c:242:mca_btl_openib_proc_create] unpacked btl 0: cpc 1: blob unpacked 16 80 (offset now 47) [sulu][[37211,1],1][btl_openib_proc.c:259:mca_btl_openib_proc_create] unpacking done! [sulu][[37211,1],1][btl_openib.c:649:mca_btl_openib_add_procs] got 1 port_infos [sulu][[37211,1],1][btl_openib.c:652:mca_btl_openib_add_procs] got a subnet fe80000000000000 [sulu][[37211,1],1][btl_openib.c:655:mca_btl_openib_add_procs] Got a matching subnet! Iteration 1 [bones][[37211,1],0][btl_openib.c:1134:mca_btl_openib_prepare_src] frag->sg_entry.lkey = 2281719324 .addr = 7fff2b580bd0 frag->segment.seg_key.key32[0] = 2281719324 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:544:send_connect_data] packing 1 of 12 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:551:send_connect_data] packing 1 of 15 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:596:send_connect_data] packing 1 of 13 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:602:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:609:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:627:send_connect_data] Sent QP Info, LID = 2, SUBNET = fe80000000000000 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:667:rml_recv_cb] unpacking 1 of 12 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:675:rml_recv_cb] unpacking 1 of 15 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:726:rml_recv_cb] unpacking 1 of 13 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:733:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:740:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:751:rml_recv_cb] Received QP Info, LID = 2, SUBNET = fe80000000000000 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:243:reply_start_connect] Initialized QPs, LID = 4 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:282:set_remote_info] Setting QP info, LID = 2 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:544:send_connect_data] packing 1 of 12 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:551:send_connect_data] packing 1 of 15 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:560:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:568:send_connect_data] packing 1 of 13 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:580:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:587:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:596:send_connect_data] packing 1 of 13 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:602:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:609:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:667:rml_recv_cb] unpacking 1 of 12 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:675:rml_recv_cb] unpacking 1 of 15 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:684:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:691:rml_recv_cb] unpacking 1 of 13 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:627:send_connect_data] Sent QP Info, LID = 4, SUBNET = fe80000000000000 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:708:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:716:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:726:rml_recv_cb] unpacking 1 of 13 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:733:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:740:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:751:rml_recv_cb] Received QP Info, LID = 4, SUBNET = fe80000000000000 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:282:set_remote_info] Setting QP info, LID = 4 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:544:send_connect_data] packing 1 of 12 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:551:send_connect_data] packing 1 of 15 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:560:send_connect_data] packing 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:568:send_connect_data] packing 1 of 13 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:627:send_connect_data] Sent QP Info, LID = 2, SUBNET = fe80000000000000 [bones][[37211,1],0][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [bones][[37211,1],0][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 5, src_bits: 0:0, old_dlid 4 [bones][[37211,1],0][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [bones][[37211,1],0][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 5, src_bits: 0:0, old_dlid 4 [bones][[37211,1],0][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [bones][[37211,1],0][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 5, src_bits: 0:0, old_dlid 4 [bones][[37211,1],0][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [bones][[37211,1],0][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 5, src_bits: 0:0, old_dlid 4 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:667:rml_recv_cb] unpacking 1 of 12 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:675:rml_recv_cb] unpacking 1 of 15 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:684:rml_recv_cb] unpacking 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:691:rml_recv_cb] unpacking 1 of 13 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:751:rml_recv_cb] Received QP Info, LID = 1, SUBNET = fe80000000000000 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:544:send_connect_data] packing 1 of 12 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:551:send_connect_data] packing 1 of 15 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:560:send_connect_data] packing 1 of 14 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:568:send_connect_data] packing 1 of 13 [sulu][[37211,1],1][connect/btl_openib_connect_oob.c:627:send_connect_data] Sent QP Info, LID = 4, SUBNET = fe80000000000000 [sulu][[37211,1],1][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [sulu][[37211,1],1][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 3, src_bits: 0:0, old_dlid 2 [sulu][[37211,1],1][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [sulu][[37211,1],1][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 3, src_bits: 0:0, old_dlid 2 [sulu][[37211,1],1][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [sulu][[37211,1],1][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 3, src_bits: 0:0, old_dlid 2 [sulu][[37211,1],1][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [sulu][[37211,1],1][btl_openib_async.c:545:apm_update_port] New APM port loaded: alt_src_port:2, dlid: 3, src_bits: 0:0, old_dlid 2 [sulu][[37211,1],1][btl_openib.c:1235:mca_btl_openib_prepare_dst] frag->sg_entry.lkey = 1476416538 .addr = 7fff2eebdd50 frag->segment.seg_key.key32[0] = 1476416538 Sleeping... [bones][[37211,1],0][btl_openib.c:1134:mca_btl_openib_prepare_src] frag->sg_entry.lkey = 2281719324 .addr = 7fff2b580bd0 frag->segment.seg_key.key32[0] = 2281719324 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:667:rml_recv_cb] unpacking 1 of 12 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:675:rml_recv_cb] unpacking 1 of 15 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:684:rml_recv_cb] unpacking 1 of 14 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:691:rml_recv_cb] unpacking 1 of 13 Done Iteration 2 [bones][[37211,1],0][connect/btl_openib_connect_oob.c:751:rml_recv_cb] Received QP Info, LID = 1, SUBNET = fe80000000000000 [sulu][[37211,1],1][btl_openib.c:1235:mca_btl_openib_prepare_dst] frag->sg_entry.lkey = 1476416538 .addr = 7fff2eebdd50 frag->segment.seg_key.key32[0] = 1476416538 Sleeping...Done Iteration 3 [bones][[37211,1],0][btl_openib.c:1134:mca_btl_openib_prepare_src] frag->sg_entry.lkey = 2281719324 .addr = 7fff2b580bd0 frag->segment.seg_key.key32[0] = 2281719324 [sulu][[37211,1],1][btl_openib.c:1235:mca_btl_openib_prepare_dst] frag->sg_entry.lkey = 1476416538 .addr = 7fff2eebdd50 frag->segment.seg_key.key32[0] = 1476416538 Sleeping... -------------------------------------------------------------------------- The OpenFabrics stack has reported a network error event. Open MPI will try to continue, but your job may end up failing. Local host: sulu MPI process PID: 9528 Error number: 10 (IBV_EVENT_PORT_ERR) This error may indicate connectivity problems within the fabric; please contact your system administrator. -------------------------------------------------------------------------- [bones][[37211,1],0][btl_openib.c:1134:mca_btl_openib_prepare_src] frag->sg_entry.lkey = 2281719324 .addr = 7fff2b580bd0 frag->segment.seg_key.key32[0] = 2281719324 Done Iteration 4 [bones][[37211,1],0][btl_openib_async.c:327:btl_openib_async_deviceh] Alternative path migration event reported [bones][[37211,1],0][btl_openib_async.c:329:btl_openib_async_deviceh] Trying to find additional path... [bones][[37211,1],0][btl_openib_async.c:557:mca_btl_openib_load_apm] APM: Loading alternative path [bones][[37211,1],0][btl_openib_async.c:516:apm_update_port] APM: already all ports were used port_num 2 apm_port 2 [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [bones:11179] 1 more process has sent help message help-mpi-btl-openib.txt / of error event [[37211,1],0][btl_openib_component.c:3348:handle_wc] from bones to: sulu error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 19cc980 opcode 0 vendor error 129 qp_idx 0 -------------------------------------------------------------------------- The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38): The total number of times that the sender wishes the receiver to retry timeout, packet sequence, etc. errors before posting a completion error. This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue. Two MCA parameters can be used to control Open MPI's behavior with respect to the retry count: * btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value). * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 20). The actual timeout value used is calculated as: 4.096 microseconds * (2^btl_openib_ib_timeout) See the InfiniBand spec 1.2 (section 12.7.34) for more details. Below is some information about the host that raised the error and the peer to which it was connected: Local host: bones Local device: mlx4_0 Peer host: sulu You may need to consult with your system administrator to get this problem fixed. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun has exited due to process rank 0 with PID 11181 on node bones exiting improperly. There are two reasons this could occur: 1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination. 2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination" This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). --------------------------------------------------------------------------