Open MPI User's Mailing List Archives

Subject: [OMPI users] Openib with > 32 cores per node
From: Robert Horton (r.horton_at_[hidden])
Date: 2011-05-19 07:28:03


Hi,

I'm having problems getting the MPIRandomAccess part of the HPCC
benchmark to run with more than 32 processes on each node (each node has
4 x AMD 6172 so 48 cores total). Once I go past 32 processes I get an
error like:

[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
[compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb] error in endpoint reply start connect
[compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[compute-1-13.local:6137] *** An error occurred in MPI_Isend
[compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD
[compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list
[compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
[[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc] from compute-1-13.local to: compute-1-13 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278870912 opcode

I've tried changing btl_openib_receive_queues from
P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
to
P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32

Doing this lets the code run without the error, but it runs extremely
slowly. I'm also seeing errors in dmesg such as:

CPU 12:
Modules linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand powernow_k8 freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U) ib_core(U) igb 8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3980, comm: qib/12 Tainted: G 2.6.18-164.6.1.el5 #1
RIP: 0010:[<ffffffff80094409>] [<ffffffff80094409>] tasklet_action+0x90/0xfd
RSP: 0018:ffff810c2f1bff40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff810c2f1bff30
RDX: 0000000000000000 RSI: ffff81042f063400 RDI: ffffffff8030d180
RBP: ffff810c2f1bfec0 R08: 0000000000000001 R09: ffff8104aec2d000
R10: ffff810c2f1bff00 R11: ffff810c2f1bff00 R12: ffffffff8005dc8e
R13: ffff81042f063480 R14: ffffffff80077874 R15: ffff810c2f1bfec0
FS: 00002b20829592e0(0000) GS:ffff81042f186bc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b2080b70720 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
 <IRQ> [<ffffffff8001235a>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006cb20>] do_softirq+0x2c/0x85
 [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
 <EOI> [<ffffffff800da30c>] __kmalloc+0x97/0x9f
 [<ffffffff88220d8b>] :ib_qib:qib_verbs_send+0xdb3/0x104a
 [<ffffffff80064b20>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff881f66ca>] :ib_qib:qib_make_rc_req+0xbb1/0xbbf
 [<ffffffff881f5b19>] :ib_qib:qib_make_rc_req+0x0/0xbbf
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff881f8aa1>] :ib_qib:qib_do_send+0x91a/0x950
 [<ffffffff8002e2e3>] __wake_up+0x38/0x4f
 [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
 [<ffffffff8004d7fb>] run_workqueue+0x94/0xe4
 [<ffffffff8004a043>] worker_thread+0x0/0x122
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a133>] worker_thread+0xf0/0x122
 [<ffffffff8008c3bd>] default_wake_function+0x0/0xe
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003297c>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003287e>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Any thoughts on how to proceed?

I'm running Open MPI 1.4.3, compiled with gcc 4.1.2, and OFED 1.5.3.1.
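
For reference, the two btl_openib_receive_queues values quoted above can be broken into their fields with a short script. This is only a sketch in Python: the field names (size, num_buffers, low_watermark, window_size, reserve for "P" per-peer queues; size, num_buffers, low_watermark, max_pending_sends for "S" shared queues) are my reading of the openib BTL documentation and should be treated as an assumption, as should the helper names parse_receive_queues and buffer_bytes, which are made up for this illustration. It handles only "P" and "S" entries.

```python
# Sketch: split btl_openib_receive_queues strings into named fields.
# Field names are an assumption based on the openib BTL documentation.
PER_PEER_FIELDS = ("size", "num_buffers", "low_watermark", "window_size", "reserve")
SHARED_FIELDS = ("size", "num_buffers", "low_watermark", "max_pending_sends")

OLD = "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32"
NEW = "P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32"


def parse_receive_queues(spec):
    """Return one dict per colon-separated queue entry."""
    queues = []
    for entry in spec.split(":"):
        kind, *values = entry.split(",")
        if kind not in ("P", "S"):
            raise ValueError(f"unsupported queue type: {kind!r}")
        names = PER_PEER_FIELDS if kind == "P" else SHARED_FIELDS
        queue = {"type": kind}
        # zip stops at the shorter sequence, so omitted optional fields are skipped.
        queue.update(zip(names, (int(v) for v in values)))
        queues.append(queue)
    return queues


def buffer_bytes(queue):
    """Receive buffer space for one queue: buffer size times buffer count."""
    return queue["size"] * queue["num_buffers"]


if __name__ == "__main__":
    for old_q, new_q in zip(parse_receive_queues(OLD), parse_receive_queues(NEW)):
        print(
            f'{old_q["type"]} size={old_q["size"]}: '
            f'{old_q["num_buffers"]} -> {new_q["num_buffers"]} buffers, '
            f'{buffer_bytes(old_q)} -> {buffer_bytes(new_q)} bytes'
        )
```

If the field reading is right, the change doubles the buffer count in every queue; for the per-peer queue that space is posted per connection, and for the shared queues once per process.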

Thanks,
Rob

-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.horton_at_[hidden]  -  +44 (0) 20 7882 7345