Open MPI User's Mailing List Archives

From: Arif Ali (aali_at_[hidden])
Date: 2007-01-19 12:51:49


See below for answers.

regards,
Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax: +44 (0)114 257 0022
Email: aali_at_[hidden]
Web: http://www.ocf.co.uk

Skype: arif_ali80
MSN: aali_at_[hidden]

Jeff Squyres wrote:
> Beware: this is a lengthy, detailed message.
>
> On Jan 18, 2007, at 3:53 PM, Arif Ali wrote:
>
>
>> 1. We have
>> HW
>> * 2xBladecenter H
>> * 2xCisco Infiniband Switch Modules
>> * 1xCisco Infiniband Switch
>> * 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA
>>
>
> Can you provide the details of your Cisco HCA?
>
*PRODUCT TYPE*: Cisco 4x InfiniBand Host Channel Adapter Expansion Card
*DEVICE TYPE*: Network adapter
*PORTS*: 2 InfiniBand ports
*DATA TRANSFER RATE*: 10 Gbps
*COMPAT*: IBM BladeCenter
• The Cisco 4x InfiniBand Host Channel Adapter Expansion Card for IBM
BladeCenter provides InfiniBand I/O capability to processor blades in
the IBM BladeCenter unit
• The host channel adapter adds 2 InfiniBand ports to the CPU blade cards
to create an IB-capable high-density cluster
• PCI-Express interface to dual 4x InfiniBand bridge
• Line rate of the interfaces is 10 Gbps per link (theoretical maximum)
• 128 MB table memory (133 MHz DDR SDRAM)
• I2C serial EEPROM holding system Vital Product Data (VPD)
• IBM proprietary blade daughter card form factor
• Forced-air cooling compatible for highly reliable operation

The output of lspci -vvv for the card gives me the following information:

0c:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
(Tavor compatibility mode) (rev a0)
Subsystem: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
compatibility mode)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size 20
Interrupt: pin A routed to IRQ 36
Region 0: Memory at 100b8900000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 100b8000000 (64-bit, prefetchable) [size=8M]
Region 4: Memory at 100b0000000 (64-bit, prefetchable) [size=128M]
Expansion ROM at 100b8800000 [disabled] [size=1M]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
Address: 0000000000000000 Data: 0000
Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
Vector table: BAR=0 offset=00082000
PBA: BAR=0 offset=00082200
Capabilities: [60] Express Endpoint IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <64ns, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
Link: Supported Speed 2.5Gb/s, Width x8, ASPM L0s, Port 8
Link: Latency L0s unlimited, L1 unlimited
Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
Link: Speed 2.5Gb/s, Width x8

>
>> SW
>> * SLES 10
>> * OFED 1.1 w. OpenMPI 1.1.1
>>
>> I am running the Intel MPI Benchmark (IMB) on the cluster as part of
>> the validation process for the customer.
>>
>> I have tried the OpenMPI that comes with OFED 1.1, which gave
>> spurious "Not Enough Memory" error messages. After looking through
>> the FAQs (with the help of Cisco) I was able to find the problems
>> and fixes: I used the FAQs to add unlimited soft and hard limits for
>> memlock and turned RDMA off by using "--mca btl_openib_flags 1".
>> This still did not work, and I still got the memory problems.
>>
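(For the archives, this is roughly what the memlock change looked like on
our nodes -- a sketch only; the hostfile name and process count below are
placeholders, and paths may differ on other distributions:)

# /etc/security/limits.conf on every compute node (applied via pam_limits)
*    soft    memlock    unlimited
*    hard    memlock    unlimited

# note: the limit must be in effect in whatever session actually starts
# the MPI processes (e.g., the sshd or resource-manager environment)

# benchmark launched with RDMA turned off:
mpirun --mca btl_openib_flags 1 -np 64 --hostfile hosts ./IMB-MPI1
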
>
> As a clarification: I suggested setting the btl_openib_flags to 1 as
> one means of [potentially] reducing the amount of registered memory
> to verify that the amount of registered memory available in the
> system is the problem (especially because it was dying with large
> messages in the all-to-all pattern). With that setting, we got
> through the alltoall test (which we previously couldn't). So it
> seemed to indicate that on that platform, there isn't much registered
> memory available (even though there's 8GB available on each blade).
>
> Are you saying that a full run of the IMB still failed with the same
> "cannot register any more memory" kind of error?
>
> I checked with Brad Benton -- an OMPI developer from IBM -- he
> confirms that on the JS21s, depending on the version of your
> firmware, you will be limited to 256M or 512M of registerable memory
> (256M = older firmware, 512M = newer firmware). This could very
> definitely be a factor in what is happening here.
>
> Can you let us know what version of the firmware you have?
>
The firmware on the blades is the latest; without it, the IB cards would
not be recognised. The only (and latest) firmware on the IBM web page was
released on 06/09/2006:
*Version 2.00, 01MB245_300_002*
>
>> I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed
>> miserably.
>>
>
> Can you describe what happened there? Is it failing in a different way?
>
Here's the output

#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Fri Jan 19 17:33:52 2007
# Machine : ppc64
# System : Linux
# Release : 2.6.16.21-0.8-ppc64
# Version : #1 SMP Mon Jul 3 18:25:39 UTC 2006

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 58 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.76         0.00
            1         1000         1.88         0.51
            2         1000         1.89         1.01
            4         1000         1.91         2.00
            8         1000         1.88         4.05
           16         1000         2.02         7.55
           32         1000         2.05        14.88
[0,1,4][btl_openib_component.c:1153:btl_openib_component_progress] from
node03 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
status number 10 for wr_id 268969528 opcode 128
[0,1,28][btl_openib_component.c:1153:btl_openib_component_progress] from
node09 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
status number 10 for wr_id 268906808 opcode 128
[0,1,58][btl_openib_component.c:1153:btl_openib_component_progress] from
node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
status number 10 for wr_id 268919352 opcode 256614836
[0,1,0][btl_openib_component.c:1153:btl_openib_component_progress] from
node02 to: node03 error polling HP CQ with status WORK REQUEST FLUSHED
ERROR status number 5 for wr_id 276070200 opcode 0
[0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from
node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR
status number 10 for wr_id 268919352 opcode 256614836
mpirun noticed that job rank 0 with PID 0 on node node02 exited on
signal 15 (Terminated).
55 additional processes aborted (not shown)
>
>> I then tried the released version of OpenMPI-1.2b3, which got me
>> further than before. Now the benchmark goes through all the tests
>> until Allgatherv finishes, and then it seems to be waiting to start
>> Alltoall; I waited about 12 hours to see if it would continue. I
>> have since managed to run Alltoall, and the rest of the benchmark,
>> separately.
>>
>
> If it does not continue within a few minutes, it's not going to go
> anywhere. IMB does do "warmup" sends that may take a few minutes,
> but if you've gone 5-10 minutes with no activity, it's likely to be
> hung.
>
> FWIW: I can run IMB on 64 processes (16 hosts, 4ppn -- but not a
> blade center) with no problem. I.e., it doesn't hang/crash.
>
> Hanging instead of crashing may still be a side-effect of running out
> of DMA-able memory -- I don't know enough about the IBM hardware to
> say. I doubt that we have explored the error scenarios in OMPI too
> much; it's pretty safe to say that if limits are not used and the
> system runs out of DMA-able memory, Bad / Undefined things may happen
> (a "good" scenario would be that the process/MPI job aborts, a "bad"
> scenario would be some kind of deadlock situation).
>
>
>> I have tried a few tunable parameters that were suggested by Cisco,
>> which improved the results, but the benchmark still hung. The
>> parameters that I used to try and diagnose the problem are below. I
>> used the debug/verbose variables to see if I could get error
>> messages while running the benchmark.
>>
>> #orte_debug=1
>> #btl_openib_verbose=1
>> #mca_verbose=1
>> #btl_base_debug=1
>> btl_openib_flags=1
>> mpi_leave_pinned=1
>> mpool_base_use_mem_hooks=1
>>
>
> Note that in that list, only the btl_openib_flags parameter will
> [potentially] decrease the amount of registered memory used. Also,
> note that mpi_leave_pinned is only useful when utilizing RDMA
> operations; so it's effectively a no-op when btl_openib_flags is set
> to 1.
>
> --> For those jumping into the conversation late, the value of
> btl_openib_flags is a bit mask with the following bits: SEND=1,
> PUT=2, GET=4.
>
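(If I read the bit mask right, the values simply add together -- a few
illustrative command lines, untested:)

# send/receive only, no RDMA at all (what we have been using):
mpirun --mca btl_openib_flags 1 ...
# send + RDMA put, no RDMA get:
mpirun --mca btl_openib_flags 3 ...
# send + RDMA put + RDMA get:
mpirun --mca btl_openib_flags 7 ...
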
> With all that was said above, let me provide a few options for
> decreasing the amount of registered memory that OMPI uses and also
> describe a way to put a strict limit on how much registered memory
> OMPI will use.
>
> I'll create some FAQ entries about these exact topics in the Near
> Future that will go into more detail, but it might take a few days
> because FAQ wording is tricky; the algorithms that OMPI uses and the
> tunable parameters that it exports are quite complicated -- I'll want
> to be sure it's precisely correct for those who land there via Google.
> Here's the quick version (Galen/Gleb/Pasha: please correct me if I
> get these details incorrect -- thanks!):
>
> - All internal-to-OMPI registered buffers -- whether they are used
> for sending or receiving -- are cached on freelists. So if OMPI
> registers an internal buffer, sends from it, and then is done with
> it, the buffer is not de-registered -- it is put back on the free
> list for use in the future.
>
> - OMPI makes IB connections to peer MPI processes lazily. That is,
> the first time you MPI_SEND or MPI_RECV to a peer, OMPI makes the
> connection.
>
> - OMPI creates an initial set of pre-posted buffers when each IB port
> is initialized. The amount registered for each IB endpoint (i.e.,
> ports and LIDs) in use on the host by the MPI process upon MPI_INIT is:
>
> 2 * btl_openib_free_list_inc *
> (btl_openib_max_send_size + btl_openib_eager_limit)
>
> => NOTE: There's some pretty pictures of the exact meanings of the
> max send size and eager limit and how they are used in this paper:
> http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/.
>
> The "2" is because there are actually 2 free lists -- one for sending
> buffers and one for receiving buffers. Default values for these
> three MCA parameters are 32 (free_list_inc), 64k (max_send_size), 12k
> (eager_limit), respectively. So each MPI process will preregister
> about 4.75MB of memory per endpoint in use on the host. Since these
> are all MCA parameters, they are all adjustable at run-time.
>
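(Just to check that I follow the arithmetic: with the defaults that is

2 * 32 * (64k + 12k) = 4864k, i.e. the ~4.75MB per endpoint quoted above,

and with the smaller values suggested further down -- 4k eager limit and
12k max send size -- it would drop to

2 * 32 * (12k + 4k) = 1024k, i.e. about 1MB per endpoint,

assuming btl_openib_free_list_inc stays at its default of 32.)
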
> - OMPI then pre-registers and pre-posts receive buffers when each
> lazy IB connection is made. The buffers are drawn from the freelists
> mentioned above, so the first few connections may not actually
> register any *new* memory. The freelists register more memory and
> dole it out as necessary when requests are made that cannot be
> satisfied by what is already on the freelist.
>
> - The number of pre-posted receive buffers is controlled via the
> btl_openib_rd_num and btl_openib_rd_win MCA parameters. OMPI pre-
> posts btl_openib_rd_num plus a few more (for control messages) --
> resulting in 11 buffers by default per queue pair (OMPI uses 2 QPs,
> one high priority for eager fragments and one low priority for send
> fragments) per endpoint. So there is
>
> 11 * (12k + 64k) = 836k
>
> of buffer space pre-posted for each IB connection endpoint.
>
> => What I'm guessing is happening in your network is that IMB is
> hitting some communication intensive portions and network traffic
> either backs up, starts getting congested, or otherwise becomes
> "slow", meaning that OMPI is queueing up traffic faster than the
> network can process it. Hence, OMPI keeps registering more and more
> memory because there's no more memory available on the freelist to
> recycle.
>
> - The sending buffering behavior is regulated by the
> btl_openib_free_list_max MCA parameter, which defaults to -1 (meaning
> that the free list can grow to infinite size). You can set a cap on
> this, telling OMPI how many entries it is allowed to have on the
> freelist, but that doesn't have a direct correlation as to how much
> memory will actually be registered at any one time when
> btl_openib_flags > 1 (because OMPI will also be registering and
> caching user buffers). Also keep in mind that this MCA parameter
> governs the size of both sending and receiving buffer freelists.
>
> That being said, if you use btl_openib_flags=1, you can use
> btl_openib_free_list_max as a direct method (because OMPI will *not*
> be registering and caching user buffers), but you need to choose a
> value that will be acceptable for both the send and receive freelists.
>
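(If I understand correctly, then, something along these lines would put a
hard cap on the buffer pools while RDMA is off -- the value 128 and the
hostfile/process count are only examples, not recommendations:)

mpirun --mca btl_openib_flags 1 \
       --mca btl_openib_free_list_max 128 \
       -np 64 --hostfile hosts ./IMB-MPI1

# caps each internal freelist at 128 entries; with btl_openib_flags=1 this
# bounds registered memory directly, and the PML buffering described next
# kicks in once the cap is hit
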
> What should happen if OMPI hits the btl_openib_free_list_max limit is
> that the upper layer (called the "PML") will internally buffer
> messages until more IB registered buffers become available. It's not
> entirely accurate, but you can think of it as effectively multiple
> levels of queueing going on here: MPI requests, PML buffers, IB
> registered buffers, network. Fun stuff! :-)
>
> - A future OMPI feature is an MCA parameter called
> mpool_rdma_rcache_size_limit. It defaults to an "unlimited" value,
> which means that OMPI will try to register memory forever. But if
> you set it to a nonzero positive value (in bytes), OMPI will limit
> itself to that much registered memory for each MPI process. This MCA
> parameter unfortunately didn't make it into the 1.2 release, but will
> be included in some future release. This code is currently on the
> OMPI trunk (and nightly snapshots), but not available in the 1.2
> branch (and nightly snapshots/releases).
>
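(Noted -- when we get a chance to try a trunk snapshot, I assume the
setting would look something like the line below, with the value in
bytes; 256MB chosen only to match the older-firmware JS21 limit mentioned
earlier:)

mpirun --mca mpool_rdma_rcache_size_limit 268435456 ...
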
> =====
>
> With all those explanations, here's some recommendations for you:
>
> - Try simply setting the size of the eager limit and max send size to
> smaller values, perhaps 4k for the eager limit and 12k for the max
> send size. This will decrease the amount of registered memory that
> OMPI uses for each connection.
>
> - Try setting btl_openib_free_list_max, perhaps in conjunction with
> btl_openib_flags=1, to allow you to control, indirectly or exactly,
> how much registered memory is used per endpoint.
>
> - If you want to explore the OMPI trunk (with all the normal
> disclaimers about development code), try setting
> mpool_rdma_rcache_size_limit to a fixed value.
>
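(Thanks -- to make sure I turn those recommendations into the right knobs,
I am planning to put something like the following into
$HOME/.openmpi/mca-params.conf on the nodes. The eager/max-send sizes
follow the suggestion above; the free_list_max value is just a number to
start experimenting with:)

# smaller per-buffer sizes: 4k eager limit, 12k max send size
btl_openib_eager_limit = 4096
btl_openib_max_send_size = 12288
# no RDMA, so registered memory comes only from OMPI's own freelists
btl_openib_flags = 1
# cap the freelists; to be tuned against performance
btl_openib_free_list_max = 256

(The same parameters can of course be passed as --mca arguments on the
mpirun command line instead.)
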
> Keep in mind that the intermixing of all of these values is quite
> complicated. It's a very, very thin line to walk to balance resource
> constraints and application performance. Tweaking one parameter may
> give you good resource limits but hose your overall performance.
> Another dimension here is that different applications will likely use
> different communication patterns, so different sets of values may be
> suitable for different applications. It's a complicated parameter
> space problem. :-\
>
>
>> 2. On another side note, I am having similar problems on another
>> customer's cluster, where the benchmark hangs but at a different
>> place each time.
>>
>> HW specs
>> * 12x IBM 3455 machines (2x dual-core each), with InfiniPath/PathScale HCAs
>> * 1x Voltaire Switch
>> SW
>> * master: RHEL 4 AS U3
>> * compute: RHEL 4 WS U3
>> * OFED 1.1.1 w. OpenMPI-1.1.2
>>
>
> For InfiniPath HCAs, you should probably be using the psm MTL instead
> of the openib BTL.
>
> The short version of the difference between the two is that MTL
> plugins are designed for networks that export MPI-like interfaces
> (e.g., portals, tports, MX, InfiniPath). BTL plugins are more geared
> towards
> networks that export RDMA interfaces. You can force using the psm
> MTL with:
>
> mpirun --mca pml cm ...
>
> This tells OMPI to use the "cm" PML plugin (PML is the back end to
> MPI point-to-point), which, if you've built the "psm" MTL plugin (psm
> is the InfiniPath library glue), will use the InfiniPath native back-
> end library which will do nice things. Beyond that, someone else
> will have to answer -- I have no experience with the psm MTL...
>
> Hope this helps!
>
>
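(Thanks for the pointer on the InfiniPath cluster -- for the record, I
will try forcing the cm PML there with something like the below; the
hostfile and process count are placeholders, and the grep is just a quick
way to check that the psm MTL component was actually built:)

ompi_info | grep -i psm
mpirun --mca pml cm -np 48 --hostfile hosts ./IMB-MPI1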