Memory eXpansion Technology Krishan Swarup Gupta Rabie A. Ramadan Supervised By: Prof. El-Rewini
Agenda • Introduction (Krishan) • Motivation (Krishan) • A Breakthrough (Krishan) • Requirements (Krishan) • Terminology (Rabie) • Architecture (Rabie) • Shared Cache Subsystem Requirements (Krishan) • C-RAM Architecture (Krishan) • Compression Technique (Rabie) • Main Memory Subsystem (Rabie) • Operating System Software (Rabie) • Performance (Rabie)
Introduction • “Adding memory is often the most effective way to improve system performance, but it's a costly proposition.” (Mark Dean, IBM Fellow and Vice President of Systems Research)
Introduction • MXT is a hardware technology for compressing main-memory contents. • MXT doubles the effective size of main memory: 512 MB of installed memory appears as 1 GB. • This is done entirely in hardware and is transparent to the CPUs, I/O devices, peripherals, and all software, including applications, device drivers, and the kernel, with the exception of fewer than a hundred lines of code added to the base kernel.
Motivation • Memory seems cheap, but it is not, especially when the system uses 512 MB or more. • Why bother with MXT to double the size of memory? • Simple: MXT saves money, and a lot of it. • Try the online price configurators of Compaq, IBM, Dell, etc. to see what doubling the system memory costs.
A Breakthrough • Large technology installations can save millions of dollars. • The savings can be significant for both small and large customers, as memory accounts for 40-70 percent of the total cost of most NT-based server configurations. • MXT is a hardware implementation that automatically stores frequently accessed data and instructions close to a computer's microprocessors so they can be accessed immediately. • MXT incorporates a new level of cache, located on a memory controller chip, that is designed to handle data and instructions efficiently. • It is real: the IBM eServer x330 with MXT was released on 11 Feb 2002.
Requirements • Very fast compression/decompression hardware is required, permitting operation at main-memory bandwidth. • Because the logical total main-memory size may vary dynamically with compression, the operating system's memory management must be changed. • A way must be found to efficiently store and access the variable-length objects produced by compression.
Terminology: ASCI Machine • ASCI is the US Department of Energy's Accelerated Strategic Computing Initiative, a collaboration between three US national defense laboratories. • It aims to give researchers a five-order-of-magnitude increase in computing performance over current technology. • MIMD, distributed-memory, message-passing supercomputer. • The architecture is scalable in communication bandwidth, main memory, internal disk storage capacity, and I/O.
Terminology: RAM • Conventional DRAM • Synchronous DRAM (SDRAM) • DDR SDRAM • SIMM • DIMM • Interleaving
MXT Architecture • A collection of processors is connected to a common SDRAM-based main memory through a memory controller chip. • MXT uses a two-level memory architecture consisting of a large shared cache coupled with a typical main memory. • Three ways to manage the main memory M: • Organize M as a linear space in which variable-length intervals are allocated and deallocated. • Organize M as a collection of blocks, possibly of multiple sizes, where the space for a variable-length object is allocated as an integral number of such blocks. • Organize M as a collection of blocks, but permit a variable amount of space to be allocated within a block.
Cyclic Redundancy Code (CRC) • A number derived from a data block. • A CRC is more complicated than a simple checksum. • It is calculated by division, implemented with shifts and exclusive ORs. • The divisor is a generator polynomial. • CRCs treat blocks of input bits as coefficient sets for polynomials. • Example: 10100000 represents 1·x^7 + 0·x^6 + 1·x^5 + 0·x^4 + 0·x^3 + 0·x^2 + 0·x^1 + 0·x^0. • The remainder of the division is the checksum (the CRC value). • For more information, see http://www.4d.com/docs/CMU/CMU79909.HTM
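The shift-and-XOR division described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `crc_remainder`, the 8-bit generator x^8 + x^2 + x + 1 (0x07), and the zero initial register are assumptions chosen for the example, not the CRC that MXT uses.

```python
def crc_remainder(data: bytes, poly: int, width: int) -> int:
    """Remainder of data(x) * x^width divided by the generator, over GF(2).

    `poly` holds the generator's coefficients without its leading x^width term.
    """
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):                   # most significant bit first
            bit = (byte >> i) & 1
            feedback = ((reg >> (width - 1)) & 1) ^ bit
            reg = (reg << 1) & mask                  # the "shift" step
            if feedback:
                reg ^= poly                          # the "exclusive OR" step
    return reg


# The slide's example block 10100000 (0xA0), divided by the assumed generator.
example = crc_remainder(bytes([0b10100000]), poly=0x07, width=8)
print(f"CRC of 10100000: {example:08b}")
```

A useful property for checking such a routine: appending the remainder to the message and dividing again leaves a remainder of zero.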
[Figure: three processors, each with its own L1 and L2 cache, above a shared L3 cache; compression/decompression logic sits between the L3 cache and the compressed main memory.]
Shared Cache Subsystem • The shared L3 cache gives the processors and the I/O subsystem low-latency access to frequently accessed uncompressed data. • The cache is partitioned into cache lines, each an associative storage unit equal in size to the 1 KB uncompressed data block. • A cache directory keeps track of the real-memory tag addresses that correspond to the cached addresses that can be stored within each line.
Shared Cache Subsystem • Three primary architectures: • The independent cache array scheme • A large independent data-cache memory is implemented with low-cost double-data-rate (DDR) SDRAM outside the memory controller chip, while the associated cache directory is implemented on the chip. • The cache size is limited primarily by the size of the cache directory. • The cache interface can be optimized for the lowest-latency access by the processor.
Shared Cache Subsystem • The compressed main memory partition scheme • The cache controller and the memory controller share the same storage array through the same physical interface. • Data is shuttled back and forth between the compressed main-memory region and the uncompressed cache through the compression hardware during cache-line replacement. • The cache size can be readily optimized for a specific system application. • Drawback: the latency-sensitive cache controller contends for the main-memory physical interface.
Shared Cache Subsystem • The distributed cache scheme • The cache is distributed throughout the compressed memory as a number of uncompressed lines. Only the n most recently used lines make up the cache. • Data is shuttled in and out of the compressed memory, changing its compressed state as it passes through the compression logic during cache-line replacement. • The effective cache size can be optimized dynamically during system operation simply by changing the maximum number of uncompressed lines. • Drawback: contention for the main-memory physical interface. • Drawback: greater average latency for cache directory references.
C-RAM Architecture • Logically, the memory M consists of a collection of randomly accessible fixed-size lines, where L is the line size. • Internally, the ith line is stored in a compressed format as L(i) bytes, where L(i) <= L, and where L(i) may change on each cache cast-out of this line.
C-RAM Architecture • M comprises a standard random-access memory with a minimum access size (granule) of g bytes. We will generally assume that g is 32. • Memory accesses invoke a translation between a logical line address and an internal address. This correspondence is stored in a directory D contained in M. • Translation, fetching, and memory management within the C-RAM are carried out by a memory controller rather than by operating system (OS) software.
[Figure: L3 and C-RAM organization. The L3 directory and L3 cache lines sit above the compressor (store/write path) and decompressor (miss/read path); directory entries at addresses A1 to A4 point to the blocks in M that hold Line 1 to Line 4.]
C-RAM Architecture • Each directory entry contains: • Flags. • Fragment-combining information. • Pointers for up to four blocks. • On an L3 cache miss, the memory controller and decompression hardware find the blocks allocated to store the compressed line and dynamically decompress the line to handle the miss.
C-RAM Architecture • When a new or modified line is stored, the blocks currently allocated to the line are freed, and the line is then compressed and stored in the C-RAM by allocating the required number of blocks.
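The store and miss paths described on the last two slides can be modelled with a small software sketch. Everything here is an assumption made for illustration: the class name `CRam`, the use of `zlib` as a stand-in for the hardware compressor, 256-byte blocks (the size used in the deck's fragmentation example), and a Python dictionary in place of the in-memory directory D. Flags and fragment combining are left out.

```python
import zlib

LINE_SIZE = 1024     # uncompressed line size, as in the slides
BLOCK_SIZE = 256     # assumed block size; gives at most four blocks per line


class CRam:
    """Toy model of a compressed RAM: a directory plus a pool of fixed-size blocks."""

    def __init__(self, num_blocks: int):
        self.blocks = [b""] * num_blocks
        self.free = list(range(num_blocks))
        self.directory = {}   # line index -> (is_compressed, length, block ids)

    def store(self, line: int, data: bytes) -> None:
        assert len(data) == LINE_SIZE
        # Free the blocks currently allocated to this line.
        old = self.directory.pop(line, None)
        if old is not None:
            self.free.extend(old[2])
        # Compress; fall back to the raw line if compression does not help.
        payload = zlib.compress(data)
        compressed = len(payload) < LINE_SIZE
        if not compressed:
            payload = data
        needed = -(-len(payload) // BLOCK_SIZE)       # ceiling division
        if needed > len(self.free):
            raise MemoryError("C-RAM is out of free blocks")
        ids = [self.free.pop() for _ in range(needed)]
        for n, block_id in enumerate(ids):
            self.blocks[block_id] = payload[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE]
        self.directory[line] = (compressed, len(payload), ids)

    def fetch(self, line: int) -> bytes:
        """Handle a miss: follow the directory entry, gather blocks, decompress."""
        compressed, length, ids = self.directory[line]
        payload = b"".join(self.blocks[i] for i in ids)[:length]
        return zlib.decompress(payload) if compressed else payload


cram = CRam(num_blocks=8)
cram.store(0, bytes(LINE_SIZE))            # an all-zero line compresses very well
assert cram.fetch(0) == bytes(LINE_SIZE)
```

In the real design this work is done by the memory controller, not by software; the sketch only shows the order of the steps.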
C-RAM Architecture • Example • The page size is 4 KB. The L3 cache immediately above the C-RAM has a line size of 1 KB. Each line compresses to 1, 2, 3, 4, ..., or 1024 bytes with equal likelihood. • The expected compressed line size would be 512.5 bytes, which is about 50% of the original 1 KB line. • The problem: the block size is 256 bytes, so the last block of a line is usually only partly filled. • FRAGMENTATION: the “left-over” space in the block.
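The arithmetic behind this example can be checked directly. The sketch below assumes, as the slide does, that every compressed size from 1 to 1024 bytes is equally likely and that storage is handed out in whole 256-byte blocks; the variable names are made up for the illustration.

```python
import math

LINE_SIZE = 1024    # bytes per uncompressed line
BLOCK_SIZE = 256    # bytes per block

sizes = range(1, LINE_SIZE + 1)          # every compressed size is equally likely

# Average compressed size if space could be allocated byte by byte.
mean_compressed = sum(sizes) / LINE_SIZE

# Average space actually consumed when each line takes whole blocks.
mean_stored = sum(math.ceil(s / BLOCK_SIZE) * BLOCK_SIZE for s in sizes) / LINE_SIZE

# Average left-over space in the last block of a line.
mean_fragment = mean_stored - mean_compressed

print(f"expected compressed size: {mean_compressed} bytes "
      f"({mean_compressed / LINE_SIZE:.2%} of a line)")
print(f"expected stored size:     {mean_stored} bytes "
      f"({mean_stored / LINE_SIZE:.2%} of a line)")
print(f"expected fragmentation:   {mean_fragment} bytes per line")
```

Under these assumptions a line occupies 2.5 blocks on average, so whole-block allocation raises the effective footprint from roughly 50% to 62.5% of the uncompressed line.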
C-RAM Architecture • Approaches to the fragmentation problem: • Make the block size smaller. The drawback is that the size of a directory entry increases dramatically. • Combine two or more fragments, that is, the “left-over” pieces in the last blocks used to store compressed lines, into single blocks. • The set of lines for which fragment combining is allowed is called a “cohort”.
C-RAM Architecture • Cohort size: to keep a small upper bound on the time required for directory scans, cohorts should ideally be small. • There are two ways to determine cohorts. • Partitioned cohort: • Lines are divided into a number of disjoint sets, where each such set is a cohort. For example, with a cohort of size 2, the first two 1 KB lines in each 4 KB page could form one cohort and the last two lines another cohort.
C-RAM Architecture • Sliding cohort: • Cohorts are not disjoint but overlap. For example, with a cohort of size 4, the cohort corresponding to any given line could consist of that line and the previous three lines, and similarly for other cohort sizes. This gives less fragmentation than partitioned cohorts.
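The difference between the two cohort definitions is easy to see when the cohort of a line is written as a function of its line index. The function names and the use of plain integer line indices are assumptions for this sketch.

```python
def partitioned_cohort(line: int, size: int) -> list[int]:
    """Disjoint cohorts: lines are grouped into consecutive runs of `size` lines."""
    start = (line // size) * size
    return list(range(start, start + size))


def sliding_cohort(line: int, size: int) -> list[int]:
    """Overlapping cohorts: the line itself plus the previous `size - 1` lines."""
    return list(range(max(0, line - size + 1), line + 1))


# Lines 0..3 belong to one 4 KB page of 1 KB lines.
print(partitioned_cohort(2, 2))   # lines sharing a size-2 partition with line 2
print(sliding_cohort(5, 4))       # line 5 and the three lines before it
```

With partitioned cohorts a line can only combine fragments with the fixed members of its set, while a sliding cohort gives every line its own window of neighbours, which is one way to read the slide's claim of lower fragmentation.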
C-RAM Architecture • The method by which fragments are combined depends on: • The number of fragments that can be combined into a block. • 2-way combining (2 fragments per block) • 3-way combining (3 fragments per block) • Which fragment (or fragments) to choose. • First fit, best fit • Fragment contention, optimal fit
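As a rough sketch of the first two selection policies, suppose the controller scans the cohort and sees how many bytes are still free in each partly filled block. The list representation and function names below are assumptions for illustration; fragment contention and optimal fit are not modelled.

```python
from typing import Optional, Sequence


def first_fit(free_space: Sequence[int], fragment: int) -> Optional[int]:
    """Index of the first partly filled block with room for the fragment."""
    for i, free in enumerate(free_space):
        if free >= fragment:
            return i
    return None


def best_fit(free_space: Sequence[int], fragment: int) -> Optional[int]:
    """Index of the block that leaves the least space unused after placement."""
    best = None
    for i, free in enumerate(free_space):
        if free >= fragment and (best is None or free < free_space[best]):
            best = i
    return best


# Free bytes left in the last blocks of three other lines in the cohort.
free_space = [200, 60, 100]
print(first_fit(free_space, 50), best_fit(free_space, 50))
```

Both functions return `None` when no block in the cohort has enough room, in which case the fragment would need a block of its own.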
C-RAM Architecture • Design of the directory structure: • Static directory: • It is configured with the number of entries required to support a maximum compression factor of F. That is, if the C-RAM has a capacity of N uncompressed lines, the directory contains entries for FN lines. • A possible problem with this design is that the maximum compression is limited to a predetermined value.
C-RAM Architecture • Dynamic directory: • With a dynamic directory structure, directory entries are created (deleted) whenever real addresses are allocated (deallocated). In this case, free main-memory blocks could be allocated (deallocated) and used for the directory entries of one or more pages whenever the pages are created (deleted).
MXT Main Memory Subsystem
LZ77 Compression Technique • The LZ77 output is a series of byte values interspersed with (index, length) pairs. Each byte value is written to the output as is. Each (index, length) pair is written as a pair of integers (index first, then length), each with 256 added to its value. This allows index and length values to be distinguished from byte values. • [Figure: LZ77 in operation]
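The output format described above can be illustrated with a small encoder and decoder. Only the add-256 convention comes from the slide. The greedy longest-match search, the 255-byte window, the three-byte minimum match, and the reading of "index" as a backward distance are assumptions made for this sketch, which is not the hardware algorithm.

```python
def lz77_encode(data: bytes, window: int = 255, min_match: int = 3) -> list[int]:
    """Emit literal bytes (0..255) and (index + 256, length + 256) pairs."""
    out = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - start
        if best_len >= min_match:
            out += [best_dist + 256, best_len + 256]   # index first, then length
            pos += best_len
        else:
            out.append(data[pos])                      # literal byte, written as is
            pos += 1
    return out


def lz77_decode(tokens: list[int]) -> bytes:
    out = bytearray()
    it = iter(tokens)
    for value in it:
        if value < 256:                # below 256: a literal byte
            out.append(value)
        else:                          # 256 or more: start of an (index, length) pair
            dist = value - 256
            length = next(it) - 256
            for _ in range(length):
                out.append(out[-dist])  # byte by byte, so overlapping copies work
    return bytes(out)


tokens = lz77_encode(b"abcabcabc")
assert lz77_decode(tokens) == b"abcabcabc"
```

Because every pair value is at least 256, the decoder can tell a pair from a literal by looking at a single integer, which is the point the slide makes.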
IBM Implementation of the Compression Technique • Divide the data into n partitions. • Use one compression engine for each partition. • The engines share a dictionary. • Typically: • 4 compression engines. • 256 B per engine (a quarter of the 1 KB uncompressed data). • 1 B/cycle per engine, 4 B/cycle in total, or 2 B/cycle per engine, 8 B/cycle in total, when double clocked.
Uncompressed Memory • Unescorted region is used by SST for additional and future needs.
Main Memory Subsystem • Comprises SDRAM dual in-line memory modules (DIMMs). • The controller supports two separate DIMMs. • Can be configured to operate with compression disabled, enabled for specific address ranges, or completely enabled. • Sector Translation Table (SST) • Sectored Memory
Main Memory Subsystem (cont.) • Compressed data • A 1 KB block that compresses to 120 bits or fewer is stored in the SST entry itself. • A 1 KB block that compresses to more than 120 bits is stored in sectored memory, and the SST entry holds the pointer to the sector. • Uncompressed data • Accessed directly, without an SST reference.
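The placement rule on this slide amounts to a single threshold test. The sketch below assumes 256-byte sectors, a size borrowed from the block size in the C-RAM example since the slide does not state one, and the function name `sst_placement` is hypothetical.

```python
INLINE_LIMIT_BITS = 120    # threshold from the slide
SECTOR_SIZE_BYTES = 256    # assumed sector size, not stated on the slide


def sst_placement(compressed_bits: int) -> tuple[str, int]:
    """Decide where a compressed 1 KB block lives and how many sectors it needs."""
    if compressed_bits <= INLINE_LIMIT_BITS:
        return ("inline", 0)             # fits inside the SST entry itself
    sectors = -(-compressed_bits // (SECTOR_SIZE_BYTES * 8))   # ceiling division
    return ("sectors", sectors)          # SST entry points into sectored memory


print(sst_placement(96))      # a highly compressible block
print(sst_placement(3000))    # a block that needs sectored memory
```

Blocks that land in the inline case cost no sectored memory at all, which is why highly compressible data such as zero-filled pages is so cheap to hold.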
Reliability-Availability-Serviceability (RAS) • Sector translation table entry parity checking. • Sector free-list parity checking. • Sector out-of-range checking. • Sectored memory-overrun detection. • Sectors-used threshold detection (2). • Compressor/decompressor validity checking. • Compressed-memory CRC protection.
Commodity Duplex Memory • A fault-tolerance technique not seen before.
Operating System Software • The OS cannot distinguish between an MXT and a conventional memory hardware environment. • When memory is overutilized, the system fails. • Unsectored memory • Needs paging management. • In UNIX, this requires changes to the OS kernel. • In Windows, the code is not public, so external driver software is needed.