Cell is a microprocessor architecture jointly developed by a Sony, Toshiba, and IBM alliance known as STI. The architectural design and first implementation were carried out at the STI Design Center over a four-year period beginning March 2001 on a budget reported by IBM as approaching *]400 million.
Cell is a shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose POWER-architecture core of modest performance with streamlined coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.
The major commercial application of Cell is in Sony's upcoming PlayStation 3 game console which is slated to launch in November 2006. It will also become available in a blade configuration from Mercury Computer Systems. Toshiba has announced plans to incorporate Cell in high definition television sets. Exotic features such as the XDR memory subsystem and coherent EIB interconnect appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point kernels.
The Cell architecture breaks ground in combining a light-weight general-purpose processor with multiple GPU-like coprocessors into a coordinated whole, a feat which involves a novel memory coherence architecture for which IBM received many patents.
The resulting architecture emphasizes efficiency/watt, prioritizes bandwidth over latency, and favors peak computational throughput over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development. IBM provides a comprehensive Linux-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential.
In 2000, Sony Computer Entertainment, Toshiba Corporation, and IBM formed an alliance ("STI") to design and manufacture the processor.
The STI Design Center in Austin, Texas opened in March 2001. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4 processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.
During this period, IBM filed many patents pertaining to the Cell architecture, manufacturing process, and software environment.
On May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the forthcoming PlayStation 3 console.
This Cell configuration will have one POWER processing element (PPE) on the core, with eight physical SPEs in silicon: one SPE is locked-out during the manufacturing test process—a practice which helps to improve yields—leaving seven SPEs operational in PS3 software. Sony has stated that one functional SPE will be reserved by the PS3 for the operating system and security functions, leaving six available for application code.
The target clock-frequency at introduction is 3.2 GHz. The introductory design is fabricated using a 90-nanometre SOI process, with initial volume production slated for IBM's facility in East Fishkill, New York.
On June 28 2005, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications. In this Cell configuration all eight SPEs are functional.
In a simple analysis the Cell processor can be split into four components: external input and ouput structures, the main processor called the Power Processing Element (PPE) (a two-way SMT multithreaded Power 970 architecture compliant core), eight fully-functional co-processors called the Synergystic Processing Elements or SPEs and a specialised high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs, called the Element Interconnect Bus or EIB.
To achieve the high performance needed for mathematically intensive tasks such as decoding/encoding MPEG streams, generating or transforming three dimensional data or undertaking Fourier analysis of data the Cell processor simply marries the SPEs and the PPE via the EIB to give both access to main memory or other external data storage. The PPE which is capable of running a conventional operating system has control over the SPEs and can start, stop, interrupt and schedule processes running on the SPEs. To this end the PPE has additional instructions relating to control of the SPEs. Despite having Turing complete architectures the SPEs are not fully autonomous and require the PPE to initiate them before they can do any useful work. Most of the "horsepower" of the system comes from the synergistic processing elements.
The PPE and bus architecture includes various modes of operation giving different levels of protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or PPE.
Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit VMX register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8-bits to 128-bits in size or for SIMD computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values for a theoretic address range of 264 bytes. In practice, not all of these bits are implemented in hardware; the address space is extremely large nevertheless. Local store addresses internal to the SPU processor are expressed as a 32-bit word. In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
Modern graphics cards have multiple elements very similar to the SPE's, known as vertex shader units, with an attached high speed memory. Programs, known as shaders, are loaded onto the units to process the basic geometry fed from the computer's CPU, apply styles and display it.
The main differences are that the Cell's SPEs are much more general purpose than shader units, and the ability to chain the SPEs under program control offers considerably more flexibility, allowing the Cell to handle graphics, sound, or anything else.
While the Cell chip can have a number of different configurations, the basic configuration is composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE") . The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB"). Due to the nature of its applications, Cell is optimized towards single precision floating point computation. The SPEs are capable of performing double precision calculations, albeit with an order of magnitude performance penalty. More general purpose computing tasks can be done on the PPE.
In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance. The PPE's VMX128 (AltiVec) unit is fully pipelined for double precision floating point and can complete two double precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz.
Compared to a modern personal computer, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4 and the Athlon 64. But, comparing only floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. Also, Cell is optimized for single-precision calculations; for double-precision, as used in personal computers, Cell performance drops by an order of magnitude to levels similar to desktops.
Recent tests by IBM * show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel Matrix Multiplication.
The EIB is presently implemented as a circular ring comprised of four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints which are discussed in the Bandwidth Assessment section below.
IBM Senior Engineer David Krolak, EIB lead designer, explains the concurrency model:
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity often regarded 8B per system clock). Note that each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explains:
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.
At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth blithely scaled by processor frequency.
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined and the two IO controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered the theoretic 204.8 GB/s number most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz so this number is a fair reflection on practice as well.
The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 "lanes," each lane being a unidirectional 8-bit wide point-to-point path. Five 8-bit wide point-to-point path are inbound lanes to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typ. at 3.2 GHz. 4 inbound + 4 outbound lanes are supporting memory coherency.
IBM's H-series Blade servers will incorporate the cell processor as of March 2006.
Mercury Computer Systems, Inc. has released preproduction blades with cell microprocessors that are currently shipping.
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources:
This actually provides a very flexible, yet powerful architecture for stream processing, allowing to explicitly schedule each SPE separately. Other processors are also able to perform this kind of processing but this comes often with limitations on the possible kernels to be loaded.
Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries. According to Sony, a compiler, debugger, IDE, performance analyzer, and Cell emulator should be made available soon. IBM has developed a pseudo-filesystem for Linux coined "Spufs" that simplifies access to and use of the SPE resources.
IBM is currently maintaining the Linux kernel and GDB ports, while Sony maintains the GNU toolchain (GCC, binutils). .
In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0, consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core 4 are maintained at the Barcelona Supercomputing Center website*.
With the release of kernel version 2.6.16 on 20 March 2006, the Linux kernel officially supports the Cell processor.
Microprocessors | PowerPC microprocessors | Cell BE architecture
Cell | Cell | Cell | Cell (processeur) | Cell (processore) | Cell | Cell | Cell | Cell (processor) | Cell
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Cell microprocessor".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world