Stencil algorithms are commonly used in HPC applications. A stencil algorithm updates each element of a multidimensional grid using a fixed pattern (the stencil) of surrounding values. Stencils are used in fluid mechanics, weather prediction, and seismic imaging to solve a variety of partial differential equations (PDEs).
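To make the idea concrete, here is a minimal sketch of a stencil sweep in plain Python. It uses a simple 2D 5-point averaging stencil, which is an illustrative choice and not the stencil studied in this article; the function name `five_point_sweep` is hypothetical.

```python
def five_point_sweep(grid):
    """Return a new grid where each interior cell is the mean of its four neighbors."""
    rows, cols = len(grid), len(grid[0])
    # Copy the grid so boundary cells keep their values and reads use old data only.
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


# Illustrative input: a cold cell surrounded by a warm boundary.
example = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
swept = five_point_sweep(example)
```

Every interior cell is updated with the same fixed pattern of neighbor reads, which is the defining property of a stencil; real PDE solvers differ mainly in the shape of the pattern and the weights.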
For high-order stencils like the one described later in this article, the calculation reads many values from memory but uses each of them for only a few arithmetic operations. Each update requires a large amount of input data, while little time is spent computing the result. The nearest-neighbor access pattern of stencil algorithms also maps poorly onto DRAM, which performs best when reading large contiguous memory regions. Hierarchical memory architectures are generally recognized to be a poor fit for these problems. Such architectures are better suited to compute-intensive applications, where each data element read feeds several floating-point calculations.
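The imbalance between data movement and arithmetic can be estimated with a back-of-the-envelope calculation. The sketch below is a simplified model, not a measurement: it assumes one multiply per stencil point, single-precision values (4 bytes), and two extreme cases for how often a value must be fetched from memory. The function name `stencil_intensity` and its parameters are hypothetical.

```python
def stencil_intensity(points, bytes_per_value=4, reuse=False):
    """Estimate floating-point operations per byte moved for one grid-point update.

    reuse=False: every stencil input is fetched from memory (worst case).
    reuse=True:  each value is fetched once and reused from fast local storage (best case).
    """
    flops = 2 * points - 1               # one multiply per point, points - 1 additions
    reads = 1 if reuse else points       # values fetched from memory per update
    bytes_moved = (reads + 1) * bytes_per_value  # the reads plus one write
    return flops / bytes_moved


# The 25-point stencil discussed later in this article, under both assumptions.
worst_case = stencil_intensity(25)
best_case = stencil_intensity(25, reuse=True)
```

Under these assumptions the worst case performs less than one floating-point operation per byte moved, which is why memory traffic rather than arithmetic tends to set the pace.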
Stencil methods are memory-bound rather than compute-bound; their performance is therefore limited by the speed of data transfer between the processor and memory. Memory-bound applications often suffer from the following:
Increasing the processing unit's clock speed does not help. Scaling difficulties also arise when more processing power is added. The performance of memory-bound applications depends heavily on the data transfer rate between processing units. To boost processing power, it is standard practice to link many devices via an interconnect. When data moves from one device to another, that link is slower than the bandwidth available inside a device, which introduces delays.
The stencil algorithm was implemented on the Cerebras CS-2 system, whose Wafer-Scale Engine (WSE) packs 850,000 cores onto a single piece of silicon. The implementation was written in the Cerebras Software Language (CSL), a component of the Cerebras Software Development Kit. The alternative taken here is to accept the data transfer requirements and design the algorithm to exploit the bandwidth of Cerebras' hardware. Pairing the WSE's enormous memory bandwidth with very efficient neighbor-to-neighbor communication and a careful algorithm implementation produces striking results.
TotalEnergies developed the test problem (Minimod) as a public benchmark for evaluating new hardware solutions. The study solves the isotropic acoustic kernel in a domain with constant density. The equation is discretized with finite differences (FD) using a 25-point stencil. In the discretized space, each update uses the point itself plus four neighbors on each side along each of the three dimensions (1 + 3 × 2 × 4 = 25 points).
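The shape of the 25-point stencil can be sketched in a few lines of Python. The weights below are the standard 8th-order central-difference coefficients for a second derivative; the source does not list the coefficients used in Minimod, so treat them as an assumption. The names `stencil_offsets` and `laplacian_at` are hypothetical.

```python
# Standard 8th-order central-difference weights for a second derivative.
# ASSUMPTION: the article does not state which coefficients the benchmark uses.
C = [-205 / 72, 8 / 5, -1 / 5, 8 / 315, -1 / 560]


def stencil_offsets(radius=4):
    """Return the (dx, dy, dz) offsets of a star-shaped stencil of the given radius."""
    offsets = [(0, 0, 0)]
    for axis in range(3):
        for k in range(1, radius + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * k
                offsets.append(tuple(off))
    return offsets


def laplacian_at(u, x, y, z, h=1.0):
    """Approximate the Laplacian of the nested-list grid u at interior point (x, y, z)."""
    total = 3 * C[0] * u[x][y][z]  # the center point contributes once per axis
    for k in range(1, 5):
        total += C[k] * (
            u[x + k][y][z] + u[x - k][y][z]
            + u[x][y + k][z] + u[x][y - k][z]
            + u[x][y][z + k] + u[x][y][z - k]
        )
    return total / (h * h)


# Check on u = x^2 + y^2 + z^2, whose exact Laplacian is 6 everywhere.
n = 9
grid = [[[float(x * x + y * y + z * z) for z in range(n)] for y in range(n)] for x in range(n)]
center_value = laplacian_at(grid, 4, 4, 4)
```

Each update touches 25 grid values, 24 of which lie up to four cells away from the center. This reach is what drives the communication pattern between neighboring processing elements.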
The CS-2 is powered by the Wafer-Scale Engine (WSE), whose memory is on-wafer and completely distributed. The approach is built on custom-designed localized broadcast patterns, which allow data to be sent, received, and computed on simultaneously by the device's hardware. As a result, data can be transferred quickly between neighboring Processing Elements (PEs). The 3D domain is mapped onto the two-dimensional array of PEs by collapsing the third dimension.
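One way to picture this mapping is sketched below in Python. It assumes that each PE at fabric position (x, y) holds the whole column of Z values for that position in its local memory, and that the stencil reaches four cells along X and Y; the article does not spell out these details, so both are assumptions. This is an illustration of the data layout only, not CSL code, and the names `owner` and `neighbor_pes` are hypothetical.

```python
def owner(x, y, z):
    """Map a grid cell to (PE coordinates, index within that PE's local Z column).

    ASSUMPTION: X and Y map to the fabric, Z lives in each PE's local memory.
    """
    return (x, y), z


def neighbor_pes(px, py, nx, ny, radius=4):
    """List the PEs whose data PE (px, py) needs for the X and Y arms of the stencil."""
    pes = []
    for k in range(1, radius + 1):
        for dx, dy in ((k, 0), (-k, 0), (0, k), (0, -k)):
            qx, qy = px + dx, py + dy
            # PEs outside the fabric do not exist; boundary handling is omitted here.
            if 0 <= qx < nx and 0 <= qy < ny:
                pes.append((qx, qy))
    return pes


# Illustrative 32 x 32 fabric region (the size is arbitrary).
interior = neighbor_pes(10, 10, 32, 32)
corner = neighbor_pes(0, 0, 32, 32)
```

Under this layout the Z arm of the stencil needs no communication at all, while the X and Y arms only involve PEs a few hops away, which is the kind of traffic that localized broadcasts serve well.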
The comparison is between a single CS-2 and a single A100, and the paper does not consider host communication. The A100s in the TotalEnergies cluster each have 40 GB of on-device memory. The number of processing elements used grows in direct proportion to the problem size. Each experiment runs for 1,000 iterations.
At the largest problem size, the WSE-2 outperforms the A100 by a factor of more than 220. The WSE-2 takes nearly the same amount of time regardless of problem size, which indicates that the implementation is compute-bound. Its scaling efficiency of more than 98 percent at all scales is close to ideal. Both results are remarkable by HPC standards.
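When the number of processing elements grows with the problem size, scaling efficiency is usually computed as the ratio of the reference run time to the run time at the larger scale. The Python sketch below shows that calculation; the timings are hypothetical placeholders, not the measurements from the paper, and the function name is an assumption.

```python
def weak_scaling_efficiency(t_reference, t_scaled):
    """Return weak-scaling efficiency: 1.0 means run time did not grow with scale."""
    if t_reference <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_reference / t_scaled


# HYPOTHETICAL wall-clock times in seconds, for illustration only.
timings = {"small": 1.000, "medium": 1.005, "large": 1.015}
efficiencies = {
    name: weak_scaling_efficiency(timings["small"], t) for name, t in timings.items()
}
```

An efficiency close to 1.0 across all sizes corresponds to the flat run time described above.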
A look at the roofline model confirms that the implementation is compute-bound. A throughput of 503 TFLOPs on a single system is an outstanding result for the WSE-2. These findings open up many more options for HPC applications on the WSE-2. Because the WSE-2 can handle such workloads, there is considerable interest in more complex applications, including stencil-based and hybrid ML ones.
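For readers unfamiliar with the roofline model, the following Python sketch shows its core rule: attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The machine parameters are hypothetical round numbers chosen for illustration and do not describe the WSE-2 or the A100.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity):
    """Roofline bound on performance for a kernel with the given flops-per-byte intensity."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity)


def is_compute_bound(peak_flops, bandwidth_bytes_per_s, intensity):
    """A kernel is compute-bound when the bandwidth roof lies at or above the compute roof."""
    return bandwidth_bytes_per_s * intensity >= peak_flops


# HYPOTHETICAL machine: 100 GFLOP/s peak and 50 GB/s bandwidth (ridge point at 2 flops/byte).
PEAK = 100e9
BANDWIDTH = 50e9

low = attainable_flops(PEAK, BANDWIDTH, 0.5)   # limited by memory bandwidth
high = attainable_flops(PEAK, BANDWIDTH, 4.0)  # limited by peak compute
```

A kernel that lands to the right of the ridge point, as the second case does, can only be sped up by more arithmetic throughput, which is what compute-bound means in this context.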