What is the definition of fast, for code? Fast compared to what? Modern C++ compilers, when used carefully, can take code that solves a given problem and produce something that is pretty fast when compared to other compilers. Rust appears to be catching up in terms of speed and, in some cases, can perform better. But, what can be gained if we constrain the code we write, or focus on specific classes of problems?

We developers almost always take two facts for granted when thinking about programming languages. First, we assume a Turing-complete language. Second, we assume that there will either be malloc/free or a garbage collector. Interestingly, if these constraints are relaxed one can build a compiler that significantly outperforms even C++ or Rust for problems that can be transformed into the domain of the compiler.

There is a thread of research into these types of compilers that has lately produced some really interesting results. For instance, Google now writes nearly all of its image-processing algorithms using one of these compilers, and Amazon has been sponsoring an open-source project to build a compiler specifically for numeric computing and neural networks; both are built on this foundation. With big players interested in these ideas, a few words about the foundation of these new compilers are worthwhile.

#### Grids of computations

High performance computing designs often boil down to a grid of computations. These grids are often multi-leveled, and each level has a specific number of dimensions. Thinking of a cube of cubes may help visualize these topologies. For example, the outer (Grid) cube may have dimensions 4x4x4 and each inner (Thread) cube may have dimensions 2x2x2. This notion accurately describes many processes running on supercomputers, perhaps with another outer level. In full generality, this type of model can also describe GPUs, and even typical multi-processor CPU systems (where communication between cache hierarchies is expensive).
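The "cube of cubes" topology above can be sketched in a few lines. This is purely illustrative (the names `GRID_DIM` and `THREAD_DIM` are made up for this sketch, not taken from any real API), but it shows how a two-level 4x4x4-of-2x2x2 hierarchy enumerates its units of execution:

```python
from itertools import product

GRID_DIM = (4, 4, 4)    # outer (Grid) cube: blocks
THREAD_DIM = (2, 2, 2)  # inner (Thread) cube: threads within a block

def all_units():
    """Yield (block_index, thread_index) for every unit of execution."""
    for block in product(*(range(d) for d in GRID_DIM)):
        for thread in product(*(range(d) for d in THREAD_DIM)):
            yield block, thread

units = list(all_units())
# 4*4*4 blocks times 2*2*2 threads = 512 units of execution in total.
print(len(units))  # 512
```

Adding a third level for a supercomputer's node grid would just be one more nested `product` around the outer loop.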

As a concrete example, here is a diagram of NVIDIA's CUDA architecture:

Notably, the cost of communication increases with distance in such a system. Register access is fastest, communicating within one inner cube is fairly cheap, but communicating between two inner cubes is not so cheap, especially if the communication requires synchronization. In a supercomputer, communication between the outer cube levels probably involves another significant jump in latency, and channels may not exist between any two outer cubes, necessitating a multi-hop communication channel (also slow).

The relaxation of Turing-completeness comes in the form of disallowing infinite recursion. Then, we aim to describe the expensive parts of our computation as a sequence of nested loop expressions operating on a bounded heap with known (potentially pre-allocated) sizes. The whole situation is called an 'operation'.

**Operation**: A nested loop over `Y` domains, potentially producing multiple results. Each element of every result is calculated at each iteration tuple, so the results have `Y` dimensions.

Generally, an operation operates as either a `reduce` (like a dot product) or a `scan` (parallel prefix scan). Thus we can express most linear algebraic operations. Good examples include matrix multiply, or neural network style convolution. The point is that these operations can be described in a fundamentally functional manner. For years I have believed that CUDA is too low level and that as a language it locks people into a set of sub-optimal design decisions due to its imperative nature. Until recently, compiler technology has lagged, so CUDA remains a reasonable choice to solve the wide range of problems that NVIDIA designed it for. Future compilers address some of these concerns.

We know from auto-parallelization research that dependency analysis and alias analysis are critical to parallelization. Without describing the problem in a functional manner, we limit our ability to perform such analysis. We also know from years of experience that a sufficiently smart compiler remains out of reach at our current level of technical expertise.

#### No compiler for old nerds

Luckily for us, we can still take the middle ground of using the 'human compiler' for the high-level direction, while organizing problems so that the software compiler can reorganize computations for fast execution. The developer can schedule the computations and the compiler can reorganize the details of loop iteration. Ultimately, the high-level direction could be decided by something more automated, such as grid search, hyper-parameter optimization, or a genetic algorithm.

The compiler still has a lot of work to do. It very well may want to reorganize caching or vectorization depending on whether we are running on a TPU, an ARM CPU with NEON optimizations, or an FPGA. The hardware architecture may require nontrivial reordering of our computations in order to get good performance, but doing this in C++ by hand could introduce bugs.
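A toy illustration of this separation between algorithm and schedule, loosely in the spirit of Halide (the function names and structure here are hypothetical, not a real API): the algorithm says *what* each output element is, while the schedule decides which loop nest realizes it.

```python
def scaled(v):
    return 2 * v  # the algorithm: a pure function of its coordinates

def execute(algorithm, width, height, order="row_major"):
    """'Compiler': realize the algorithm over a bounded 2-D domain
    using whichever loop order the schedule requests."""
    out = [[0] * width for _ in range(height)]
    if order == "row_major":
        for y in range(height):
            for x in range(width):
                out[y][x] = algorithm(x + y)
    else:  # column-major: same answer, different traversal/locality
        for x in range(width):
            for y in range(height):
                out[y][x] = algorithm(x + y)
    return out

# Varying the schedule cannot change the answer, only the performance.
assert execute(scaled, 3, 2, "row_major") == execute(scaled, 3, 2, "col_major")
```

Swapping `order` here is a stand-in for the much richer transformations (tiling, vectorization, fusion) a real scheduling compiler would apply per target.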

In conclusion:

- By constraining the language we provide a more suitable substrate for operation optimization
- Describing computations in terms of nested iteration through `Y` domains allows us to express computations that map nicely to grids of computation
- Optimal grid design and communication cost may vary widely between architectures
- Ideally, we want to separate the description of our algorithm from its mapping to hardware - then, we can vary that mapping without introducing algorithmic errors

#### The anatomy of optimization

What do you think is the most expensive part of calculating an answer on a GPU? Do you think that adding two numbers (an ALU or FPU operation) is the most expensive part? What about fetching the operands from main memory? What if they are in on-chip cache? What if they are stored in registers?

In practice, moving data between host and GPU memory is generally the most expensive step. The hardware is fast enough that recalculating an answer can be cheaper than fetching it from main memory.

Abstractly, we can break optimization problems down into three axes: parallelism, locality, and redundancy. We face trade-offs, and navigating this space can involve exchanging caching for recomputation. Details of the underlying hardware architecture will determine the optimal solution, and getting it right may realize significant performance gains.

An example from Jonathan Ragan-Kelley's 2014 Ph.D. thesis:

The motivating problem is a simple 3x3 box blur on an image, which resembles a neural network's convolutional layer. The solution involves applying a 3x3 kernel that sums neighboring pixels and then divides by nine. This blurs the image.
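A direct, maximally redundant version of that blur can be sketched as follows: every output pixel independently reads and sums its 3x3 neighborhood, so each interior input pixel is read roughly nine times. (Border handling is simply skipped here to keep the sketch short.)

```python
def box_blur(img):
    """Naive 3x3 box blur: each output reads all 9 neighbors."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):        # skip the border for simplicity
        for x in range(1, w - 1):
            s = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    s += img[y + dy][x + dx]
            out[y][x] = s / 9.0
    return out

# A small gradient image: pixel value is x + y.
img = [[float(x + y) for x in range(4)] for y in range(4)]
print(box_blur(img)[1][1])  # average of the 3x3 block around (1, 1) = 2.0
```

The three trade-off axes below can each be read off this loop nest.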

- Parallelism - how many units of execution can usefully attack the problem simultaneously? Maximizing parallelism incurs a cost in the other two dimensions. In the blur example, increasing parallelism to 100% involves independently recalculating the entire 3x3 neighborhood for each output pixel. So, each input pixel is read roughly nine times. This also increases redundancy by recalculating many partial values in both the `x` and `y` dimensions.
- Locality - how far must data move in order to be used? It is possible to calculate the answer after reading each input pixel exactly once. In this case, locality is increased at the cost of reduced parallelism, with only moderate redundancy.
- Redundancy - can recalculation be avoided? One strategy is to cache an entire intermediate image blurred in only the `x` or `y` dimension. This probably involves reading a complete image of data twice from RAM, *but* some intermediate values no longer need to be recalculated. There is some overlap in computation, though much less than in the maximally parallelized case. So, reducing redundancy can involve reducing parallelism.

#### Taking a breath

This post covers a lot of ground. The main takeaway is that relaxing some of the constraints placed on general-purpose compilers (e.g., those for C++ and Rust) opens the door to high-performance compilers capable of generating faster code for a subset of problems. In particular, separating the algorithm definition from the hardware schedule can enable a single algorithm to be executed efficiently on a variety of hardware architectures. Doing this involves understanding a more abstract set of scheduling primitives that can map to different systems of computation.

The next post in this series outlines a concrete example that applies these concepts. First, the data-intensive core of the problem is identified and a batching strategy is devised. Then, for each platform we care about, we work through the details of the algorithm in terms of the concepts in this post, paying careful attention to how different possible solutions will perform.

Further reading:

Going deep with the TechAscent team!