Apache Spark: Introduction to Project Tungsten

Introduction

Project Tungsten, first introduced with Spark 1.4 and extended in Spark 2.x with a second-generation execution engine, is an effort to substantially improve Spark’s CPU and memory efficiency. A central part of it acts as a specialized compiler that turns queries into optimized bytecode at runtime.

Tungsten’s key technique is to compile a query, or each stage of one, into a single JVM bytecode function, which makes better use of the CPU and so improves overall performance.

You do not need to understand Tungsten to write efficient Spark programs, but its underlying mechanisms and the optimizations they enable make it well worth studying for anyone who wants to dig deeper into Spark.

After all, you never know when Tungsten might be the key factor that propels your next project to success!

Using Tungsten

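By default, Tungsten is enabled. The spark.sql.tungsten.enabled flag shown below dates from Spark 1.5; in Spark 2.x and later, whole-stage code generation is toggled with spark.sql.codegen.wholeStage instead (also true by default). If you’re curious about the performance impact and wish to compare, you can disable Tungsten with the following command: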

$ spark-shell --conf spark.sql.tungsten.enabled=false

And to re-enable it:

$ spark-shell --conf spark.sql.tungsten.enabled=true

Thomas Neumann’s Seminal VLDB 2011 Paper

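For those who want to dive into efficient query plans for modern hardware, Thomas Neumann’s paper “Efficiently Compiling Efficient Query Plans for Modern Hardware” (VLDB 2011) is an essential read. Its data-centric approach to compiling queries is the inspiration behind whole-stage code generation in Spark.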

Eliminating Virtual Function Dispatches

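Tungsten helps to eliminate virtual function dispatches. In the classic iterator (Volcano) execution model, every operator exposes a next-style method, and the engine makes at least one virtual call per operator for every row it processes. Across billions of rows those calls become a bottleneck. Tungsten’s generated code replaces the chain of calls with straight-line code that the JIT compiler can optimize as a single unit.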
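To make the cost of per-row dispatch concrete, here is a minimal sketch in plain Java. It is not Spark code: the Operator interface and the Scan and Filter classes are hypothetical stand-ins for iterator-style operators, and countFused is a hand-written version of what a fused scan, filter and count could look like.

```java
// Main class first so the file can be run directly with: java DispatchDemo.java
final class DispatchDemo {
    // Iterator style: every row crosses the Operator interface at least once.
    static long countInterpreted(Operator root) {
        long count = 0;
        while (root.advance()) {
            count++;
        }
        return count;
    }

    // Fused style: the same scan + filter + count as one loop, with no interface calls.
    static long countFused(long[] data, long threshold) {
        long count = 0;
        for (long v : data) {
            if (v > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] data = {5, 12, 7, 30, 1};
        System.out.println(countInterpreted(new Filter(new Scan(data), 6)));
        System.out.println(countFused(data, 6));
    }
}

// Hypothetical operator contract (an assumption for illustration, not a Spark API).
interface Operator {
    boolean advance(); // move to the next row; false when the input is exhausted
    long value();      // value of the current row
}

final class Scan implements Operator {
    private final long[] data;
    private int pos = -1;

    Scan(long[] data) {
        this.data = data;
    }

    public boolean advance() {
        return ++pos < data.length;
    }

    public long value() {
        return data[pos];
    }
}

final class Filter implements Operator {
    private final Operator child;
    private final long threshold;

    Filter(Operator child, long threshold) {
        this.child = child;
        this.threshold = threshold;
    }

    // Each call here triggers further interface calls on the child, once per input row.
    public boolean advance() {
        while (child.advance()) {
            if (child.value() > threshold) {
                return true;
            }
        }
        return false;
    }

    public long value() {
        return child.value();
    }
}
```

Both methods return the same count. The difference is that the first reaches each row through interface calls, while the second contains no calls inside the loop at all.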

Intermediate Data in CPU Registers

One of the innovative aspects of Tungsten is that its generated code keeps intermediate values in local variables, which the JIT compiler can hold in CPU registers, instead of materializing them in memory between operators. This drastically reduces memory access, which is often the limiting factor in data processing tasks, and so improves overall performance.
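The difference can be sketched in plain Java, again as an illustration rather than Spark code. The class and method names are hypothetical. The first method writes each intermediate result to a heap array before the next step reads it back; the second keeps the intermediate value in a local variable, which the JIT compiler is free (but not guaranteed) to keep in a register.

```java
final class RegisterDemo {
    // Operator-at-a-time style: each step materializes its output in memory.
    static long sumMaterialized(long[] input, long threshold) {
        long[] filtered = new long[input.length];
        int n = 0;
        for (long v : input) {
            if (v > threshold) {
                filtered[n++] = v;          // intermediate result written to the heap
            }
        }
        long[] doubled = new long[n];
        for (int i = 0; i < n; i++) {
            doubled[i] = filtered[i] * 2;   // read back, transformed, written again
        }
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += doubled[i];
        }
        return sum;
    }

    // Fused style: the intermediate value never leaves a local variable.
    static long sumFused(long[] input, long threshold) {
        long sum = 0;
        for (long v : input) {
            if (v > threshold) {
                long doubled = v * 2;       // candidate for a CPU register
                sum += doubled;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = {5, 12, 7, 30, 1};
        System.out.println(sumMaterialized(data, 6));
        System.out.println(sumFused(data, 6));
    }
}
```

Both methods compute the same sum; only the second avoids the two temporary arrays and the extra passes over them.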

SIMD

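Tungsten is designed to benefit from Single Instruction, Multiple Data (SIMD) processing, a form of parallelism in which one instruction operates on several data elements at once. Spark does not emit SIMD instructions itself. Instead, its columnar data layouts and tight loops over primitive values give the JVM’s JIT compiler the opportunity to vectorize the work, which can provide a significant performance boost for data-parallel tasks such as scanning or doing arithmetic over a column.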
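As a rough illustration of why data layout matters here, the following plain Java sketch contrasts a row-wise and a column-wise version of the same addition. The Row class and both methods are hypothetical, and whether the JIT compiler actually vectorizes the columnar loop depends on the JVM and the hardware, so treat this as a sketch of the idea rather than a guarantee.

```java
final class ColumnBatchDemo {
    // Hypothetical row object: each row is a separate object on the heap.
    static final class Row {
        final int price;
        final int tax;

        Row(int price, int tax) {
            this.price = price;
            this.tax = tax;
        }
    }

    // Row-wise: the loop follows a reference per row, which hinders vectorization.
    static int[] totalsRowWise(Row[] rows) {
        int[] out = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = rows[i].price + rows[i].tax;
        }
        return out;
    }

    // Column-wise: contiguous primitive arrays and a simple loop body,
    // the shape of loop a JIT compiler may turn into SIMD instructions.
    // Assumes price and tax have the same length.
    static int[] totalsColumnar(int[] price, int[] tax) {
        int[] out = new int[price.length];
        for (int i = 0; i < price.length; i++) {
            out[i] = price[i] + tax[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] price = {100, 250, 40};
        int[] tax = {8, 20, 3};
        Row[] rows = {new Row(100, 8), new Row(250, 20), new Row(40, 3)};
        System.out.println(java.util.Arrays.toString(totalsRowWise(rows)));
        System.out.println(java.util.Arrays.toString(totalsColumnar(price, tax)));
    }
}
```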

Whole-Stage Code Generation

Whole-stage code generation is another key aspect of Tungsten. Instead of executing one operator at a time, this approach generates code for a whole “stage” of consecutive operators and compiles it into a single function. By doing so, Tungsten can better optimize the execution, reduce function calls, and keep more data in CPU registers, thereby improving performance.
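The idea of generating one function for several operators can be sketched with a toy generator in plain Java. This is an assumption-laden illustration, not Spark’s generator: the class, the operator factories and the emitted variable names (input, v, sum) are all hypothetical, and the sketch only produces source text without compiling it. Each operator contributes a snippet that wraps the code of the operators after it, so the whole pipeline ends up inside one loop.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class StageCodegenSketch {
    // Filter: wraps the downstream code in an if block.
    static UnaryOperator<String> filterGreaterThan(long threshold) {
        return body -> "if (v > " + threshold + "L) {\n" + indent(body) + "}\n";
    }

    // Projection: rewrites the current value, then falls through to the downstream code.
    static UnaryOperator<String> multiplyBy(long factor) {
        return body -> "v = v * " + factor + "L;\n" + body;
    }

    // Builds the source of one fused function body: scan -> ops... -> sum.
    static String generate(List<UnaryOperator<String>> ops) {
        String body = "sum += v;\n"; // the aggregation consumes the final value
        for (int i = ops.size() - 1; i >= 0; i--) {
            body = ops.get(i).apply(body); // wrap from the last operator outward
        }
        return "long sum = 0;\n"
                + "for (long v : input) {\n"
                + indent(body)
                + "}\n"
                + "return sum;\n";
    }

    // Indents every line of a snippet by four spaces.
    static String indent(String code) {
        StringBuilder sb = new StringBuilder();
        for (String line : code.split("\n")) {
            sb.append("    ").append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(List.of(filterGreaterThan(6), multiplyBy(2))));
    }
}
```

For the filter and projection in main, the generated text is a single for loop over input that contains one if block, the multiplication and the accumulation into sum, with no calls between operators. Spark’s real generator is far more elaborate and compiles the generated source to bytecode at runtime, but the shape of the result is the same: one function per stage.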