Apache Spark Tutorial

Need to Know Your Spark Version? Here’s How to Find It

Apache Spark is a powerful distributed processing system used for big data workloads. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Knowing how to check the version of Spark you are working with is important, especially when integrating with different components, …
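
For quick reference, one common way to check the version programmatically is through the SparkSession itself; the sketch below is a minimal Scala example with an illustrative app name and a local master (running `spark-submit --version` from a shell is another common option):

```scala
import org.apache.spark.sql.SparkSession

object SparkVersionCheck {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) a session; appName and master here are placeholders.
    val spark = SparkSession.builder()
      .appName("version-check")
      .master("local[*]")
      .getOrCreate()

    // SparkSession exposes the running Spark version as a plain string.
    println(s"Spark version: ${spark.version}")

    spark.stop()
  }
}
```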

Spark SQL Case When/Otherwise Example

Apache Spark SQL provides users with a rich toolkit for performing complex data manipulation and analysis, and conditional expressions are among its most useful tools. In this in-depth guide, we’ll cover the Spark SQL “CASE WHEN/OTHERWISE” syntax, which offers a powerful and flexible way to branch logic inside DataFrame transformations and queries, akin to the “if-then-else” statements …
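
As a quick preview, the sketch below shows both the DataFrame `when`/`otherwise` API and the equivalent SQL `CASE WHEN` expression; the sample data and column names are illustrative only, not taken from the article:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// spark-shell-style snippet; names and data are made up for illustration
val spark = SparkSession.builder().appName("case-when-demo").master("local[*]").getOrCreate()
import spark.implicits._

val orders = Seq(("A001", 120.0), ("A002", 45.0), ("A003", 300.0)).toDF("order_id", "amount")

// DataFrame API: when/otherwise mirrors SQL's CASE WHEN ... ELSE ... END
val labelled = orders.withColumn(
  "size",
  when(col("amount") >= 200, "large")
    .when(col("amount") >= 100, "medium")
    .otherwise("small")
)

// Equivalent logic expressed as a SQL string via selectExpr
val labelledSql = orders.selectExpr(
  "order_id",
  "amount",
  "CASE WHEN amount >= 200 THEN 'large' WHEN amount >= 100 THEN 'medium' ELSE 'small' END AS size"
)

labelled.show()
```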

Master Spark Filtering: startswith & endswith Demystified (Examples Included!)

When working with Apache Spark, manipulating and filtering datasets by string patterns becomes a routine necessity. Fortunately, Spark offers powerful string functions that allow developers to refine their data with precision. Among these functions are `startsWith` and `endsWith`, which are often employed to target specific textual patterns at the beginning or the end of a …
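
For a taste of what the guide covers, here is a minimal Scala sketch using `startsWith` and `endsWith` on a Column; the filenames and column name are hypothetical, chosen only to illustrate the filters:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("string-filter-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val files = Seq("report_2023.csv", "report_2024.json", "summary_2024.csv").toDF("filename")

// Keep rows whose filename starts with "report"
val reports = files.filter(col("filename").startsWith("report"))

// Keep rows whose filename ends with ".csv"
val csvFiles = files.filter(col("filename").endsWith(".csv"))

// Conditions combine with && / || like any other Column predicates
val csvReports = files.filter(col("filename").startsWith("report") && col("filename").endsWith(".csv"))

csvReports.show()
```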

Setup Spark on Hadoop YARN – {Step By Step Guide}

Apache Spark has become one of the most popular frameworks for big data processing, thanks to its ease of use and performance advantages over traditional big data technologies. Spark can run on various cluster managers, with Hadoop YARN being one of the most common for production deployments due to its ability to manage resources effectively …
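
As a rough preview of what a YARN-ready job looks like, here is a minimal Scala sketch; the application name, HDFS path, and `spark-submit` flags shown in the comments are illustrative assumptions, and the guide itself walks through the actual cluster configuration:

```scala
import org.apache.spark.sql.SparkSession

// In practice the master and deploy mode are usually supplied on the command line,
// for example (illustrative only):
//   spark-submit --master yarn --deploy-mode cluster --class YarnWordCount app.jar
// with HADOOP_CONF_DIR / YARN_CONF_DIR pointing at the cluster configuration.
object YarnWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yarn-wordcount") // placeholder name
      .getOrCreate()             // master("yarn") comes from spark-submit

    import spark.implicits._

    // Hypothetical HDFS input path
    val counts = spark.read.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.show()
    spark.stop()
  }
}
```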

Spark Sort multiple DataFrame columns with examples

A DataFrame in Apache Spark is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database. When working with large datasets, it often becomes necessary to sort the data based on one or multiple columns to streamline downstream processing or to simply make the data more readable. In this …
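
As a small preview, the sketch below sorts by two columns in different directions using `orderBy`; the sample data and column names are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("multi-sort-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val sales = Seq(
  ("East", "2024-01", 100),
  ("West", "2024-01", 250),
  ("East", "2024-02", 175)
).toDF("region", "month", "revenue")

// Sort by region ascending, then revenue descending; sort() is an alias of orderBy()
val sorted = sales.orderBy(col("region").asc, col("revenue").desc)

sorted.show()
```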

Spark’s select vs selectExpr: A Comparison

Apache Spark is a powerful, distributed data processing system that allows for fast querying, analysis, and transformation of large datasets. Spark SQL is a Spark module for structured data processing, and within this framework, `select` and `selectExpr` are two pivotal methods used for querying data in Spark DataFrames. In this extensive comparison, we will delve …
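
To frame the comparison, here is a minimal Scala sketch showing the same projection written with `select` (Column objects and Scala expressions) and with `selectExpr` (SQL expression strings); the sample data and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("select-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val people = Seq(("Ada", 36), ("Linus", 54)).toDF("name", "age")

// select takes Column objects (or column names) and Scala-side expressions
val withCols = people.select(col("name"), (col("age") + 1).alias("age_next_year"))

// selectExpr takes SQL expression strings, convenient for casts, functions, or CASE logic
val withExprs = people.selectExpr("name", "age + 1 AS age_next_year", "upper(name) AS name_upper")

withCols.show()
withExprs.show()
```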
