Apache Spark - Apache Spark Tutorial

Writing Spark DataFrame to HBase with Hortonworks

Apache Spark / By Editorial Team

Apache Spark is a powerful open-source distributed computing system that provides a fast, general-purpose cluster-computing framework. It is often used for big data analysis. Apache HBase is a scalable, distributed NoSQL database built on top of Hadoop. It excels at providing real-time read/write access to large datasets. Hortonworks Data Platform (HDP) is a …

Spark SQL Convert Date to String Format

Apache Spark / By Editorial Team

Apache Spark is an open-source, distributed computing system that provides an easy-to-use and powerful interface for processing large datasets. Spark SQL is one of its components; it allows querying data via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it integrates with the datasets …

Spark Date Functions: Handling Month’s Last Day

Apache Spark / By Editorial Team

Working with dates and times is a common yet critical part of data analysis and processing tasks. In data engineering and analytics, handling time series data often requires dealing with the special case of determining the last day of a month, which falls anywhere from the 28th to the 31st depending on the month and whether …

Reading and Writing XML Files with Spark

Apache Spark / By Editorial Team

Apache Spark is a powerful open-source distributed computing system that provides fast and general-purpose cluster-computing capabilities. Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. One of the common tasks while working with Spark is processing data in different formats, including XML (eXtensible Markup …

Join Operations in Spark SQL DataFrames

Apache Spark / By Editorial Team

Apache Spark is a fast and general-purpose cluster computing system, which includes tools for managing and manipulating large datasets. One such tool is Spark SQL, which allows users to work with structured data, similar to traditional SQL databases. Spark SQL operates on DataFrames, which are distributed collections of data organized into named columns. Join operations …

Read Hive Tables with Spark SQL (Easy Guide)

Apache Spark / By Editorial Team

Apache Spark is a powerful open-source distributed computing system that provides fast and general-purpose cluster-computing capabilities. It is known for its ease of use in creating complex, multi-stage data pipelines and for supporting a variety of data sources, including Hive. Hive is data warehouse software built on top of Apache Hadoop for providing data query …

Lineage Graph in Spark: An Overview

Apache Spark / By Editorial Team

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. At the heart of its architecture lies a fundamental concept known as the lineage graph, which is an essential feature that provides Spark with efficient fault recovery and optimization mechanisms. This overview …

Spark SQL String Functions: Your Guide to Efficient Text Data Handling

Apache Spark / By Editorial Team

Apache Spark SQL is a powerful tool for processing structured data. Spark SQL provides a wide array of functions that can manipulate string data efficiently. String functions in Spark SQL offer the ability to perform a multitude of operations on string columns within a DataFrame or a SQL query. These functions include operations like comparing …

Add Multiple Jars to spark-submit Classpath

Apache Spark / By Editorial Team

When working with Apache Spark, it becomes essential to understand how to manage dependencies and external libraries effectively. Spark applications can depend on third-party libraries or custom-built jars that need to be available on the classpath for the driver, executors, or both. This comprehensive guide will discuss the various methods and best practices for adding …

Checking for Column Presence in Spark DataFrame

Apache Spark / By Editorial Team

When working with large datasets, particularly in the context of data transformation and analysis, Apache Spark DataFrames are an invaluable tool. However, as data comes in various shapes and forms, it is often necessary to ensure that particular columns exist before performing operations on them. Checking for column presence in a Spark DataFrame is a …
