
Apache Spark 101: Read Modes

Apache Spark, one of the most powerful distributed data processing engines, provides multiple ways to handle corrupted records during the read process. These methods, known as read modes, let users decide how malformed data should be treated. This article delves into these read modes, their functionality, and their use cases.


Permissive Mode (default):

In permissive mode, the default setting, Spark adopts a lenient approach to data discrepancies.

  • Handling of Corrupted Records: Upon encountering a corrupted record, Spark sets the fields it cannot parse to null for that record. In addition, if the schema includes a string column named _corrupt_record, the raw malformed record is preserved there, as sketched below.
  • Advantage: This ensures that Spark continues processing without interruption, even if it comes across a few corrupted records. It’s a forgiving mode, handy when data integrity is not the sole priority and the emphasis is on keeping the pipeline running.
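
To see the corrupted rows themselves rather than just their nulled-out fields, you can add the corrupt-record column to the schema yourself. Below is a minimal sketch, assuming an active SparkSession named spark and the same sample_data_malformed.json file used in the example later in this article:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Include a string column named _corrupt_record so Spark has a place
# to keep the raw text of any record it fails to parse.
schema_with_corrupt = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

dfWithCorrupt = spark.read \
    .schema(schema_with_corrupt) \
    .option("mode", "permissive") \
    .json("sample_data_malformed.json")

# Clean rows show null in _corrupt_record; malformed rows keep their raw text there.
dfWithCorrupt.show(truncate=False)

The column name _corrupt_record is only the default; it can be changed with the reader’s columnNameOfCorruptRecord option.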

DropMalformed Mode:

As the name suggests, this mode is less forgiving than permissive: Spark takes stringent action against records that don’t match the schema.

  • Handling of Corrupted Records: Spark silently drops any row containing a corrupted or malformed record, so that only clean records remain.
  • Advantage: This mode is useful when the objective is to work solely with records that align with the expected schema, even if that means discarding a few anomalies. If you want a clean dataset and can accept some data loss, this mode is your go-to; a quick way to measure that loss is sketched below.
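
Because the dropping happens silently (no error is raised), it is worth checking how many rows were discarded. A quick sketch, again assuming an active SparkSession named spark and the sample file from the example below:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

expected_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Compare the raw line count (one JSON object per line) with the
# number of rows that survive dropMalformed.
total_records = spark.read.text("sample_data_malformed.json").count()
clean_records = spark.read.schema(expected_schema) \
    .option("mode", "dropMalformed") \
    .json("sample_data_malformed.json") \
    .count()
print(f"Dropped {total_records - clean_records} malformed record(s)")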

FailFast Mode:

FailFast mode is the strictest of the three and is intended for scenarios where data integrity cannot be compromised.

  • Handling of Corrupted Records: In this mode, Spark throws an exception and halts the job as soon as it encounters a corrupted record. Because Spark reads lazily, the exception typically surfaces when an action such as show() triggers the actual read, which is why the example below wraps both the read and the show() in a try block.
  • Advantage: This strict approach ensures that no malformed data slips through. This mode is ideal if the dataset must strictly adhere to the expected schema without discrepancies.

To cement the understanding of these read modes, let’s delve into a hands-on example:
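
Before running the example, we need a sample_data_malformed.json to read. A file along the following lines, with one JSON object per line and one record whose age is not an integer, would exercise all three modes. (Writing it locally is fine for a local run; with master('yarn') as below, the file would need to live on HDFS.)

# Hypothetical sample data: the "thirty" record violates the IntegerType age column.
sample_records = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "age": "thirty"}',
    '{"name": "Charlie", "age": 25}'
]
with open("sample_data_malformed.json", "w") as f:
    f.write("\n".join(sample_records))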

from pyspark.sql import SparkSession
import getpass

username = getpass.getuser()
spark = SparkSession.builder \
    .config('spark.ui.port', '0') \
    .config("spark.sql.warehouse.dir", f"/user/{username}/warehouse") \
    .enableHiveSupport() \
    .master('yarn') \
    .getOrCreate()
# Expected Schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

expected_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

print("Expected Schema:\n", expected_schema)
print("\n")
# Permissive Mode
print("Permissive Mode:")
print("Expected: Rows with malformed 'age' values will have null in the 'age' column.")
dfPermissive = spark.read.schema(expected_schema).option("mode", "permissive").json("sample_data_malformed.json")
dfPermissive.show()
# DropMalformed Mode
print("\nDropMalformed Mode:")
print("Expected: Rows with malformed 'age' values will be dropped.")
dfDropMalformed = spark.read.schema(expected_schema).option("mode", "dropMalformed").json("sample_data_malformed.json")
dfDropMalformed.show()
# FailFast Mode
print("\nFailFast Mode:")
print("Expected: Throws an error upon encountering malformed data.")
try:
    dfFailFast = spark.read.schema(expected_schema).option("mode", "failFast").json("sample_data_malformed.json")
    dfFailFast.show()  # the read is lazy, so the exception surfaces here
except Exception as e:
    print("Error encountered:", e)

A Note on RDDs versus DataFrames/Datasets:

The discussion about read modes (Permissive, DropMalformed, and FailFast) pertains to DataFrames and Datasets when sourcing data from semi-structured, text-based formats such as JSON and CSV. (Self-describing binary formats like Parquet carry their own schema, so these read modes do not apply to them.) The modes become critical when there’s a risk of records not aligning with the expected schema.

Resilient Distributed Datasets (RDDs), a foundational element in Spark, represent a distributed set of objects. Unlike DataFrames and Datasets, RDDs don’t possess a schema. Consequently, when working with RDDs, it’s more about manually processing data than relying on predefined structures. Hence, RDDs don’t intrinsically incorporate these read modes. However, these modes become relevant when transitioning data between RDDs and DataFrames/Datasets or imposing a schema on RDDs.
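
As one illustration, PySpark’s DataFrameReader.json also accepts an RDD of JSON strings, so a read mode can be applied at the moment an RDD is turned into a DataFrame. A minimal sketch, reusing the spark session and expected_schema from the example above:

# An RDD of raw JSON strings, one of them malformed.
rdd = spark.sparkContext.parallelize([
    '{"name": "Dana", "age": 41}',
    '{"name": "Eli", "age": "forty"}'
])

# Imposing the schema on the RDD: dropMalformed discards the bad string.
dfFromRdd = spark.read.schema(expected_schema) \
    .option("mode", "dropMalformed") \
    .json(rdd)
dfFromRdd.show()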

Understanding and choosing the appropriate read mode in Spark can significantly influence data processing outcomes. While some scenarios require strict adherence to data integrity, others prioritize continuity in processing. By providing these read modes, Spark ensures that it caters to a diverse range of data processing needs. The mode choice should always align with the overarching project goals and data requirements.
