Fundamentals of Apache Spark — Interview Questions & Answers | Meritshot Interview Guides

Architecture & Core Concepts

1. What is Apache Spark?

Apache Spark is an open-source, distributed computing framework designed for fast, large-scale data processing. It was developed at UC Berkeley and donated to the Apache Software Foundation. Spark processes data in-memory across a cluster, making it significantly faster than disk-based systems like Hadoop MapReduce for iterative algorithms and interactive queries. It supports batch processing, stream processing, machine learning (MLlib), graph processing (GraphX), and SQL queries (Spark SQL) through a unified API.

2. How does Spark compare to Hadoop MapReduce?

Spark is generally 10x to 100x faster than Hadoop MapReduce for most workloads because it processes data in-memory rather than writing intermediate results to disk between stages. MapReduce writes to HDFS after every map and reduce phase, creating significant I/O overhead. Spark also has a richer programming model with transformations and actions, a unified API for batch and streaming, and interactive shells. However, Spark requires more RAM and MapReduce is better suited for single-pass batch jobs with very large data that exceeds cluster memory.

3. What is the Spark architecture?

Spark follows a master-worker architecture. The Driver Program hosts the SparkContext, coordinates the job, and converts user code into tasks. The Cluster Manager (Standalone, YARN, Mesos, or Kubernetes) allocates resources across the cluster. Worker Nodes host Executors, which are JVM processes that run tasks and store RDD partitions in memory or on disk. The DAG Scheduler converts transformations into a Directed Acyclic Graph of stages, and the Task Scheduler assigns tasks to executors.

4. What is a SparkSession?

SparkSession is the unified entry point for Spark functionality introduced in Spark 2.0, combining SparkContext, SQLContext, and HiveContext into a single object. It is created using SparkSession.builder.appName("app").getOrCreate(). From the SparkSession, you can create DataFrames, execute SQL queries, access the SparkContext for RDD operations, and configure Spark. In PySpark notebooks (such as Databricks), a SparkSession named spark is automatically available.

5. What is a DAG in Spark?

A DAG (Directed Acyclic Graph) is a logical execution plan that Spark creates from the chain of transformations before any actual computation. Nodes in the DAG represent RDDs or DataFrames and edges represent transformations. The DAG Scheduler optimises the execution plan by grouping transformations into stages separated by shuffle operations and then submits stages to the Task Scheduler. Viewing the DAG in the Spark UI helps understand the execution plan, identify bottlenecks, and debug performance issues.

6. What is lazy evaluation in Spark?

Lazy evaluation means that Spark does not execute transformations immediately when they are called — it only records the lineage (the sequence of operations). Computation begins only when an action (such as count(), collect(), write()) is triggered. This allows Spark to optimise the execution plan end-to-end by combining, reordering, and eliminating unnecessary steps. For example, filter() followed by select() can be pushed down before a join, reducing data shuffled across the network.

7. What is the difference between a transformation and an action?

A transformation creates a new RDD or DataFrame from an existing one without executing any computation — it is lazy. Examples include map(), filter(), select(), groupBy(), join(), and withColumn(). An action triggers the execution of the DAG and returns a result to the driver or writes data to storage. Examples include count(), collect(), show(), take(n), write(), saveAsTextFile(), and first(). Every Spark job consists of a sequence of transformations ending in one or more actions.

8. What are the different deployment modes in Spark?

Spark runs in three modes: Local mode runs entirely on a single machine without a cluster manager, used for development and testing. Client mode deploys the driver on the machine that submits the job, so the client machine must stay running for the job's duration — used for interactive sessions. Cluster mode deploys the driver inside the cluster, allowing the submitting machine to disconnect after submission — used for production batch jobs. YARN, Kubernetes, and Standalone cluster managers support both client and cluster modes.

9. What is the Catalyst Optimizer?

The Catalyst Optimizer is Spark SQL's query optimisation framework that automatically transforms a logical query plan into an efficient physical execution plan. It performs rule-based optimisations such as predicate pushdown (filtering data as early as possible), column pruning (selecting only needed columns), constant folding (evaluating constant expressions at compile time), and join reordering. Catalyst also enables cost-based optimisation when statistics are available. It runs transparently on all DataFrame and SQL operations.

10. What is the Tungsten Execution Engine?

Tungsten is Spark's physical execution engine that improves CPU and memory efficiency. It uses off-heap binary memory management (bypassing JVM overhead and garbage collection), cache-aware computation that minimises cache misses, whole-stage code generation that compiles multiple operators into a single JVM function, and vectorised execution for columnar data. Together, these optimisations make Spark significantly faster for compute-intensive operations like sorting, hashing, and aggregation, especially for large datasets.

RDDs & DataFrames

11. What is an RDD?

An RDD (Resilient Distributed Dataset) is Spark's fundamental data abstraction — an immutable, fault-tolerant, distributed collection of elements partitioned across the nodes of a cluster. Resilient means it can be recomputed from its lineage if a partition is lost. Distributed means data is spread across multiple nodes. Dataset refers to the collection of records. RDDs support two types of operations: transformations (which create new RDDs) and actions (which return values). They are the lowest-level Spark API and are largely superseded by DataFrames.

12. What is the difference between RDD, DataFrame, and Dataset?

RDDs are untyped, unstructured collections without schema information, giving full flexibility but no automatic optimisation. DataFrames are distributed collections of data organised into named columns (like SQL tables), with schema information that enables Catalyst optimisation and Tungsten execution. Datasets combine type-safety with the optimisation benefits of DataFrames — they are available in Scala and Java but in Python, DataFrames and Datasets are unified (PySpark only has DataFrames). DataFrames are generally preferred for their performance and ease of use.

13. How do you create a DataFrame in PySpark?

DataFrames are created from existing data using spark.createDataFrame(data, schema) where data is a list of tuples or rows. From a Pandas DataFrame: spark.createDataFrame(pandas_df). From files: spark.read.csv('path', header=True, inferSchema=True), spark.read.json('path'), spark.read.parquet('path'), or spark.read.format('delta').load('path') for Delta Lake. From SQL: spark.sql('SELECT * FROM table'). inferSchema=True automatically detects data types but requires an extra pass over the data.

14. What is a partition in Spark?

A partition is a logical chunk of data within an RDD or DataFrame that is processed by a single task on a single executor. The number of partitions determines parallelism — more partitions mean more tasks that can run in parallel. By default, Spark creates one partition per HDFS block (128MB) or per CPU core in local mode. rdd.getNumPartitions() shows the current count. df.repartition(n) creates exactly n partitions (full shuffle), while df.coalesce(n) reduces partitions without a full shuffle. Optimal partition count is typically 2-4× the number of available cores.

15. What is the difference between `repartition()` and `coalesce()`?

repartition(n) performs a full shuffle to redistribute data evenly across exactly n partitions. It is used to increase or decrease partitions and is required before writing to ensure evenly-sized output files. coalesce(n) reduces partitions by combining existing partitions without a full shuffle — only moving the minimum necessary data. It is more efficient than repartition() for reducing partition count but cannot increase it and may produce uneven partitions. Use coalesce() to reduce before writing if data is already evenly distributed.

16. What is persistence (caching) in Spark?

df.cache() or df.persist(StorageLevel.MEMORY_AND_DISK) stores the DataFrame or RDD in memory (and optionally on disk) after the first action computes it, so subsequent actions can reuse the cached data without recomputation. This is critical for iterative algorithms (machine learning) or when the same DataFrame is used multiple times in a job. df.unpersist() releases cached data. Storage levels include MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and serialised variants. Over-caching wastes memory and can degrade performance.

17. What are wide and narrow transformations?

A narrow transformation processes each partition independently without requiring data from other partitions — examples include map(), filter(), select(), and withColumn(). No shuffle is required, so they are fast. A wide transformation requires data from multiple partitions to be shuffled across the network — examples include groupByKey(), reduceByKey(), join(), distinct(), and orderBy(). Wide transformations mark stage boundaries in the DAG and are the primary source of Spark performance bottlenecks.

18. How do you read and write Parquet files in Spark?

Parquet is read using spark.read.parquet('path/to/file.parquet') or spark.read.format('parquet').load('path'). It is written using df.write.parquet('output/path') or df.write.format('parquet').save('output/path'). Parquet is the preferred format in Spark because it is columnar (enabling column pruning), splittable (enabling parallel reads), and has excellent compression. df.write.mode('overwrite') replaces existing data. df.write.partitionBy('year', 'month').parquet(path) creates partitioned directories for efficient predicate pushdown filtering.

19. How do you handle schema evolution in Spark?

Schema evolution (adding or removing columns from files over time) is handled using spark.read.option('mergeSchema', 'true').parquet(path) to union schemas from multiple Parquet files. Delta Lake tables handle schema evolution natively: df.write.format('delta').option('mergeSchema', 'true').mode('append').save(path). StructType can be defined explicitly to ensure consistent schema across reads: spark.read.schema(defined_schema).csv(path). Column casting with df.withColumn('col', df['col'].cast('integer')) resolves type mismatches.

20. What is a schema in Spark and how do you define one?

A schema defines the structure of a DataFrame — the column names, data types, and nullable flags. Schemas are defined using StructType and StructField: StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), False)]). Explicit schemas are preferred over inferSchema=True for production code because they are faster (no extra data scan), more reliable, and prevent silent type coercion issues. df.schema returns the current schema, df.printSchema() displays it as a tree.

Spark SQL

21. How do you run SQL queries in Spark?

SQL queries are run by first registering a DataFrame as a temporary view: df.createOrReplaceTempView('sales'), then querying it with spark.sql('SELECT region, SUM(amount) FROM sales GROUP BY region'). createGlobalTempView('name') creates a view accessible across SparkSessions (referenced as global_temp.name). Permanent tables can be created and queried from the Spark catalog. The result is a new DataFrame that can be further transformed. Spark SQL supports ANSI SQL and HiveQL syntax.

22. What is the difference between `createTempView()` and `createOrReplaceTempView()`?

createTempView('name') registers the DataFrame as a temporary view but throws an error if a view with that name already exists. createOrReplaceTempView('name') either creates a new view or replaces an existing one with the same name without error — it is the safe, idempotent option used in production and notebooks where code is re-run. Temporary views are session-scoped (lost when the SparkSession ends) and are not visible to other SparkSessions. They exist only in memory and have no storage overhead.

23. What are window functions in Spark SQL?

Window functions perform calculations across a set of rows related to the current row, defined by a window specification. from pyspark.sql.window import Window is imported, then windowSpec = Window.partitionBy('dept').orderBy('salary'). rank(), dense_rank(), row_number(), lag(), lead(), sum(), avg(), and ntile() are applied over the window: df.withColumn('rank', rank().over(windowSpec)). They are used for ranking, running totals, moving averages, and identifying previous/next row values.

24. How does `groupBy().agg()` work in PySpark?

df.groupBy('region', 'product').agg(F.sum('amount').alias('total'), F.avg('amount').alias('avg'), F.count('*').alias('count')) groups by multiple columns and applies multiple aggregate functions simultaneously. F is typically from pyspark.sql import functions as F. Named aggregate functions include F.sum(), F.avg(), F.count(), F.min(), F.max(), F.stddev(), F.collect_list(), and F.collect_set(). The agg() approach is preferred over chaining multiple aggregations for readability and performance.

25. What is the difference between `filter()` and `where()` in Spark?

df.filter(condition) and df.where(condition) are exactly equivalent in Spark — where() is simply an alias for filter() provided for SQL familiarity. Both accept a column expression (df['age'] > 30), a string SQL expression ("age > 30"), or a Column object. They apply the same Catalyst optimisation and predicate pushdown. The choice is purely stylistic — where() reads more naturally alongside SQL syntax while filter() is more consistent with functional programming style.

26. How do you perform a join in PySpark?

df1.join(df2, on='key', how='inner') performs a join on a common column. For different column names: df1.join(df2, df1['id'] == df2['customer_id'], 'left'). Join types include inner, left, right, full (outer), cross, left_semi (keep left rows with a match), and left_anti (keep left rows with no match). Multiple join keys use a list: on=['key1', 'key2']. Broadcast joins (F.broadcast(df)) force the smaller table to be sent to all executors, avoiding shuffle for small dimension tables.

27. What is a broadcast join and when should it be used?

A broadcast join sends a small DataFrame to every executor in the cluster so the join can be performed locally without shuffling the large DataFrame. It is triggered automatically when one DataFrame is smaller than spark.sql.autoBroadcastJoinThreshold (default 10MB) or explicitly with df1.join(F.broadcast(df2), 'key'). Broadcast joins are significantly faster than sort-merge joins for large-to-small table joins. They should not be used when the small table is still too large to fit in executor memory.

28. What is the difference between `sort()` and `orderBy()`?

df.sort() and df.orderBy() are aliases and functionally identical — both sort the entire DataFrame globally across partitions, triggering a full shuffle. df.sort('col', ascending=False) or df.orderBy(F.desc('col')) sort in descending order. df.sortWithinPartitions('col') sorts within each partition locally without a shuffle — useful when you only need sorted order within partitions (e.g., before a mapPartitions operation). Global sort should be used sparingly as it is expensive on large datasets.

29. How do you handle null values in PySpark DataFrames?

Null values are detected with df.filter(df['col'].isNull()) or df.filter(df['col'].isNotNull()). They are dropped using df.dropna() (all columns) or df.dropna(subset=['col1', 'col2']). They are filled using df.fillna({'col1': 0, 'col2': 'Unknown'}). df.na.fill() and df.na.drop() are equivalent methods. F.coalesce(df['col1'], df['col2'], F.lit(0)) returns the first non-null value. F.when(df['col'].isNull(), 0).otherwise(df['col']) provides conditional null replacement.

30. What are User Defined Functions (UDFs) in Spark?

A UDF is a custom Python or Scala function registered with Spark to be applied to DataFrame columns. In PySpark: from pyspark.sql.functions import udf; double_udf = udf(lambda x: x * 2, IntegerType()); df.withColumn('doubled', double_udf(df['col'])). UDFs are flexible but bypass Catalyst and Tungsten optimisations, resulting in significantly slower performance than built-in functions. Pandas UDFs (@pandas_udf) use Apache Arrow for vectorised execution and are much faster. Always prefer built-in F.* functions over UDFs when possible.

Streaming & Real-Time

31. What is Spark Structured Streaming?

Spark Structured Streaming is a scalable, fault-tolerant stream processing engine built on Spark SQL. It treats a live data stream as an unbounded table that is continuously appended with new rows. Users write the same DataFrame and SQL API they use for batch processing. The engine handles fault tolerance using checkpointing and write-ahead logs, guarantees exactly-once processing semantics, and supports sources including Kafka, file systems, and Delta Lake. It was introduced in Spark 2.0 to replace the older DStreams API.

32. What are the output modes in Structured Streaming?

Structured Streaming supports three output modes. Complete mode writes the full result table to the sink on every trigger — suitable for aggregations where the complete result changes. Append mode (default) writes only newly added rows since the last trigger — suitable for non-aggregating queries and append-only sinks. Update mode writes only rows that have changed since the last trigger — suitable for aggregations with sinks that can handle updates. Not all modes are supported for all query types.

33. What is a watermark in Structured Streaming?

A watermark specifies the threshold for how late data can arrive and still be included in an aggregation. df.withWatermark('timestamp', '10 minutes').groupBy(F.window('timestamp', '5 minutes'), 'device').count() tells Spark to allow data up to 10 minutes late, after which state for that window can be dropped. Without watermarks, state grows unboundedly as Spark cannot know when late data might still arrive. Watermarks enable Spark to clean up state and are required for stateful streaming operations in append or update mode.

34. What is checkpointing in Spark Streaming?

Checkpointing saves the streaming query's progress (offsets processed) and intermediate state (aggregation state) to a durable storage location (HDFS, S3, ADLS) so the query can recover from failures without data loss or re-processing. It is configured with .option('checkpointLocation', 'path') in the query writer. Without checkpointing, if the Spark application restarts, the stream would either reprocess all data from the beginning (if offsets are not saved) or lose accumulated state. Checkpointing is mandatory for all stateful streaming queries.

35. How do you integrate Spark with Apache Kafka?

Kafka integration uses the spark-sql-kafka package. Reading a Kafka stream: spark.readStream.format('kafka').option('kafka.bootstrap.servers', 'host:9092').option('subscribe', 'topic').load(). The result has columns key, value (both binary), topic, partition, offset, and timestamp. Values are decoded with F.col('value').cast('string') or JSON parsing with from_json. Batch reading uses spark.read.format('kafka') with startingOffsets and endingOffsets options for bounded reads.

Performance & Optimisation

36. What causes data skew and how do you handle it?

Data skew occurs when certain keys have disproportionately more data than others, causing some tasks to take much longer than the rest (stragglers). Common causes include highly popular keys in joins or groupBy operations. Solutions include salting the skewed key by appending a random number to distribute it across more partitions and then removing the salt after aggregation, using broadcast joins for the small side of a skewed join, repartitioning with repartition() on a better-distributed column, or using adaptive query execution (AQE) in Spark 3.0+.

37. What is Adaptive Query Execution (AQE)?

AQE (enabled with spark.sql.adaptive.enabled = true in Spark 3.0+) dynamically optimises the execution plan based on runtime statistics rather than static estimates. Key features include dynamically coalescing post-shuffle partitions (reducing small output partitions), dynamically switching join strategies (converting a sort-merge join to a broadcast join if one side turns out to be small), and dynamically handling data skew by splitting skewed partitions. AQE significantly improves performance without manual tuning for many workloads.

38. What is the difference between `cache()` and `persist()`?

df.cache() is equivalent to df.persist(StorageLevel.MEMORY_AND_DISK) — both store the DataFrame in memory (deserialized JVM objects) and spill to disk if memory is insufficient. persist() allows specifying a custom storage level: MEMORY_ONLY (evicts instead of spilling), MEMORY_ONLY_SER (serialized, slower access but lower memory), DISK_ONLY, and OFF_HEAP. cache() is the convenience method for the most common case. Both are lazy — the data is actually cached the first time an action is triggered after the call.

39. What are shuffle partitions and how do you tune them?

Shuffle partitions determine the number of partitions after wide transformations (joins, groupBy) and are controlled by spark.sql.shuffle.partitions (default 200). For small datasets, 200 partitions creates many tiny tasks with excessive overhead. For large datasets, fewer than the optimal number causes large partitions and memory pressure. A good rule of thumb is to target partition sizes of 100-200MB. With AQE enabled, Spark 3.0+ coalesces small partitions automatically. For manual tuning: estimate total data size after shuffle ÷ 150MB gives approximate ideal partition count.

40. What is bucketing in Spark?

Bucketing pre-partitions data into a fixed number of buckets based on the hash of a column and stores it as files, eliminating shuffle for subsequent joins or groupBy operations on the same bucketed column. df.write.bucketBy(32, 'user_id').sortBy('user_id').saveAsTable('bucketed_sales') creates a bucketed table. When two tables are bucketed on the same column with the same number of buckets, joining them requires no shuffle. Bucketing is beneficial for tables that are repeatedly joined on the same key in an ETL pipeline.

41. How do you read only specific partitions from partitioned storage?

For data partitioned on disk (e.g., partitionBy('year', 'month').parquet(path)), Spark performs partition pruning when reading: spark.read.parquet('path').filter('year == 2024 and month == 1') reads only the year=2024/month=1 directory. This dramatically reduces I/O. Predicate pushdown must be enabled (spark.sql.parquet.filterPushdown = true, which is default). For Delta Lake and Hive-partitioned tables, partition statistics ensure only relevant files are scanned. Partitioning should be on low-to-medium cardinality columns commonly used in filters.

42. What is speculative execution in Spark?

Speculative execution (spark.speculation = true) detects slow-running tasks (stragglers) and launches duplicate copies of them on other executors. The first copy to complete is used and the others are killed. It mitigates transient node slowness, but can waste resources if tasks are slow due to data issues (skew) rather than hardware. It should not be used as a substitute for addressing root causes of slow tasks. Speculation checks are run at intervals controlled by spark.speculation.interval and trigger when a task is running significantly slower than the median.

43. What is garbage collection tuning in Spark?

Spark JVM applications can suffer from excessive garbage collection pauses, especially when many short-lived objects are created during transformations. Tuning involves using Kryo serialization (spark.serializer = org.apache.spark.serializer.KryoSerializer) instead of Java serialization to reduce object size, increasing the G1GC regions size, reducing the number of small objects through code changes, using off-heap memory (spark.memory.offHeap.enabled = true) to bypass GC for Tungsten operations, and monitoring GC time in the Spark UI's executor tab. High GC time relative to task time indicates memory pressure.

44. What are the best practices for writing optimised PySpark code?

Best practices include avoiding Python UDFs in favour of built-in pyspark.sql.functions, minimising collect() calls that bring data to the driver, filtering and selecting required columns as early as possible to reduce data volume, caching DataFrames used multiple times, using Parquet or Delta format for storage, enabling AQE, avoiding Cartesian products, using broadcast joins for small tables, partitioning data appropriately for reads and writes, profiling with the Spark UI, and using DataFrame API over RDDs for Catalyst and Tungsten optimisation benefits.

45. What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions, schema enforcement, time travel, and scalable metadata management to data lakes stored on cloud object storage (S3, ADLS, GCS). It stores data as Parquet files with a transaction log (_delta_log) recording every change. Key features include MERGE (upsert), UPDATE, DELETE operations not natively supported by plain Parquet, schema evolution with mergeSchema, and OPTIMIZE with Z-ordering for compacting small files and improving query performance. Delta Lake is the foundation of Databricks' Lakehouse architecture.

46. What is the difference between Delta Lake MERGE and UPSERT?

MERGE (also called upsert — update or insert) in Delta Lake conditionally updates existing records or inserts new ones based on a matching condition: deltaTable.alias('t').merge(updates.alias('s'), 't.id = s.id').whenMatchedUpdateAll().whenNotMatchedInsertAll().execute(). MERGE also supports conditional updates, deletes on match, and ignoring non-matching source rows. It atomically applies all changes within a single transaction, ensuring consistency. MERGE is critical for SCD (Slowly Changing Dimension) processing, CDC (Change Data Capture) pipelines, and deduplication.

47. What is time travel in Delta Lake?

Delta Lake's transaction log records every version of a table, enabling time travel — querying historical snapshots of data. spark.read.format('delta').option('versionAsOf', 5).load(path) reads version 5. spark.read.format('delta').option('timestampAsOf', '2024-01-15').load(path) reads as of a specific timestamp. DESCRIBE HISTORY delta_table shows all versions with timestamps, operations, and user info. Time travel enables auditing, debugging data pipeline issues by comparing versions, rolling back bad writes, and reproducing ML experiments.

48. How do you handle small files in Spark?

Small files (many files much smaller than HDFS block size) cause excessive metadata overhead and slow reads. They are produced by streaming workloads, over-partitioned writes, or many small appends. Solutions include using OPTIMIZE in Delta Lake to compact small files: OPTIMIZE delta_table ZORDER BY (user_id). In batch Spark, coalesce(n) or repartition(n) before writing controls output file count. Auto-compaction in Delta (optimizeWrite = true) intelligently compacts during writes. File size target is typically 128MB to 1GB depending on the workload.

49. What is Spark's memory model?

Spark's executor memory is divided into reserved memory (300MB for system), user memory (for UDFs and data structures, ~40% of remainder by default), execution memory (for shuffles, sorts, joins, ~60%), and storage memory (for caching). Execution and storage share a unified memory pool in Spark 1.6+, where each can use the other's space if idle, controlled by spark.memory.storageFraction. spark.executor.memory sets total executor memory. spark.executor.memoryOverhead adds off-heap memory for JVM overhead. Insufficient memory causes OOM errors or excessive spill to disk.

50. How do you debug a Spark application?

Debugging Spark applications uses the Spark Web UI (typically port 4040) to view active jobs, stages, tasks, executor metrics, GC time, and shuffle read/write sizes. df.explain() prints the logical and physical execution plan to identify potential issues. df.explain(mode='formatted') provides more detail including AQE plans. Logging uses log4j configuration. In Databricks, display events and profiling are built in. For production issues, enable spark.eventLog.enabled = true and use Spark History Server to review completed jobs. DAX Studio equivalent for Spark is the Spark UI's SQL tab.

Fundamentals of Apache Spark — Interview Questions & Answers