Spark-Syntax
This is a public repo documenting the "best practices" for writing PySpark code, based on what I have learnt from working with PySpark for 3 years. It focuses mainly on the Spark DataFrames and SQL library.
You can also visit ericxiao251.github.io/spark-syntax/ for an online book version.
Contributing/Topic Requests
If you notice any possible improvements in terms of typos, spelling, grammar, etc., feel free to create a PR and I'll review it.
If you have any topics that I could potentially go over, please create an issue and describe the topic. I'll try my best to address it.
Acknowledgement
Huge thanks to Levon for turning everything into a GitBook. You can follow him on GitHub at https://github.com/tumregels.
Table of Contents:
Chapter 1 - Getting Started with Spark:
-  1.1 - Useful Material
-  1.2 - Creating your First DataFrame
-  1.3 - Reading your First Dataset
-  1.4 - More Comfortable with SQL?
Chapter 2 - Exploring the Spark APIs:
-  2.1 - Non-Trivial Data Structures in Spark
   -  2.1.1 - Struct Types (StructType)
   -  2.1.2 - Arrays and Lists (ArrayType)
   -  2.1.3 - Maps and Dictionaries (MapType)
   -  2.1.4 - Decimals and Why did my Decimals overflow :( (DecimalType)
-  2.2 - Performing your First Transformations
   -  2.2.1 - Looking at Your Data (collect/head/take/first/toPandas/show)
   -  2.2.2 - Selecting a Subset of Columns (drop/select)
   -  2.2.3 - Creating New Columns and Transforming Data (withColumn/withColumnRenamed)
   -  2.2.4 - Constant Values and Column Expressions (lit/col)
   -  2.2.5 - Casting Columns to a Different Type (cast)
   -  2.2.6 - Filtering Data (where/filter/isin)
   -  2.2.7 - Equality Statements in Spark and Comparisons with Nulls (isNotNull()/isNull())
   -  2.2.8 - Case Statements (when/otherwise)
   -  2.2.9 - Filling in Null Values (fillna/coalesce)
   -  2.2.10 - Spark Functions aren't Enough, I Need my Own! (udf/pandas_udf)
   -  2.2.11 - Unionizing Multiple Dataframes (union)
   -  2.2.12 - Performing Joins (clean one) (join)
-  2.3 - More Complex Transformations
   -  2.3.1 - One to Many Rows (explode)
   -  2.3.2 - Range Join Conditions (WIP) (join)
-  2.4 - Potential Performance Boosting Functions
   -  2.4.1 - (repartition)
   -  2.4.2 - (coalesce)
   -  2.4.3 - (cache)
   -  2.4.4 - (broadcast)
Chapter 3 - Aggregates:
-  3.1 - Clean Aggregations
-  3.2 - Non Deterministic Behaviours
Chapter 4 - Window Objects:
Chapter 5 - Error Logs:
Chapter 6 - Understanding Spark Performance:
-  6.1 - Primer to Understanding Your Spark Application
   -  6.1.1 - Understanding how Spark Works
   -  6.1.2 - Understanding the SparkUI
   -  6.1.3 - Understanding how the DAG is Created
   -  6.1.4 - Understanding how Memory is Allocated
-  6.2 - Analyzing your Spark Application
   -  6.2.1 - Looking for Skew in a Stage
   -  6.2.2 - Looking for Skew in the DAG
   -  6.2.3 - How to Determine the Number of Partitions to Use
-  6.3 - How to Analyze the Skew of Your Data
Chapter 7 - High Performance Code:
-  7.0 - The Types of Join Strategies in Spark
   -  7.0.1 - You got a Small Table? (Broadcast Join)
   -  7.0.2 - The Ideal Strategy (BroadcastHashJoin)
   -  7.0.3 - The Default Strategy (SortMergeJoin)
-  7.1 - Improving Joins
   -  7.1.1 - Filter Pushdown
   -  7.1.2 - Joining on Skewed Data (Null Keys)
   -  7.1.3 - Joining on Skewed Data (High Frequency Keys I)
   -  7.1.4 - Joining on Skewed Data (High Frequency Keys II)
   -  7.1.5 - Join Ordering
-  7.2 - Repeated Work on a Single Dataset (caching)
   -  7.2.1 - caching layers
-  7.3 - Spark Parameters
   -  7.3.1 - Running Multiple Spark Applications at Scale (dynamic allocation)
   -  7.3.2 - The magical number 2001 (partitions)
   -  7.3.3 - Using a lot of UDFs? (python memory)
-  7. - Bloom Filters :o?