The Complete Apache Spark Collection [Tutorials and Articles]
In this edition of "Best of DZone," we've compiled our best tutorials and articles on one of the most popular analytics engines for data processing, Apache Spark. Whether you're a beginner or a long-time user who has run into the inevitable bottlenecks, we've got your back!
Before we begin, we'd like to thank those who were a part of this article. DZone has been and continues to be a community powered by contributors like you who are eager and passionate to share what they know with the rest of the world.
Let's get started!
Getting Started
Installation
Apache Spark on Windows by Kuldeep Singh — If you were confused by Spark's quick-start guide, this article contains resolutions to the more common errors encountered by developers.
Apache Spark Tutorial (Fast Data Architecture Series) by Bill Ward — In this article, a data scientist and developer gives an Apache Spark tutorial that demonstrates how to get Apache Spark installed.
Theory
Overview of the Apache Spark Ecosystem by Frank Evans — Make the under-the-hood elements of Spark less of a mystery and transfer existing programming knowledge and methods into the power of the Spark engine.
Lambda Architecture With Apache Spark by Taras Matyashovskyy — This blog post will introduce you to the Lambda Architecture designed to take advantage of both batch and streaming processing methods.
How Does Spark Use MapReduce? by Anubhav Tarar — Apache Spark does use MapReduce, but only the idea of it, not the exact implementation. Confused? Let's talk about an example.
Introduction to Apache Spark's Core API (Part I and Part II) by Anil Afrawal — Take a quick look at how to work with the functions and methods contained in Spark's core API using Python.
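To make the "idea of MapReduce, not the exact implementation" point from the Theory articles a little more concrete, here is a minimal sketch in plain Python. It is not Spark code and is not taken from any of the linked articles; the function names (map_phase, shuffle_phase, reduce_phase) and the sample lines are assumptions made up for illustration. It only mirrors the three conceptual steps that a Spark word count expresses with flatMap, map, and reduceByKey.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key; this stands in for the shuffle between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, chosen only to keep the example small.
lines = ["spark uses the idea of mapreduce", "spark keeps the idea"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

In real Spark the same pipeline runs across partitions on many executors and keeps intermediate data in memory where it can, which is where it departs from classic Hadoop MapReduce.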
Spark vs Kafka vs Flink
Spark Streaming vs. Kafka Streaming by Mahesh Chand Kandpal — If event time is very relevant and latencies in the seconds are completely unacceptable, Kafka should be your first choice. Otherwise, Spark works just fine.
Streaming in Spark, Flink, and Kafka by Shivangi Gupta — There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka. Get it all straight in this article.
Apache Flink vs. Apache Spark by Ivan Mushketyk — Should you switch to Apache Flink? Should you stick with Apache Spark for a while? Get the answers to these and other questions.
Hadoop vs Spark — Choosing the Right Big Data Framework by Sunil Goyal — Find the right framework for your big data needs.
Streaming and Structured Streaming
Getting Started With Spark Streaming by Carol McDonald — An introduction to Spark Streaming and how to use it with an example data set.
What Is Structured Streaming? by Himanshu Gupta — Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users in building streaming applications.
Spark Streaming vs. Structured Streaming by Anuj Saxena — Take a look at these two open source data streaming platforms and the scenarios in which each works best.
Spark Clusters
Apache Spark: Setting Up a Cluster on AWS by Jay Sridhar — You can augment and enhance Apache Spark clusters using Amazon EC2's computing resources. Find out how to set up clusters and run master and slave daemons on one node.
Databases, RDDs, and DataFrames
What Is RDD in Spark and Why Do We Need It? by Saurabh Chhajed — Everything you need to understand how Resilient Distributed Datasets (RDDs) function in Spark.
What Is Spark SQL? by Todd McGrath — Spark SQL allows you to use data frames in Python, Java, and Scala; read and write data in a variety of structured formats; and query Big Data with SQL.
Convert RDD to DataFrame With Spark by Mark Needham — Learn how to convert an RDD to a DataFrame using the Databricks Spark CSV library.
Reading Data From Oracle Database With Apache Spark by Emrah Mete — Learn how to connect Apache Spark to an Oracle database, read the data directly, and write it in a DataFrame.
What Are Spark Checkpoints on Data Frames? by Jean Georges Perrin — Checkpoints freeze the content of your DataFrames before performing additional operations. They're essential to effectively managing your DataFrames.
The Right Way to Use Spark and JDBC by Avi Yehuda — Apache Spark is a wonderful tool, but sometimes it needs a bit of tuning. We look at a use case involving reading data from a JDBC source.
Performance Optimization
Understanding Apache Spark Failures and Bottlenecks by Rishitesh Mishra — When everything goes according to plan, it's easy to write and understand applications in Apache Spark. However, a well-tuned application might fail after a change in the data or its layout, or an application that had been running well might start behaving badly due to resource starvation.
Smart Resource Utilization With Spark Dynamic Allocation by Haim Cohen — Configuring your Spark applications wisely will provide you with a good balance between smart allocation and performance.
Apache Spark Performance Tuning – Degree of Parallelism by Rathnadevi Manivannan — Learn about improving performance and increasing speed through partition tuning in a Spark application running on YARN.
Why Your Spark Applications Are Slow or Failing, Part 1: Memory Management and Part 2: Data Skew and Garbage Collection by Rishitesh Mishra — See how common memory management issues, data skew, and garbage collection can have a significant impact on your Spark application's performance.
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds by Harry Powell and Gianmario Spacagna — Barclays Data Scientist Gianmario Spacagna and Harry Powell, Head of Advanced Analytics, describe how they iteratively process raw data directly from the central data warehouse into Spark and how Tachyon is their key enabling technology.
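Data skew, one of the problems the performance articles above discuss, is easy to picture with a small plain-Python sketch. This is not Spark code: the helper names (partition_for, partition_sizes), the key names, and the use of CRC32 as the hash are all assumptions chosen for illustration, and Spark's own partitioner uses a different hash. The sketch only shows how hash partitioning sends every record with the same key to the same partition, so one hot key can overload a single task.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Assign a key to a partition with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(partition_for(key, num_partitions) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: one hot key accounts for 90% of the records.
keys = ["hot_customer"] * 900 + [f"customer_{i}" for i in range(100)]
sizes = partition_sizes(keys, num_partitions=4)

# Ratio of the largest partition to the average partition size.
skew_ratio = max(sizes) / (sum(sizes) / len(sizes))
print(sizes, round(skew_ratio, 2))
```

Because all 900 hot-key records hash to one partition, adding more partitions does not spread them out; that is why skew usually needs a change to the key (for example, salting) rather than only a higher degree of parallelism.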
PySpark Tutorials
Introduction to Spark With Python: PySpark for Beginners by Kislay Keshari — Take a look at how to use Apache Spark with Python (PySpark) in order to perform analysis on robust data sets.
PySpark Tutorial: Learn Apache Spark Using Python by Kislay Keshari — See how to get started with one of the best frameworks to handle big data in real-time and perform analysis in Spark.
PySpark DataFrame Tutorial: Introduction to DataFrames by Kislay Keshari — Explore the idea of DataFrames and how they can help data analysts make sense of large datasets when paired with PySpark.
How to Perform Distributed Spark Streaming With PySpark by Neha Priya — Look at how to use PySpark to quickly analyze incoming data streams to provide real-time metrics.
PySpark Join Explained by Monika Rathor — See how to use PySpark's Join in order to better manipulate data in a DataFrame in Python.
Scala and Spark
Learning Spark With Scala by Mahesh Chand — Often, processing alone is not enough when it comes to big volumes of data. Data must be processed quickly, in real-time, continuously, and concurrently.
Word Count With Spark and Scala by Emmanouil Gkatziouras — See exactly how you can use Scala and Spark together to solve the problems that often occur with word counts.
Scala vs. Python for Apache Spark by Tim Spann — When using Apache Spark for cluster computing, you'll need to choose your language. Scala has its advantages, but see why Python is catching up fast.
Introduction to SparkSession by Abhishek Baranwal — We go over how to use this new feature of Apache Spark 2.0, covering all the Scala and SQL you'll need to get started.
Cleanframes: A Data Cleansing Library for Apache Spark! by Dawid Rutowicz — A developer discusses how to use an open source, Scala-based library that can help take some of the boilerplate code out of data cleansing.
Spark and Machine Learning
Churn Prediction With Apache Spark Machine Learning by Carol McDonald — Learn how to get started using Apache Spark’s machine learning decision trees and machine learning pipelines for classification.
Predictive Analytics With Spark ML by David Moyers — Whether you're running Spark on a large cluster or embedded within a single node app, Spark makes it easy to create predictive analytics with just a few lines of code.
Data Clustering Using Apache Spark by Konur Unyelioglu — This article looks at the analysis of cancer survival using K-means and Gaussian Mixture algorithms.
A Glimpse at the Future of Apache Spark 3.0 With Deep Learning and Kubernetes by Oliver White — Learn how Spark 3.0, Kubernetes, and deep learning all come together.
No One Puts Baby in a Container
Running Apache Spark Applications in Docker Containers by Arseniy Tashoyan — Even once your Spark cluster is configured and ready, you still have a lot of work to do before you can run it in a Docker container. But these tips can help make it easier!
Miscellaneous
Quick Start With Apache Livy by Guglielmo Iozzia — Learn how to get started with Apache Livy, a project in the process of being incubated by Apache that interacts with Apache Spark through a REST interface.
Example ETL Application Using Apache Spark and Hive by Emrah Mete — In this article, we'll read a sample data set with Spark on HDFS (Hadoop Distributed File System), do a simple analytical operation, then write to a table that we'll make in Hive.
Game Theory With Apache Spark Part 1, Part 2, Part 3, and Part 4 by Konur Unyelioglu — Go in-depth on Game Theory with Apache Spark in this four-part series.
Be a Part of the Conversation!
Think we missed something? Want to contribute? Let us know in the comments below, or join the conversation by becoming a member of our community of thousands of developers eager to share their knowledge and passion for programming with others.