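With an emphasis on the improvements and new features in Spark 2.0, authors Bill Chambers and Matei Zaharia break Spark topics into distinct sections, each with its own goals. You will explore the basic operations and common functions of Spark's structured APIs, as well as Structured Streaming, a new high-level API for building end-to-end streaming applications. Developers and system administrators will learn the fundamentals of monitoring, tuning, and debugging Spark, and explore machine learning techniques and deployment scenarios for MLlib, Spark's scalable machine learning library.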
Preface
Part I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark
Apache Spark's Philosophy
Context: The Big Data Problem
History of Spark
The Present and Future of Spark
Running Spark
Downloading Spark Locally
Launching Spark's Interactive Consoles
Running Spark in the Cloud
Data Used in This Book
2. A Gentle Introduction to Spark
Spark's Basic Architecture
Spark Applications
Spark's Language APIs
Spark's APIs
Starting Spark
The SparkSession
DataFrames
Partitions
Transformations
Lazy Evaluation
Actions
Spark UI
An End-to-End Example
DataFrames and SQL
Conclusion
3. A Tour of Spark's Toolset
Running Production Applications
Datasets: Type-Safe Structured APIs
Structured Streaming
Machine Learning and Advanced Analytics
Lower-Level APIs
SparkR
Spark's Ecosystem and Packages
Conclusion
Part II. Structured APIs: DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets
Schemas
Overview of Structured Spark Types
DataFrames Versus Datasets
Columns
Rows
Spark Types
Overview of Structured API Execution
Logical Planning
Physical Planning
Execution