Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Paperback
Overview
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.
If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.
With this book, you will:
- Familiarize yourself with the Spark programming model
- Become comfortable within the Spark ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public data sets
- Discover which machine learning tools make sense for particular problems
- Acquire code that can be adapted to many uses
Product Details
ISBN-13: | 9781491972953 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 07/07/2017 |
Pages: | 277 |
Sales rank: | 877,027 |
Product dimensions: | 7.00(w) x 8.60(h) x 0.60(d) |
About the Author
Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.
Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.
Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.
Table of Contents
Foreword vii
Preface ix
1 Analyzing Big Data 1
The Challenges of Data Science 3
Introducing Apache Spark 4
About This Book 6
The Second Edition 7
2 Introduction to Data Analysis with Scala and Spark 9
Scala for Data Scientists 10
The Spark Programming Model 11
Record Linkage 12
Getting Started: The Spark Shell and SparkContext 13
Bringing Data from the Cluster to the Client 19
Shipping Code from the Client to the Cluster 22
From RDDs to Data Frames 23
Analyzing Data with the DataFrame API 26
Fast Summary Statistics for DataFrames 32
Pivoting and Reshaping DataFrames 33
Joining DataFrames and Selecting Features 37
Preparing Models for Production Environments 38
Model Evaluation 40
Where to Go from Here 41
3 Recommending Music and the Audioscrobbler Data Set 43
Data Set 44
The Alternating Least Squares Recommender Algorithm 45
Preparing the Data 48
Building a First Model 51
Spot Checking Recommendations 54
Evaluating Recommendation Quality 57
Computing AUC 58
Hyperparameter Selection 60
Making Recommendations 62
Where to Go from Here 64
4 Predicting Forest Cover with Decision Trees 67
Fast Forward to Regression 67
Vectors and Features 68
Training Examples 69
Decision Trees and Forests 70
Covtype Data Set 73
Preparing the Data 73
A First Decision Tree 76
Decision Tree Hyperparameters 82
Tuning Decision Trees 84
Categorical Features Revisited 88
Random Decision Forests 91
Making Predictions 93
Where to Go from Here 94
5 Anomaly Detection in Network Traffic with K-means Clustering 97
Anomaly Detection 98
K-means Clustering 98
Network Intrusion 99
KDD Cup 1999 Data Set 100
A First Take on Clustering 101
Choosing k 103
Visualization with SparkR 106
Feature Normalization 110
Categorical Variables 112
Using Labels with Entropy 114
Clustering in Action 115
Where to Go from Here 117
6 Understanding Wikipedia with Latent Semantic Analysis 119
The Document-Term Matrix 120
Getting the Data 122
Parsing and Preparing the Data 122
Lemmatization 124
Computing the TF-IDFs 125
Singular Value Decomposition 127
Finding Important Concepts 129
Querying and Scoring with a Low-Dimensional Representation 133
Term-Term Relevance 134
Document-Document Relevance 136
Document-Term Relevance 137
Multiple-Term Queries 138
Where to Go from Here 140
7 Analyzing Co-Occurrence Networks with GraphX 141
The MEDLINE Citation Index: A Network Analysis 143
Getting the Data 144
Parsing XML Documents with Scala's XML Library 146
Analyzing the MeSH Major Topics and Their Co-Occurrences 147
Constructing a Co-Occurrence Network with GraphX 150
Understanding the Structure of Networks 154
Connected Components 154
Degree Distribution 157
Filtering Out Noisy Edges 159
Processing EdgeTriplets 160
Analyzing the Filtered Graph 162
Small-World Networks 163
Cliques and Clustering Coefficients 164
Computing Average Path Length with Pregel 165
Where to Go from Here 170
8 Geospatial and Temporal Data Analysis on New York City Taxi Trip Data 173
Getting the Data 174
Working with Third-Party Libraries in Spark 175
Geospatial Data with the Esri Geometry API and Spray 176
Exploring the Esri Geometry API 176
Intro to GeoJSON 178
Preparing the New York City Taxi Trip Data 180
Handling Invalid Records at Scale 182
Geospatial Analysis 186
Sessionization in Spark 189
Building Sessions: Secondary Sorts in Spark 190
Where to Go from Here 193
9 Estimating Financial Risk Through Monte Carlo Simulation 195
Terminology 196
Methods for Calculating VaR 197
Variance-Covariance 197
Historical Simulation 197
Monte Carlo Simulation 197
Our Model 198
Getting the Data 199
Preprocessing 199
Determining the Factor Weights 202
Sampling 205
The Multivariate Normal Distribution 208
Running the Trials 209
Visualizing the Distribution of Returns 212
Evaluating Our Results 213
Where to Go from Here 215
10 Analyzing Genomics Data and the BDG Project 217
Decoupling Storage from Modeling 218
Ingesting Genomics Data with the ADAM CLI 221
Parquet Format and Columnar Storage 227
Predicting Transcription Factor Binding Sites from ENCODE Data 229
Querying Genotypes from the 1000 Genomes Project 236
Where to Go from Here 239
11 Analyzing Neuroimaging Data with PySpark and Thunder 241
Overview of PySpark 242
PySpark Internals 243
Overview and Installation of the Thunder Library 245
Loading Data with Thunder 245
Thunder Core Data Types 252
Categorizing Neuron Types with Thunder 253
Where to Go from Here 258
Index 259