Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Paperback: $59.99

Overview

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.

If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.

With this book, you will:

  • Familiarize yourself with the Spark programming model
  • Become comfortable within the Spark ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public data sets
  • Discover which machine learning tools make sense for particular problems
  • Acquire code that can be adapted to many uses

Product Details

ISBN-13: 9781491972953
Publisher: O'Reilly Media, Incorporated
Publication date: 07/07/2017
Pages: 277
Sales rank: 877,027
Product dimensions: 7.00(w) x 8.60(h) x 0.60(d) inches

About the Author

Sandy Ryza develops algorithms for public transit at Remix. Previously, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill".

Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.

Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.

Josh Wills is the Head of Data Engineering at Slack, the founder of the Apache Crunch project, and wrote a tweet about data scientists once.

Table of Contents

Foreword vii

Preface ix

1 Analyzing Big Data 1

The Challenges of Data Science 3

Introducing Apache Spark 4

About This Book 6

The Second Edition 7

2 Introduction to Data Analysis with Scala and Spark 9

Scala for Data Scientists 10

The Spark Programming Model 11

Record Linkage 12

Getting Started: The Spark Shell and SparkContext 13

Bringing Data from the Cluster to the Client 19

Shipping Code from the Client to the Cluster 22

From RDDs to Data Frames 23

Analyzing Data with the DataFrame API 26

Fast Summary Statistics for DataFrames 32

Pivoting and Reshaping DataFrames 33

Joining DataFrames and Selecting Features 37

Preparing Models for Production Environments 38

Model Evaluation 40

Where to Go from Here 41

3 Recommending Music and the Audioscrobbler Data Set 43

Data Set 44

The Alternating Least Squares Recommender Algorithm 45

Preparing the Data 48

Building a First Model 51

Spot Checking Recommendations 54

Evaluating Recommendation Quality 57

Computing AUC 58

Hyperparameter Selection 60

Making Recommendations 62

Where to Go from Here 64

4 Predicting Forest Cover with Decision Trees 67

Fast Forward to Regression 67

Vectors and Features 68

Training Examples 69

Decision Trees and Forests 70

Covtype Data Set 73

Preparing the Data 73

A First Decision Tree 76

Decision Tree Hyperparameters 82

Tuning Decision Trees 84

Categorical Features Revisited 88

Random Decision Forests 91

Making Predictions 93

Where to Go from Here 94

5 Anomaly Detection in Network Traffic with K-means Clustering 97

Anomaly Detection 98

K-means Clustering 98

Network Intrusion 99

KDD Cup 1999 Data Set 100

A First Take on Clustering 101

Choosing k 103

Visualization with SparkR 106

Feature Normalization 110

Categorical Variables 112

Using Labels with Entropy 114

Clustering in Action 115

Where to Go from Here 117

6 Understanding Wikipedia with Latent Semantic Analysis 119

The Document-Term Matrix 120

Getting the Data 122

Parsing and Preparing the Data 122

Lemmatization 124

Computing the TF-IDFs 125

Singular Value Decomposition 127

Finding Important Concepts 129

Querying and Scoring with a Low-Dimensional Representation 133

Term-Term Relevance 134

Document-Document Relevance 136

Document-Term Relevance 137

Multiple-Term Queries 138

Where to Go from Here 140

7 Analyzing Co-Occurrence Networks with GraphX 141

The MEDLINE Citation Index: A Network Analysis 143

Getting the Data 144

Parsing XML Documents with Scala's XML Library 146

Analyzing the MeSH Major Topics and Their Co-Occurrences 147

Constructing a Co-Occurrence Network with GraphX 150

Understanding the Structure of Networks 154

Connected Components 154

Degree Distribution 157

Filtering Out Noisy Edges 159

Processing EdgeTriplets 160

Analyzing the Filtered Graph 162

Small-World Networks 163

Cliques and Clustering Coefficients 164

Computing Average Path Length with Pregel 165

Where to Go from Here 170

8 Geospatial and Temporal Data Analysis on New York City Taxi Trip Data 173

Getting the Data 174

Working with Third-Party Libraries in Spark 175

Geospatial Data with the Esri Geometry API and Spray 176

Exploring the Esri Geometry API 176

Intro to GeoJSON 178

Preparing the New York City Taxi Trip Data 180

Handling Invalid Records at Scale 182

Geospatial Analysis 186

Sessionization in Spark 189

Building Sessions: Secondary Sorts in Spark 190

Where to Go from Here 193

9 Estimating Financial Risk Through Monte Carlo Simulation 195

Terminology 196

Methods for Calculating VaR 197

Variance-Covariance 197

Historical Simulation 197

Monte Carlo Simulation 197

Our Model 198

Getting the Data 199

Preprocessing 199

Determining the Factor Weights 202

Sampling 205

The Multivariate Normal Distribution 208

Running the Trials 209

Visualizing the Distribution of Returns 212

Evaluating Our Results 213

Where to Go from Here 215

10 Analyzing Genomics Data and the BDG Project 217

Decoupling Storage from Modeling 218

Ingesting Genomics Data with the ADAM CLI 221

Parquet Format and Columnar Storage 227

Predicting Transcription Factor Binding Sites from ENCODE Data 229

Querying Genotypes from the 1000 Genomes Project 236

Where to Go from Here 239

11 Analyzing Neuroimaging Data with PySpark and Thunder 241

Overview of PySpark 242

PySpark Internals 243

Overview and Installation of the Thunder Library 245

Loading Data with Thunder 245

Thunder Core Data Types 252

Categorizing Neuron Types with Thunder 253

Where to Go from Here 258

Index 259
