Spark Basics
Papers:
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
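Ecosystem
Component | Description |
---|---|
Spark Core | The base engine; provides task scheduling, memory management, fault tolerance, and data integration. |
Spark SQL | Processes structured data; supports SQL queries, HiveQL, and the DataFrame and Dataset APIs. |
Spark Streaming | Real-time processing component; ingests live data streams and converts them into a series of small batch jobs (micro-batches). |
MLlib | Machine learning library; provides common machine learning algorithms and utilities. |
GraphX | Graph computation library; handles graph data structures and runs graph algorithms. |
Standalone scheduler | Spark's built-in, simple resource manager and scheduler; suited to small clusters. |
YARN | Hadoop's resource manager; lets Spark run on Hadoop clusters. |
Mesos | General-purpose cluster manager; supports multiple distributed computing frameworks, including Spark. |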
ML with PySpark
API Reference: https://spark.apache.org/docs/3.1.3/api/python/reference/pyspark.ml.html
- Load data & Preprocessing & Standardization
- Exploratory Data Analysis
- Feature engineering -> Feature vector
- Model design & Performance measurements
- Model build (train, valid, test)
- Visualization & Interpreting model
- Model deployment
PySpark
Reference:
https://spark.apache.org/docs/latest/api/python/index.html
https://spark.apache.org/docs/latest/api/python/reference/index.html
DataFrame
Reference: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.html
Reference: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.agg.html
- df.agg(*exprs): aggregates over the entire DataFrame without groups (shorthand for df.groupBy().agg()).
Functions
Reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
Analysis Strategy
Finance: RFM
Reference: https://www.investopedia.com/terms/r/rfm-recency-frequency-monetary-value.asp
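- Recency: how recently a customer made a purchase
- Frequency: how often a customer makes purchases
- Monetary Value: how much a customer spends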