1 SparkSQL Overview
2022-07-19 05:33:00 【Mmj666】
1.1 What Is SparkSQL?
Spark SQL is the Spark module for processing structured data.
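As a first taste, here is a minimal sketch of using Spark SQL (assuming Spark 3.x in Scala; the sample data and names are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlHello {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the master is set elsewhere.
    val spark = SparkSession.builder()
      .appName("SparkSQLOverview")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small structured dataset and query it with SQL.
    val df = Seq(("zhangsan", 30), ("lisi", 28)).toDF("name", "age")
    df.createOrReplaceTempView("user")
    spark.sql("SELECT name, age FROM user WHERE age > 29").show()

    spark.stop()
  }
}
```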
1.2 Hive and SparkSQL
Shark developed into two branches: SparkSQL and Hive on Spark.
- SparkSQL continues to evolve as a member of the Spark ecosystem. It is no longer constrained by Hive and only maintains compatibility with Hive.
- Hive on Spark is a Hive development plan that makes Spark one of Hive's underlying engines; in other words, Hive is no longer tied to a single engine and can run on MapReduce, Tez, Spark, and so on.
For developers, SparkSQL simplifies RDD development, improves development efficiency, and also executes very fast, so in practice SparkSQL is used in most cases. To simplify RDD development and improve development efficiency, Spark SQL provides two programming abstractions, similar to RDD in Spark Core:
- DataFrame
- DataSet
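The two abstractions can be sketched side by side (a minimal example; the `Person` case class and sample data are hypothetical, introduced only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Long)

object AbstractionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstractions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame: rows with a schema, columns addressed by name.
    val df = spark.createDataFrame(Seq(Person("zhangsan", 30)))
    df.select("name", "age").show()

    // DataSet: the same data, but strongly typed as Person.
    val ds = df.as[Person]
    ds.filter(p => p.age > 20).show()

    spark.stop()
  }
}
```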
1.3 SparkSQL Features
1.3.1 Easy Integration
Seamlessly mix SQL queries with Spark programs.
1.3.2 Unified Data Access
Connect to different data sources in the same way.
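"The same way" here is the uniform read API: pick a format, then load. A spark-shell sketch (the file paths are placeholders):

```scala
// The same read API works for every built-in source.
val jsonDF    = spark.read.format("json").load("data/users.json")
val parquetDF = spark.read.load("data/users.parquet") // parquet is the default format
val csvDF     = spark.read.option("header", "true").csv("data/users.csv")
```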
1.3.3 Hive Compatibility
Run SQL or HiveQL directly on existing warehouses.
1.3.4 Standard Data Connectivity
Connect through JDBC or ODBC.
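For example, reading from a relational database over JDBC looks like this (a spark-shell sketch; the URL, table name, and credentials are placeholders, and the matching JDBC driver must be on the classpath):

```scala
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "user")
  .option("user", "root")
  .option("password", "******")
  .load()

// Writing back through JDBC uses the symmetric write API.
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "user_copy")
  .option("user", "root")
  .option("password", "******")
  .save()
```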
1.4 DataFrame
In Spark, a DataFrame is an RDD-based distributed dataset, similar to a two-dimensional table in a traditional database.
The main differences between DataFrame and RDD:
- A DataFrame carries schema metadata: every column of the two-dimensional table it represents has a name and a type. This lets Spark SQL see more of the structure of the data, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency substantially.
- With an RDD, there is no way to know the internal structure of the stored elements, so Spark Core can only perform simple, general pipeline optimizations at the stage level.
Like Hive, DataFrame also supports nested data types (struct, array, and map). In terms of API ease of use, the DataFrame API provides a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API.

Compare RDD[Person] with a DataFrame of the same data: although RDD[Person] takes Person as a type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains and what each column's name and type are.
A DataFrame is a view of data with a schema; it can be treated as a table in a database.
A DataFrame is also lazily executed, but its performance is higher than an RDD's. The main reason is the optimized execution plan: queries are optimized by the Spark Catalyst optimizer.
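The optimized plan can be inspected directly. A spark-shell sketch (sample data is illustrative):

```scala
import spark.implicits._

val df = Seq(("zhangsan", 30), ("lisi", 28)).toDF("name", "age")
df.createOrReplaceTempView("user")

// explain(true) prints the parsed, analyzed, optimized, and physical plans
// produced by Catalyst; nothing is executed until an action such as show()
// is called (lazy execution).
spark.sql("SELECT name FROM user WHERE age > 29").explain(true)
```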
1.5 What Is a DataSet?
A DataSet is a distributed dataset. Added in Spark 1.6 as a new abstraction, it is an extension of DataFrame. It provides the advantages of RDDs (strong typing and the ability to use powerful lambda functions) together with the advantages of Spark SQL's optimized execution engine. A DataSet also supports functional transformations (operations such as map, flatMap, and filter).
- DataSet is an extension of the DataFrame API and the newest data abstraction in SparkSQL.
- It has a user-friendly API style, offering both compile-time type-safety checks and DataFrame's query optimization.
- Case classes are used to define the structure of the data in a DataSet; each attribute name in the case class maps directly to a field name in the DataSet.
- DataSet is strongly typed; for example, you can have DataSet[Car] or DataSet[Person].
- DataFrame is a special case of DataSet: DataFrame = DataSet[Row], so a DataFrame can be converted to a DataSet with the as method. Row is a type just like Car or Person, and all table structure information is expressed through Row; when retrieving data from a Row, you must specify the field position.
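The last point can be sketched as follows (spark-shell style; the `Person` case class and sample data are hypothetical):

```scala
import org.apache.spark.sql.Row
import spark.implicits._

case class Person(name: String, age: Long)

val df = Seq(("zhangsan", 30L), ("lisi", 28L)).toDF("name", "age")

// DataFrame is DataSet[Row]: fields of a Row are fetched by position.
val firstRow: Row = df.first()
val name = firstRow.getString(0)
val age  = firstRow.getLong(1)

// Converting to a typed DataSet with `as` restores field access by name,
// with the case-class attribute names matched to the column names.
val ds = df.as[Person]
ds.map(p => s"${p.name}: ${p.age}").show()
```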
Source: Atguigu (尚硅谷) Spark course; for learning purposes only.