1 SparkSQL Overview
2022-07-19 05:33:00 【Mmj666】
1.1 What Is SparkSQL?
Spark SQL is the Spark module for processing structured data.
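As a first taste, here is a minimal sketch of using Spark SQL (assuming Spark 3.x in Scala; the sample data and names are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlHello {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the master is set elsewhere.
    val spark = SparkSession.builder()
      .appName("SparkSQLOverview")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small structured dataset and query it with SQL.
    val df = Seq(("zhangsan", 30), ("lisi", 28)).toDF("name", "age")
    df.createOrReplaceTempView("user")
    spark.sql("SELECT name, age FROM user WHERE age > 29").show()

    spark.stop()
  }
}
```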
1.2 Hive and SparkSQL
Shark developed into two branches: SparkSQL and Hive on Spark.
- SparkSQL continues to evolve as a member of the Spark ecosystem. It is no longer constrained by Hive and only maintains compatibility with Hive.
- Hive on Spark is a Hive development plan that makes Spark one of Hive's underlying engines; in other words, Hive is no longer tied to a single engine and can run on MapReduce, Tez, Spark, and so on.
For developers, SparkSQL simplifies RDD development, improves development efficiency, and also executes very fast, so in practice SparkSQL is used in most cases. To simplify RDD development and improve development efficiency, Spark SQL provides two programming abstractions, similar to RDD in Spark Core:
- DataFrame
- DataSet
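The two abstractions can be sketched side by side (a minimal example; the `Person` case class and sample data are hypothetical, introduced only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for illustration.
case class Person(name: String, age: Long)

object AbstractionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstractions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // DataFrame: rows with a schema, columns addressed by name.
    val df = spark.createDataFrame(Seq(Person("zhangsan", 30)))
    df.select("name", "age").show()

    // DataSet: the same data, but strongly typed as Person.
    val ds = df.as[Person]
    ds.filter(p => p.age > 20).show()

    spark.stop()
  }
}
```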
1.3 SparkSQL Features
1.3.1 Easy Integration
Seamlessly mix SQL queries with Spark programs.
1.3.2 Unified Data Access
Connect to different data sources in the same way.
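"The same way" here is the uniform read API: pick a format, then load. A spark-shell sketch (the file paths are placeholders):

```scala
// The same read API works for every built-in source.
val jsonDF    = spark.read.format("json").load("data/users.json")
val parquetDF = spark.read.load("data/users.parquet") // parquet is the default format
val csvDF     = spark.read.option("header", "true").csv("data/users.csv")
```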
1.3.3 Hive Compatibility
Run SQL or HiveQL directly on existing warehouses.
1.3.4 Standard Data Connectivity
Connect through JDBC or ODBC.
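For example, reading from a relational database over JDBC looks like this (a spark-shell sketch; the URL, table name, and credentials are placeholders, and the matching JDBC driver must be on the classpath):

```scala
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "user")
  .option("user", "root")
  .option("password", "******")
  .load()

// Writing back through JDBC uses the symmetric write API.
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "user_copy")
  .option("user", "root")
  .option("password", "******")
  .save()
```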
1.4 DataFrame
In Spark, a DataFrame is an RDD-based distributed dataset, similar to a two-dimensional table in a traditional database.
The main differences between DataFrame and RDD:
- A DataFrame carries schema metadata: every column of the two-dimensional table it represents has a name and a type. This lets Spark SQL see more of the structure of the data, so it can optimize both the data sources behind a DataFrame and the transformations applied to it, ultimately improving runtime efficiency substantially.
- With an RDD, there is no way to know the internal structure of the stored elements, so Spark Core can only perform simple, general pipeline optimizations at the stage level.
Like Hive, DataFrame also supports nested data types (struct, array, and map). In terms of API ease of use, the DataFrame API provides a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API.

Compare RDD[Person] with a DataFrame of the same data: although RDD[Person] takes Person as a type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains and what each column's name and type are.
A DataFrame is a view of data with a schema; it can be treated as a table in a database.
A DataFrame is also lazily executed, but its performance is higher than an RDD's. The main reason is the optimized execution plan: queries are optimized by the Spark Catalyst optimizer.
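The optimized plan can be inspected directly. A spark-shell sketch (sample data is illustrative):

```scala
import spark.implicits._

val df = Seq(("zhangsan", 30), ("lisi", 28)).toDF("name", "age")
df.createOrReplaceTempView("user")

// explain(true) prints the parsed, analyzed, optimized, and physical plans
// produced by Catalyst; nothing is executed until an action such as show()
// is called (lazy execution).
spark.sql("SELECT name FROM user WHERE age > 29").explain(true)
```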
1.5 What Is a DataSet?
A DataSet is a distributed dataset. Added in Spark 1.6 as a new abstraction, it is an extension of DataFrame. It provides the advantages of RDDs (strong typing and the ability to use powerful lambda functions) together with the advantages of Spark SQL's optimized execution engine. A DataSet also supports functional transformations (operations such as map, flatMap, and filter).
- DataSet is an extension of the DataFrame API and the newest data abstraction in SparkSQL.
- It has a user-friendly API style, offering both compile-time type-safety checks and DataFrame's query optimization.
- Case classes are used to define the structure of the data in a DataSet; each attribute name in the case class maps directly to a field name in the DataSet.
- DataSet is strongly typed; for example, you can have DataSet[Car] or DataSet[Person].
- DataFrame is a special case of DataSet: DataFrame = DataSet[Row], so a DataFrame can be converted to a DataSet with the as method. Row is a type just like Car or Person, and all table structure information is expressed through Row; when retrieving data from a Row, you must specify the field position.
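The last point can be sketched as follows (spark-shell style; the `Person` case class and sample data are hypothetical):

```scala
import org.apache.spark.sql.Row
import spark.implicits._

case class Person(name: String, age: Long)

val df = Seq(("zhangsan", 30L), ("lisi", 28L)).toDF("name", "age")

// DataFrame is DataSet[Row]: fields of a Row are fetched by position.
val firstRow: Row = df.first()
val name = firstRow.getString(0)
val age  = firstRow.getLong(1)

// Converting to a typed DataSet with `as` restores field access by name,
// with the case-class attribute names matched to the column names.
val ds = df.as[Person]
ds.map(p => s"${p.name}: ${p.age}").show()
```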
Source: Atguigu (尚硅谷) Spark course; for learning purposes only.