DataX study notes
2022-07-26 09:05:00 【It is not easy to live in vain】
1. Introduction
DataX is an offline synchronization tool for heterogeneous data sources that is widely used within Alibaba Group. It aims to provide stable, efficient data synchronization between heterogeneous data sources (that is, different databases), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, MaxCompute (formerly ODPS), HBase, and FTP.
1.1 Design concept
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX acting as the intermediate transport that connects all data sources. When a new data source needs to be added, it only has to be connected to DataX to synchronize seamlessly with every existing data source.
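To make the mesh-versus-star argument concrete, here is a small Python sketch that counts what each layout needs. It assumes that every pair of sources must be synchronized in both directions, and that the star layout needs exactly one Reader and one Writer per source; the function names are illustrative and not part of DataX.

```python
def mesh_links(n: int) -> int:
    """Directed point-to-point links when every source syncs with every other one."""
    return n * (n - 1)


def star_plugins(n: int) -> int:
    """Plugins needed with a central hub: one Reader and one Writer per source."""
    return 2 * n


# Compare both layouts for a few data-source counts.
for n in (3, 6, 10):
    print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

Under these assumptions the mesh grows quadratically while the star grows linearly, which is why adding one more source only costs one new Reader/Writer pair.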
1.2 Framework design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture. Reading from and writing to data sources are abstracted into Reader and Writer plugins that are integrated into the overall synchronization framework.
Reader: the data collection module. It collects data from the data source and sends it to the Framework.
Writer: the data writing module. It continuously takes data from the Framework and writes it to the destination.
Framework: connects the Reader and the Writer and serves as the data transmission channel between them. It handles the core technical issues such as buffering, flow control, concurrency, and data conversion.
The open source version of DataX 3.0 runs synchronization jobs in single-machine, multithreaded mode. For details, see: Alibaba Cloud open source offline synchronization tool DataX 3.0 introduction.
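The Reader, Framework, and Writer roles can be illustrated with a toy pipeline. This is a minimal sketch, not DataX code: `ListReader`, `ListWriter`, and `run_task` are hypothetical names, and a bounded `queue.Queue` stands in for the Framework's buffered channel.

```python
import queue
import threading

_END = object()  # sentinel that marks the end of the record stream


class ListReader:
    """Hypothetical Reader plugin: yields records from an in-memory list."""

    def __init__(self, rows):
        self.rows = rows

    def read(self):
        yield from self.rows


class ListWriter:
    """Hypothetical Writer plugin: collects records into an in-memory list."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)


def run_task(reader, writer, capacity=2):
    """Framework role: move records Reader --> channel --> Writer on two threads."""
    channel = queue.Queue(maxsize=capacity)  # bounded buffer between the two sides

    def produce():
        for record in reader.read():
            channel.put(record)  # blocks while the buffer is full
        channel.put(_END)

    def consume():
        while True:
            record = channel.get()
            if record is _END:
                break
            writer.write(record)

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


sink = ListWriter()
run_task(ListReader([(1, "a"), (2, "b"), (3, "c")]), sink)
print(sink.rows)
```

Because the reader and writer only share the channel, any reader can be paired with any writer. The sketch omits error handling: a real framework would also have to stop the writer when the reader fails.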
1.3 Advantages
- Reliable data quality monitoring (so that data reaches the destination intact)
- Rich data conversion functions
- Precise speed control: DataX 3.0 provides three flow-control modes, namely channel (concurrency), record stream, and byte stream. You can control the job speed as needed, so that the job reaches the best synchronization speed the database can tolerate.
- Strong synchronization performance: each Reader plugin has one or more splitting strategies that divide a job into multiple Tasks for parallel execution. With the single-machine, multithreaded execution model, DataX throughput increases linearly with concurrency.
- Robust fault tolerance (multi-level local and global retry)
- Simple to use: ready to run after download, with detailed log output.
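The three flow-control modes map onto speed settings in a job file. The sketch below builds such a fragment in Python. The key layout (`job.setting.speed` with `channel`, `record`, and `byte`) follows DataX job files as I understand them and should be checked against the official user guide; the helper `speed_setting` and the example values are hypothetical.

```python
import json


def speed_setting(channel=None, record=None, byte=None):
    """Build a job-file fragment carrying only the limits that were given.

    Assumed meaning: channel = concurrency, record = records per second,
    byte = bytes per second.
    """
    limits = {"channel": channel, "record": record, "byte": byte}
    speed = {key: value for key, value in limits.items() if value is not None}
    if not speed:
        raise ValueError("set at least one of channel, record, byte")
    return {"job": {"setting": {"speed": speed}}}


# Example values only: 5 concurrent channels, capped at 1 MiB per second.
config = speed_setting(channel=5, byte=1048576)
print(json.dumps(config, indent=2))
```

A complete job file also needs a `content` section with the reader and writer definitions, which is left out here.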
1.4 System requirements
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
- Apache Maven 3.x (to compile DataX)
1.5 Build
Official steps: https://github.com/alibaba/DataX/blob/master/userGuid.md
2. Related concepts
Heterogeneous data sources
The term refers to data held in different database management systems. As an enterprise builds out its information systems, the data management system of each business system is built and rolled out in phases, and technology, economic, and human factors also play a part. As a result, the enterprise accumulates a large amount of business data stored in different ways. The data management systems involved range from simple file databases to complex network databases, and together they form the enterprise's heterogeneous data sources.
The heterogeneity of enterprise data sources shows up in three aspects:
- System heterogeneity: the business application systems, database management systems, and even operating systems that the data sources depend on are different.
- Schema heterogeneity: the data sources differ in their storage model. Storage models mainly include the relational, object, object-relational, and nested-document models, of which the relational model (relational databases) is the mainstream. Even within the same storage model, the model structures may differ. For example, the data types of different relational database management systems such as DB2, Oracle, Sybase, Informix, SQL Server, and FoxPro are not fully consistent.
- Source heterogeneity: the difference between internal data sources and external data sources.
3. DataX 3.0 core architecture
DataX calls a single data synchronization run a Job. When DataX receives a Job, it starts a process to carry out the whole synchronization. The DataX Job module is the central management node of a single job and is responsible for data cleanup, subtask splitting, TaskGroup management, and other functions.
- After the DataX Job starts, it splits the Job into multiple smaller Tasks (subtasks) according to the splitting strategy of each source, so that they can run concurrently.
- Next, the DataX Job calls the Scheduler module, which regroups the split Tasks into TaskGroups (task groups) according to the configured concurrency.
- Every Task is started by its TaskGroup. Once started, a Task always launches Reader --> Channel --> Writer threads to carry out the synchronization.
- After the job is running, the Job monitors the TaskGroups and waits for all of them to finish. When they have all completed, the Job exits successfully (on abnormal exit, the process exit code is non-zero).
Components:
- Job: the management node of a single job, responsible for data cleanup, subtask splitting, and TaskGroup monitoring and management.
- Task: produced by splitting a Job. It is the smallest unit of work in DataX, and each Task is responsible for synchronizing part of the data.
- Scheduler: groups Tasks into TaskGroups. By default, a single TaskGroup has a concurrency of 5.
- TaskGroup: responsible for starting Tasks.
DataX scheduling process:
- First, the DataX Job module splits the job into a number of Tasks according to the database and table sharding. Then, based on the concurrency configured by the user, it calculates how many TaskGroups need to be allocated.
- The calculation: number of TaskGroups = configured concurrency (channels) / concurrency per TaskGroup (5 by default).
- Finally, each TaskGroup runs its Tasks at the concurrency allocated to it.
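The scheduling arithmetic can be sketched in a few lines of Python, using the worked example from the DataX introduction document: 100 Tasks with a configured concurrency of 20 and 5 channels per TaskGroup give 4 TaskGroups of 25 Tasks each. The function name `plan_task_groups`, the round-robin assignment, and the rounding up for uneven values are assumptions made for illustration; the actual DataX scheduler may distribute Tasks differently.

```python
import math

CHANNELS_PER_GROUP = 5  # concurrency of a single TaskGroup, as stated in these notes


def plan_task_groups(task_count, channels, per_group=CHANNELS_PER_GROUP):
    """Return a list of TaskGroups, each a list of Task ids.

    The group count is the configured concurrency divided by the per-group
    concurrency, rounded up (the rounding is an assumption of this sketch).
    """
    if channels < 1 or per_group < 1:
        raise ValueError("channels and per_group must be positive")
    group_count = math.ceil(channels / per_group)
    groups = [[] for _ in range(group_count)]
    for task_id in range(task_count):
        # Round-robin assignment keeps the groups evenly loaded.
        groups[task_id % group_count].append(task_id)
    return groups


groups = plan_task_groups(task_count=100, channels=20)
print(len(groups), [len(group) for group in groups])
```

Each TaskGroup then works through its share of the Tasks with at most 5 running at the same time.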