Analysis and Resolution of Application JAR Conflicts in a YARN Environment
2022-07-19 10:32:00 【javastart】
The Hadoop framework itself bundles many third-party JAR libraries. When Hadoop starts, or when it runs a user's MapReduce or other application, it loads its own preset JARs first. So when a third-party library used by the application also exists in Hadoop's preset directories, but in a different version, Hadoop loads its own preset JAR for the application, and the application often fails to run correctly.
Starting from a real problem we ran into, this article analyzes how a MapReduce program locates JARs at run time in a Hadoop-on-YARN environment, and presents ideas and methods for resolving JAR conflicts.
1. An example of a JAR conflict
One of my MR programs needs a new interface from version 1.9.13 of the jackson library:

Figure 1: The MR program's pom.xml, which depends on jackson 1.9.13
However, the jackson version preset in my Hadoop cluster (the CDH release hadoop-2.3.0-cdh5.1.0) is 1.8.8, located under share/hadoop/mapreduce2/lib/ in the Hadoop installation directory.
I run my MR program with the following command:
hadoop jar mypackage-0.0.1-jar-with-dependencies.jar com.umeng.dp.MainClass --input=../input.pb.lzo --output=/tmp/cuiyang/output/
Because the MR program calls JsonNode.asText(), which is new in 1.9.13 and does not exist in 1.8.8, the job fails with the following error:
…
15/11/13 18:14:33 INFO mapreduce.Job: map 0% reduce 0%
15/11/13 18:14:40 INFO mapreduce.Job: Task Id : attempt_1444449356029_0022_m_000000_0, Status : FAILED
Error: org.codehaus.jackson.JsonNode.asText()Ljava/lang/String;
…
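A quick way to confirm this kind of mismatch is to ask the JVM where a class was loaded from and whether it exposes the method in question. The following is a minimal diagnostic sketch, not part of the original program: the class name ClassOrigin and its helper methods are hypothetical, and it defaults to java.lang.String so that it runs without any extra JARs.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Diagnostic sketch (hypothetical helper): where did a class come from, and does it have a method? */
final class ClassOrigin {

    /** Returns the location the class was loaded from, or a placeholder when the JVM does not report one. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** Returns true if the class can be loaded and exposes a public method with the given name. */
    static boolean hasMethod(String className, String methodName) {
        try {
            for (Method m : Class.forName(className).getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is missing or cannot be linked on this classpath: treat as "no such method".
        }
        return false;
    }

    public static void main(String[] args) {
        // Defaults are placeholders; pass a class name and a method name to check something else.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "length";
        System.out.println(className + "." + methodName + "() present: " + hasMethod(className, methodName));
        try {
            System.out.println("loaded from: " + locationOf(Class.forName(className)));
        } catch (ClassNotFoundException e) {
            System.out.println("class not found on this classpath");
        }
    }
}
```

Run with the arguments org.codehaus.jackson.JsonNode asText on the same classpath as the failing task, this should show which jackson JAR was actually picked up. Calling the two helpers from inside the task itself gives the most reliable answer, since that is the classpath that matters.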
2. How the YARN framework executes an application
Before analyzing how to resolve the JAR conflict, we first need to understand one very important question: how does the user's MR program actually run on a NodeManager? The answer is the key to finding a solution to the conflict.
This article is not an introduction to the YARN framework, so it assumes some basic YARN knowledge: the roles and responsibilities of the five core components, ResourceManager (hereinafter RM), NodeManager (hereinafter NM), AppMaster (hereinafter AM), Client, and Container, and the relationships between them.

Figure 2: YARN architecture
If you are not familiar with how YARN works, that is fine; it will not get in the way of the rest of this article. Here is a brief summary of the key points that will be used later:
Logically, a Container can be thought of simply as a process that runs a Map Task or a Reduce Task (the AM itself is also a Container, which the RM instructs an NM to run). YARN introduced the general concept of a Container in order to abstract over applications from different frameworks;
A Container is started by the AM sending a command to an NM;
A Container is in fact a process started by a shell script; the script runs a Java program, which in turn executes the Map Task or Reduce Task.
Now let's walk through how the MR program runs on an NM.
As noted above, a Map Task or Reduce Task is sent by the AM to a specific NM, which is told to run it. On receiving the AM's command, the NM creates a local directory for each Container, downloads the program files and resource files into that directory, and then prepares to run the Task, which in practice means preparing to start a Container. The NM dynamically generates a script file named launch_container.sh for the Container and then executes it. This file is the key to seeing how a Container is run!
The two lines of the script that are relevant to our problem are:
export CLASSPATH="$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:(... omitted ...):$PWD/*"
exec /bin/bash -c "$JAVA_HOME/bin/java -D(various Java parameters) org.apache.hadoop.mapred.YarnChild 127.0.0.1 58888 (other application parameters)"
Look at the second line first. It turns out that when MapReduce runs on YARN, each Container is just an ordinary Java program whose main entry class is org.apache.hadoop.mapred.YarnChild.
As we know, when the JVM loads a class, it searches the paths in CLASSPATH in the order in which they are declared, and returns as soon as it finds the first matching class, without searching any further. In other words, if two JARs contain the same class, the one declared earlier in CLASSPATH is the one that gets loaded. This is the key to resolving the JAR conflict!
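The first-match-wins rule can be illustrated with a toy model. This sketch does not load any real classes; it only walks an ordered list of entries and stops at the first one that contains the requested class name. The class ClasspathOrder and the two JAR labels are hypothetical and chosen only to mirror the jackson 1.8.8 versus 1.9.13 situation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of first-match-wins lookup along an ordered classpath (no real class loading). */
final class ClasspathOrder {
    static final String JSON_NODE = "org.codehaus.jackson.JsonNode";
    // Illustrative labels only, not real paths.
    static final String PRESET_JAR = "hadoop-preset/jackson-core-asl-1.8.8.jar";
    static final String USER_JAR = "job-dir/jackson-core-asl-1.9.13.jar";

    /**
     * Returns the first entry that contains the class, or null if no entry does.
     * The map's iteration order stands in for the declaration order in CLASSPATH.
     */
    static String resolve(Map<String, List<String>> orderedEntries, String className) {
        for (Map.Entry<String, List<String>> entry : orderedEntries.entrySet()) {
            if (entry.getValue().contains(className)) {
                return entry.getKey(); // stop at the first hit; later entries are never consulted
            }
        }
        return null;
    }

    /** Builds a two-entry classpath in which both entries contain JsonNode. */
    static Map<String, List<String>> classpath(String first, String second) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put(first, Arrays.asList(JSON_NODE));
        entries.put(second, Arrays.asList(JSON_NODE));
        return entries;
    }

    public static void main(String[] args) {
        // Preset JAR declared first: the 1.8.8 entry wins.
        System.out.println(resolve(classpath(PRESET_JAR, USER_JAR), JSON_NODE));
        // User JAR declared first: the 1.9.13 entry wins.
        System.out.println(resolve(classpath(USER_JAR, PRESET_JAR), JSON_NODE));
    }
}
```

Swapping the order of the two entries is the only change between the two calls, which is exactly the lever the rest of this article relies on.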
Now look at the first line, which defines the CLASSPATH variable that the JVM runs with. As you can see, YARN writes Hadoop's preset JAR directories at the very front of CLASSPATH. As a result, any class contained in a Hadoop preset JAR is loaded in preference to a class with the same fully qualified name in the application's JARs!
How does the JVM load classes that are unique to the application (that is, classes with no Hadoop preset counterpart)? Look at the end of the CLASSPATH definition: "/*:$PWD/*". In other words, if a Java class cannot be found anywhere else, the JVM finally looks in the current directory.
What is the current directory? As described above, before running a Container the NM creates a separate directory for it, places the required resources in that directory, and then runs the program. This directory, which holds all of the Container's resources and program files, is the current directory in which the launch_container.sh script runs. If you pass the -libjars parameter when launching the program, the specified JAR files are also copied into this directory. The JVM can then find every JAR in the current directory through the CLASSPATH variable and load the JARs the user depends on.
When I run an application on my machine, this directory is /Users/umeng/worktools/hadoop-2.3.0-cdh5.1.0/ops/tmp/hadoop-umeng/nm-local-dir/usercache/umeng/appcache/application_1444449356029_0023 (the location can be changed through a configuration file, which is not covered here). Its contents are as follows:

Figure 3: The Job's runtime directory on the NM
Now that we know why YARN always loads Hadoop's preset classes and JARs, how do we fix the problem? By reading the source code: find the place where launch_container.sh is dynamically generated, and see whether the order in which the CLASSPATH variable is built can be adjusted so that the Job's runtime current directory comes first.
3. Read the source code and solve the problem
Let's trace through the source code and get to the bottom of this.
First, although the launch_container.sh script is generated by the NM, the NM is only the carrier that runs the Task. What precisely controls how a Container runs should be the brain of the application: the AppMaster. The source code confirms this: a Container's CLASSPATH is passed by MRApps (MapReduce's AM) to the NodeManager, and the NodeManager then writes it into the sh script.
In MRApps, the TaskAttemptImpl::createCommonContainerLaunchContext() method creates a Container, which is then serialized and passed directly to the NM. The call chain of this method is: createContainerLaunchContext() -> getInitialClasspath() -> MRApps.setClasspath(env, conf). Let's look at setClasspath() first:

The method first checks userClassesTakesPrecedence. If this flag is set, MRApps.setMRFrameworkClasspath(environment, conf) is not called. In other words, with the flag set, the user has to provide the CLASSPATH for all JARs.
Next, look at the setMRFrameworkClasspath() method:

Here, DEFAULT_YARN_APPLICATION_CLASSPATH holds all of Hadoop's preset JAR directories. As we can see, the framework first uses the CLASSPATH set in YarnConfiguration.YARN_APPLICATION_CLASSPATH; if that is not set, it falls back to DEFAULT_YARN_APPLICATION_CLASSPATH.
conf.getStrings() then converts the comma-separated configuration string into a string array. Hadoop iterates over the array and calls MRApps.addToEnvironment(environment, Environment.CLASSPATH.name(), c.trim(), conf) for each element to build CLASSPATH.
At this point we can see a glimmer of hope: by default, MRApps uses DEFAULT_YARN_APPLICATION_CLASSPATH as the Task's default CLASSPATH. If we want to change CLASSPATH, it looks as though we need to set YARN_APPLICATION_CLASSPATH so that it is no longer empty.
So we added the following statements to the application:
String[] classpathArray = config.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH, YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);
String cp = "$PWD/*:" + StringUtils.join(":", classpathArray);
config.set(YarnConfiguration.YARN_APPLICATION_CLASSPATH, cp);
These statements first fetch YARN's default setting, DEFAULT_YARN_APPLICATION_CLASSPATH, then prepend the current directory in which the Task runs, and finally store the result in YARN_APPLICATION_CLASSPATH. When MRApps creates a Container, it then uses our modified CLASSPATH, in which the current program directory comes first, as the Container's runtime CLASSPATH.
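One detail worth spelling out is that the entries are joined with ":" rather than ",". Since the configured value is later split on commas, the whole string survives as a single entry that already has $PWD/* at the front. The sketch below imitates this with plain Java strings and no Hadoop dependency. The class ClasspathPrepend, its helper methods, and the two sample entries (abbreviated from the script shown earlier) are assumptions for illustration, not Hadoop's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java imitation of prepending the task directory to a classpath setting (no Hadoop classes). */
final class ClasspathPrepend {

    /** Joins the entries with ':' and puts the task's working-directory wildcard in front. */
    static String prependWorkingDir(String[] entries) {
        return "$PWD/*:" + String.join(":", entries);
    }

    /** Splits a setting on commas and trims each piece, dropping empty pieces. */
    static List<String> splitOnCommas(String value) {
        List<String> parts = new ArrayList<>();
        for (String piece : value.split(",")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Abbreviated sample entries, not the full default list.
        String[] sample = {"$HADOOP_CONF_DIR", "$HADOOP_COMMON_HOME/share/hadoop/common/*"};
        String value = prependWorkingDir(sample);
        System.out.println(value);
        // The value contains no commas, so a comma split leaves it as one entry.
        System.out.println("entries after comma split: " + splitOnCommas(value).size());
    }
}
```

Because the value reaches the Container's CLASSPATH as one colon-separated string, the order inside it is preserved, and $PWD/* is searched before the sample framework directories.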
The last step is to place the JARs we depend on into the Task's running directory, so that class loading picks up the classes we actually need. How do we do that? With the -libjars parameter, as explained earlier. The command to run the program therefore becomes:
hadoop jar ./target/mypackage-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.umeng.dp.MainClass -libjars jackson-mapper-asl-1.9.13.jar,jackson-core-asl-1.9.13.jar --input=../input.pb.lzo --output=/tmp/cuiyang/output/
4. Conclusion
In this article, we resolved a JAR conflict we had run into by analyzing the Hadoop source code.
Even the most mature and complete documentation cannot cover every detail of a product or answer every user question, let alone that of a non-profit open source framework like Hadoop. The advantage of open source is that when you are stuck, you can turn to the source code and find the answer yourself. As Hou Jie put it: "In front of the source code, there are no secrets."