Analysis and solution of application jar package conflict in yarn environment

2022-07-19 10:32:00 【javastart】

The Hadoop framework itself bundles many third-party JAR libraries. Whether Hadoop is starting up or running a user application such as a MapReduce job, it loads its own preset JARs first. As a result, when a third-party library used by the application also exists in Hadoop's preset directories but in a different version, Hadoop loads its own preset JAR for the application, and the application often fails to run correctly.

Starting from a problem we ran into in practice, this article analyzes how JARs are located when a MapReduce program runs in a Hadoop on YARN environment, and presents ideas and a method for resolving JAR conflicts.

1. An example of a JAR conflict

One of my MR programs needs a new interface from version 1.9.13 of the jackson library:

Figure 1: The MR program's pom.xml, which depends on jackson 1.9.13

However, the jackson preset in my Hadoop cluster (the CDH release hadoop-2.3.0-cdh5.1.0) is version 1.8.8, located under share/hadoop/mapreduce2/lib/ in the Hadoop installation directory.

I run my MR program with the following command:

hadoop jar mypackage-0.0.1-jar-with-dependencies.jar com.umeng.dp.MainClass --input=../input.pb.lzo --output=/tmp/cuiyang/output/

The MR program calls JsonNode.asText(), which is new in 1.9.13 and does not exist in 1.8.8, so the job fails with the following error:

…
15/11/13 18:14:33 INFO mapreduce.Job: map 0% reduce 0%
15/11/13 18:14:40 INFO mapreduce.Job: Task Id : attempt_1444449356029_0022_m_000000_0, Status : FAILED
Error: org.codehaus.jackson.JsonNode.asText()Ljava/lang/String;
…
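When an error like this appears, it helps to confirm which JAR a class was actually loaded from and whether the expected method exists. Below is a minimal sketch using only standard Java reflection. The helper class JarConflictProbe and its describe() method are hypothetical names introduced here for illustration; the class and method names passed in main() come from the error above. Run inside a task, it would be expected to report the location of the JAR that won the lookup.

```java
import java.security.CodeSource;

class JarConflictProbe {
    /** Reports where a class was loaded from and whether it has a public no-argument method. */
    static String describe(String className, String methodName) {
        try {
            Class<?> type = Class.forName(className);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader (JDK classes) have no code source.
            String origin = (source == null || source.getLocation() == null)
                    ? "JDK/bootstrap"
                    : source.getLocation().toString();
            String status;
            try {
                type.getMethod(methodName); // throws if no public no-arg method has this name
                status = "method present";
            } catch (NoSuchMethodException e) {
                status = "method missing";
            }
            return className + " from " + origin + ": " + status;
        } catch (ClassNotFoundException e) {
            return className + ": class not found";
        }
    }

    public static void main(String[] args) {
        // Class and method taken from the error message in the job log.
        System.out.println(describe("org.codehaus.jackson.JsonNode", "asText"));
    }
}
```

If the reported location points into Hadoop's share/hadoop directories and the method is missing, the preset JAR has shadowed the application's copy.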

2. How the YARN framework executes an application

Before analyzing how to resolve the JAR conflict, we first need to understand one important question: how does the user's MR program actually run on a NodeManager? This is the key to finding a solution.

This article is not an introduction to YARN, so it assumes some basic YARN knowledge: the roles and responsibilities of the five core components, ResourceManager (RM below), NodeManager (NM below), AppMaster (AM below), Client, and Container, and the relationships between them.

Figure 2: YARN architecture

If you are not familiar with how YARN works, that is fine; it will not prevent you from following the rest of the article. Here is a brief summary of the key points used below:

Logically, a Container can be thought of simply as a process that runs a Map Task or a Reduce Task (the AM itself is also a Container, which the RM instructs an NM to run). YARN introduced the general concept of the Container to abstract over applications built on different frameworks;
A Container is started by the AM sending a command to an NM;
A Container is in fact a process started by a shell script, and that script runs a Java program which executes the Map Task or Reduce Task.

Now let's look at how an MR program runs on an NM.

As mentioned above, a Map Task or Reduce Task is sent by the AM to a designated NM, which is instructed to run it. After receiving the AM's command, the NM creates a local directory for each Container, downloads the program files and resource files into it, and then prepares to run the Task, that is, to start a Container. The NM dynamically generates a script file named launch_container.sh for the Container and then executes it. This file is the key to seeing how a Container runs!

The two lines of the script relevant to our problem are:

export CLASSPATH="$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:(... omitted ...):$PWD/*"
exec /bin/bash -c "$JAVA_HOME/bin/java -D(various Java options) org.apache.hadoop.mapred.YarnChild 127.0.0.1 58888 (other application arguments)"

Look at line 2 first. It turns out that when MapReduce runs on YARN, each Container is just an ordinary Java program whose main entry class is org.apache.hadoop.mapred.YarnChild.

As we know, when the JVM loads a class it searches the paths in CLASSPATH in the order they are declared and returns as soon as it finds the first match, without searching further. In other words, if two JARs contain the same class, the one declared earlier in CLASSPATH is the one that gets loaded. This is the key to resolving our JAR conflict!
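The first-match rule can be observed with the JDK's own URLClassLoader, which searches its URLs in order just as the application class loader searches CLASSPATH. The following is a small sketch under stated assumptions: it uses a resource file named version.txt instead of a real class, and the class name ClasspathOrderDemo and the sample contents are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ClasspathOrderDemo {
    /**
     * Creates two directories that both contain version.txt and returns the
     * content resolved by a loader that searches them in the given order.
     */
    static String resolveWithOrder(String firstEntryContent, String secondEntryContent) {
        try {
            // Temporary directories are left behind for the OS to clean up.
            Path first = Files.createTempDirectory("cp-first");
            Path second = Files.createTempDirectory("cp-second");
            Files.write(first.resolve("version.txt"),
                    firstEntryContent.getBytes(StandardCharsets.UTF_8));
            Files.write(second.resolve("version.txt"),
                    secondEntryContent.getBytes(StandardCharsets.UTF_8));
            // The URL array plays the role of CLASSPATH: entries are searched in order.
            URL[] searchPath = {first.toUri().toURL(), second.toUri().toURL()};
            try (URLClassLoader loader = new URLClassLoader(searchPath, null);
                 InputStream in = loader.getResourceAsStream("version.txt");
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                return reader.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Whichever entry is listed first wins, regardless of version.
        System.out.println(resolveWithOrder("jackson 1.8.8 (preset)", "jackson 1.9.13 (application)"));
        System.out.println(resolveWithOrder("jackson 1.9.13 (application)", "jackson 1.8.8 (preset)"));
    }
}
```

Swapping the two entries is enough to change which copy is found, which is exactly the lever used later in this article.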

Now look at line 1, which defines the CLASSPATH variable that the JVM runs with. YARN writes Hadoop's preset JARs at the very front of CLASSPATH. So any class contained in Hadoop's preset JARs is loaded in preference to a class with the same fully qualified name in the application's JARs!

How does the JVM load classes that are unique to the application (that is, classes not preset in Hadoop)? Look at the end of the CLASSPATH definition: "/*:$PWD/*". If a class cannot be found anywhere else, the JVM finally looks in the current directory.

What is the current directory? As mentioned above, before running a Container the NM creates a separate directory for it, places the required resources there, and then runs the program. This directory holds all of the Container's resources and program files, and it is the current directory in which launch_container.sh runs. If you pass the -libjars option when launching the program, the specified JAR files are also copied into this directory. The JVM can then find all the JARs in the current directory through CLASSPATH and load the JARs that the user's program references.
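A classpath entry of the form dir/* is expanded by the Java launcher to the files directly inside dir whose names end in .jar or .JAR; other files and subdirectories are ignored. The sketch below mimics that rule for a directory so you can see what $PWD/* would pick up. WildcardEntrySketch and its methods are hypothetical helpers, the sample file names are only placeholders, and the result is sorted for readability because the JVM does not specify the order of the expanded entries.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class WildcardEntrySketch {
    /** Lists the file names that a classpath entry "dir/*" would match. */
    static List<String> jarsMatchedByWildcard(File dir) {
        List<String> jars = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) {
                String name = child.getName();
                // Only regular files ending in .jar or .JAR count; no recursion.
                if (child.isFile() && (name.endsWith(".jar") || name.endsWith(".JAR"))) {
                    jars.add(name);
                }
            }
        }
        Collections.sort(jars); // sorted only to make the output stable
        return jars;
    }

    /** Builds a throwaway directory that loosely resembles a container working directory. */
    static File sampleContainerDir() {
        try {
            File dir = Files.createTempDirectory("container").toFile();
            String[] names = {
                "jackson-core-asl-1.9.13.jar",
                "jackson-mapper-asl-1.9.13.jar",
                "launch_container.sh"
            };
            for (String name : names) {
                new File(dir, name).createNewFile();
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(jarsMatchedByWildcard(sampleContainerDir()));
    }
}
```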

When I run an application on my machine, the directory is /Users/umeng/worktools/hadoop-2.3.0-cdh5.1.0/ops/tmp/hadoop-umeng/nm-local-dir/usercache/umeng/appcache/application_1444449356029_0023 (the location can be set in the configuration files; details omitted). Its contents are as follows:

Figure 3: The Job's runtime directory on the NM

We now know why YARN always loads Hadoop's preset classes and JARs. How do we solve the problem? By reading the source code! We need to find where launch_container.sh is generated and see whether the order in which CLASSPATH is built can be adjusted, so that the Job's runtime current directory moves to the front of CLASSPATH.

3. Reading the source code and solving the problem

Let's trace through the source code and get to the bottom of this.

First, although launch_container.sh is generated by the NM, the NM is only the carrier that runs the Task. The component that really controls how a Container runs should be the brain of the application: the AppMaster. The source code confirms this: the Container's CLASSPATH is passed to the NodeManager by MRApps (a helper class used by MapReduce's AM), and the NodeManager then writes it into the sh script.

In the MapReduce AM, TaskAttemptImpl::createCommonContainerLaunchContext() creates the Container's launch context, which is then serialized and passed directly to the NM. The call chain is: createContainerLaunchContext() -> getInitialClasspath() -> MRApps.setClasspath(env, conf). Let's look at setClasspath() first:

It first checks userClassesTakesPrecedence. If this flag is set, MRApps.setMRFrameworkClasspath(environment, conf) is not called. In other words, with this flag set, the user has to set up the CLASSPATH for all JARs themselves.

Now let's look at setMRFrameworkClasspath():

Here DEFAULT_YARN_APPLICATION_CLASSPATH holds all of the directories of Hadoop's preset JARs. As we can see, the framework first uses YarnConfiguration.YARN_APPLICATION_CLASSPATH to set CLASSPATH and, if it is not set, falls back to DEFAULT_YARN_APPLICATION_CLASSPATH.

conf.getStrings() then splits the configured string on commas into a string array. Hadoop iterates over the array and calls MRApps.addToEnvironment(environment, Environment.CLASSPATH.name(), c.trim(), conf) for each element to build CLASSPATH.
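The lookup and loop just described can be modeled in plain Java. This is a simplified sketch, not the Hadoop source: getStrings() and buildClasspath() are hypothetical stand-ins for conf.getStrings() and the loop over MRApps.addToEnvironment(), and the directory values in main() are made up. It assumes ':' as the separator because the container script runs under a Unix shell.

```java
class FrameworkClasspathSketch {
    /** Stand-in for conf.getStrings(name, defaults): comma-split the value, or use the defaults when unset. */
    static String[] getStrings(String configuredValue, String... defaults) {
        if (configuredValue == null) {
            return defaults;
        }
        return configuredValue.split(",");
    }

    /** Stand-in for the addToEnvironment loop: trim each entry and append it to CLASSPATH in order. */
    static String buildClasspath(String[] entries) {
        StringBuilder classpath = new StringBuilder();
        for (String entry : entries) {
            if (classpath.length() > 0) {
                classpath.append(':');
            }
            classpath.append(entry.trim());
        }
        return classpath.toString();
    }

    public static void main(String[] args) {
        String[] defaults = {"$HADOOP_CONF_DIR", "$HADOOP_COMMON_HOME/share/hadoop/common/*"};
        // Property unset: the defaults are used, in their declared order.
        System.out.println(buildClasspath(getStrings(null, defaults)));
        // Property set: the configured entries replace the defaults entirely.
        System.out.println(buildClasspath(getStrings("$PWD/*, /opt/extra/*", defaults)));
    }
}
```

The point to take away is that the configured value, when present, fully determines both the entries and their order.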

At this point we see a glimmer of hope: by default, MRApps uses DEFAULT_YARN_APPLICATION_CLASSPATH as the Task's CLASSPATH. To change CLASSPATH, it looks like we need to set YARN_APPLICATION_CLASSPATH so that it is not empty.

So we added the following statements to the application:

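// Read the configured YARN application classpath, falling back to YARN's default.
String[] classpathArray = config.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH, YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);
// Put the Task's working directory first (StringUtils is presumably org.apache.hadoop.util.StringUtils, whose join() takes the separator first).
String cp = "$PWD/*:" + StringUtils.join(":", classpathArray);
config.set(YarnConfiguration.YARN_APPLICATION_CLASSPATH, cp);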

These statements first read YARN_APPLICATION_CLASSPATH (falling back to YARN's default, DEFAULT_YARN_APPLICATION_CLASSPATH, when it is unset), prepend the current directory in which the Task runs, and write the result back to YARN_APPLICATION_CLASSPATH. When MRApps creates the Container, it then uses our modified CLASSPATH, in which the program's current directory comes first, as the Container's runtime CLASSPATH.

The last step is to get the JARs we depend on into the Task's working directory, so that class loading picks up the classes we actually need. How? With the -libjars option, as explained earlier. The command to run the program becomes:
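One way to confirm that the change took effect is to check, from inside a task, that entries under the Container's working directory appear before Hadoop's preset directories in the JVM's java.class.path property. This is a minimal sketch with a hypothetical helper class, ClasspathOrderCheck; the /opt/hadoop prefix in main() is a made-up example, and the sketch assumes the Java launcher has already expanded wildcard entries into concrete JAR paths.

```java
import java.io.File;

class ClasspathOrderCheck {
    /** Returns the index of the first classpath entry that starts with dirPrefix, or -1 if none does. */
    static int firstIndexUnder(String classpath, String dirPrefix) {
        String[] entries = classpath.split(File.pathSeparator);
        for (int i = 0; i < entries.length; i++) {
            if (entries[i].startsWith(dirPrefix)) {
                return i;
            }
        }
        return -1;
    }

    /** True when some entry under userDir is listed before every entry under frameworkDir. */
    static boolean precedes(String classpath, String userDir, String frameworkDir) {
        int user = firstIndexUnder(classpath, userDir);
        int framework = firstIndexUnder(classpath, frameworkDir);
        return user >= 0 && (framework < 0 || user < framework);
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        String workingDir = System.getProperty("user.dir");
        // "/opt/hadoop" is a placeholder for the Hadoop installation directory.
        System.out.println(precedes(classpath, workingDir, "/opt/hadoop"));
    }
}
```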

hadoop jar ./target/mypackage-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.umeng.dp.MainClass -libjars jackson-mapper-asl-1.9.13.jar,jackson-core-asl-1.9.13.jar --input=../input.pb.lzo --output=/tmp/cuiyang/output/

4. Conclusion

In this article, we resolved a JAR conflict we ran into by analyzing the Hadoop source code.

Even the most mature and complete documentation cannot cover every detail of a product or answer every user's question, all the more so for a non-profit open source framework like Hadoop. The advantage of open source is that when you are stuck, you can turn to the source code and find the answer yourself. As Hou Jie put it: "With the source code in front of you, there are no secrets."


Copyright notice
This article was written by [javastart]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/200/202207171214585496.html