Spark Persistence Strategy: Cache Optimization
2022-07-26 08:31:00 【StephenYYYou】
Contents
- Spark Persistence Strategy: Cache Optimization
- MEMORY_ONLY and MEMORY_AND_DISK

Spark Persistence Strategy: Cache Optimization
RDD Persistence Strategies
When an RDD needs to be reused frequently, Spark can persist it through two methods, persist() and cache(), as shown below:

//scala
myRDD.persist()
myRDD.cache()

Why use persistence? Because after an action on RDD1 produces RDD2, RDD1 is dropped from memory. If a later operation needs RDD1 again, Spark walks back up the lineage, re-reads the source data, and recomputes RDD1 before continuing the calculation. That extra disk I/O and computation is wasted work; persistence keeps the data around for the next action.
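To make the benefit concrete, here is a minimal sketch, not from the original post: the input path, application name, and variable names are all illustrative, and a local-mode SparkContext is assumed. One cached RDD feeds two actions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes a Spark environment; the path and names are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

val logs = sc.textFile("hdfs:///data/app.log")
val errors = logs.filter(_.contains("ERROR"))

// Without this call, each action below would re-read the file and re-run the filter.
errors.cache() // equivalent to errors.persist(StorageLevel.MEMORY_ONLY)

val total = errors.count()   // first action: computes the partitions and caches them
val sample = errors.take(10) // second action: served from the cached partitions
```

Note that cache() itself is lazy; the cache is only materialized when the first action runs.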
Source code of cache and persist
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()

As the source shows, cache() is just a parameterless call to persist(), so we only need to study persist(). With no arguments, persist() defaults to StorageLevel.MEMORY_ONLY. Let's look at the source of the StorageLevel class.
/**
* Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
* new storage levels.
*/
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

  /**
   * :: DeveloperApi ::
   * Return the StorageLevel object with the specified name.
   */
  @DeveloperApi
  def fromString(s: String): StorageLevel = s match {
    case "NONE" => NONE
    case "DISK_ONLY" => DISK_ONLY
    case "DISK_ONLY_2" => DISK_ONLY_2
    case "MEMORY_ONLY" => MEMORY_ONLY
    case "MEMORY_ONLY_2" => MEMORY_ONLY_2
    case "MEMORY_ONLY_SER" => MEMORY_ONLY_SER
    case "MEMORY_ONLY_SER_2" => MEMORY_ONLY_SER_2
    case "MEMORY_AND_DISK" => MEMORY_AND_DISK
    case "MEMORY_AND_DISK_2" => MEMORY_AND_DISK_2
    case "MEMORY_AND_DISK_SER" => MEMORY_AND_DISK_SER
    case "MEMORY_AND_DISK_SER_2" => MEMORY_AND_DISK_SER_2
    case "OFF_HEAP" => OFF_HEAP
    case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: $s")
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int): StorageLevel = {
    getCachedStorageLevel(
      new StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication))
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object without setting useOffHeap.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      deserialized: Boolean,
      replication: Int = 1): StorageLevel = {
    getCachedStorageLevel(new StorageLevel(useDisk, useMemory, false, deserialized, replication))
  }
}

As you can see, StorageLevel takes the following parameters:
| Parameter (default) | Meaning |
| --- | --- |
| useDisk: Boolean | whether to persist to disk |
| useMemory: Boolean | whether to persist in memory |
| useOffHeap: Boolean | whether to use off-heap memory (outside the JVM heap) |
| deserialized: Boolean | whether to store data as deserialized objects (false means serialized) |
| replication: Int = 1 | number of replicas, for fault tolerance |
So we can read off each level:
- NONE: the default (no persistence)
- DISK_ONLY: cached on disk only
- DISK_ONLY_2: cached on disk only, with 2 replicas
- MEMORY_ONLY: cached in memory only
- MEMORY_ONLY_2: cached in memory only, with 2 replicas
- MEMORY_ONLY_SER: cached in memory only, serialized
- MEMORY_ONLY_SER_2: cached in memory only, serialized, with 2 replicas
- MEMORY_AND_DISK: cached in memory; partitions that do not fit spill to disk
- MEMORY_AND_DISK_2: like MEMORY_AND_DISK, with 2 replicas
- MEMORY_AND_DISK_SER: like MEMORY_AND_DISK, serialized
- MEMORY_AND_DISK_SER_2: like MEMORY_AND_DISK, serialized, with 2 replicas
- OFF_HEAP: cached in off-heap memory
Serialization is comparable to compression: it saves storage space but adds CPU cost, since the data must be serialized and deserialized on every use. The default replication is 1; additional replicas guard against data loss and improve fault tolerance. OFF_HEAP stores the RDD in off-heap memory (historically Tachyon, now Alluxio), which lowers garbage-collection cost; a passing familiarity is enough. DISK_ONLY needs no further comment. The main comparison below is between MEMORY_ONLY and MEMORY_AND_DISK.
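To use any level other than the default, pass it to persist() explicitly. A sketch under assumptions not in the original post (an existing SparkContext `sc`, illustrative data):

```scala
import org.apache.spark.storage.StorageLevel

// Serialized memory-and-disk caching: trades CPU (ser/deser) for space.
val nums = sc.parallelize(1 to 1000000)
nums.persist(StorageLevel.MEMORY_AND_DISK_SER)

nums.count() // first action materializes the cache

// Release the cached blocks once the RDD is no longer needed.
nums.unpersist()
```

Note that calling persist() again with a different level on an already-persisted RDD throws an exception; call unpersist() first, then persist() with the new level.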
MEMORY_ONLY and MEMORY_AND_DISK
MEMORY_ONLY: the RDD is cached in memory only; partitions that do not fit are recomputed from the lineage each time they are used.
MEMORY_AND_DISK: as much as possible is kept in memory, and partitions that do not fit are spilled to disk, avoiding recomputation.
Intuitively, MEMORY_ONLY looks less efficient because it may have to recompute, but in practice the recomputation runs in memory and usually costs far less than disk I/O, so MEMORY_ONLY is the sensible default. Only when the intermediate computation is particularly expensive does MEMORY_AND_DISK become the better choice.
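The rule of thumb above can be sketched as follows. This is illustrative only: `sc` is an assumed SparkContext, the paths are placeholders, and `expensiveParse` is a hypothetical costly function, none of which appear in the original post.

```scala
import org.apache.spark.storage.StorageLevel

// Cheap-to-recompute data: MEMORY_ONLY, since recomputing beats disk I/O.
val filtered = sc.textFile("hdfs:///data/in").filter(_.nonEmpty)
filtered.persist(StorageLevel.MEMORY_ONLY)

// Expensive intermediate result: spilling to disk beats recomputing.
val parsed = sc.textFile("hdfs:///data/in")
  .map(line => expensiveParse(line)) // expensiveParse is hypothetical
parsed.persist(StorageLevel.MEMORY_AND_DISK)

// You can inspect the level actually in effect:
println(parsed.getStorageLevel.description)
```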