Spark persistence strategy: cache optimization
2022-07-26 08:31:00 【StephenYYYou】
RDD persistence strategies
When an RDD needs to be reused frequently, Spark lets you persist it. Two methods are available for this, persist() and cache(), as shown below:
//scala
myRDD.persist()
myRDD.cache()
Why use persistence?
By default, Spark does not keep an RDD's data in memory once it has been used. Suppose RDD1 is transformed into RDD2 and an action runs: afterwards, RDD1's data is gone from memory. If a later operation needs RDD1 again, Spark walks all the way back up the lineage, rereads the source data, recomputes RDD1, and only then continues. This adds disk I/O and computation cost. Persistence keeps the data so that it is ready the next time an action needs it.
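The effect of reuse without persistence can be sketched in plain Scala, without Spark. The class `TinyLineage` and its `computeCount` field are hypothetical stand-ins invented for this illustration; they are not part of Spark's API and only mimic the idea that every action re-evaluates the lineage unless the result was marked for persistence.

```scala
// A plain-Scala model (no Spark required) of recomputation versus persistence.
// TinyLineage is a hypothetical stand-in for an RDD, not a Spark class.
object RecomputeDemo {
  final class TinyLineage[A](compute: () => Seq[A]) {
    private var cached: Option[Seq[A]] = None
    private var persistRequested = false
    var computeCount = 0 // how many times the lineage was actually evaluated

    // Only marks the data to be kept; nothing is computed until an action runs.
    def persist(): this.type = { persistRequested = true; this }

    // Every "action" goes through here: reuse the kept data or recompute.
    private def materialize(): Seq[A] = cached.getOrElse {
      computeCount += 1
      val data = compute()
      if (persistRequested) cached = Some(data)
      data
    }

    def count(): Int = materialize().size
    def collect(): Seq[A] = materialize()
  }

  def main(args: Array[String]): Unit = {
    val plain = new TinyLineage[Int](() => (1 to 5).map(_ * 2))
    plain.count(); plain.collect()
    println(s"without persist: lineage evaluated ${plain.computeCount} times")

    val kept = new TinyLineage[Int](() => (1 to 5).map(_ * 2)).persist()
    kept.count(); kept.collect()
    println(s"with persist: lineage evaluated ${kept.computeCount} times")
  }
}
```

In this model, two actions on the unpersisted value evaluate the lineage twice, while the persisted value is evaluated once and then reused.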
Source code of cache and persist
def persist(newLevel: StorageLevel): this.type = {
if (isLocallyCheckpointed) {
// This means the user previously called localCheckpoint(), which should have already
// marked this RDD for persisting. Here we should override the old storage level with
// one that is explicitly requested by the user (after adapting it to use disk).
persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
} else {
persist(newLevel, allowOverride = false)
}
}
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
The source code shows that cache() is simply a call to the no-argument persist(), so we only need to study persist. The no-argument persist() uses StorageLevel.MEMORY_ONLY as its default. Next, let's look at the source code of StorageLevel.
/**
* Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
* new storage levels.
*/
object StorageLevel {
val NONE = new StorageLevel(false, false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
/**
* :: DeveloperApi ::
* Return the StorageLevel object with the specified name.
*/
@DeveloperApi
def fromString(s: String): StorageLevel = s match {
case "NONE" => NONE
case "DISK_ONLY" => DISK_ONLY
case "DISK_ONLY_2" => DISK_ONLY_2
case "MEMORY_ONLY" => MEMORY_ONLY
case "MEMORY_ONLY_2" => MEMORY_ONLY_2
case "MEMORY_ONLY_SER" => MEMORY_ONLY_SER
case "MEMORY_ONLY_SER_2" => MEMORY_ONLY_SER_2
case "MEMORY_AND_DISK" => MEMORY_AND_DISK
case "MEMORY_AND_DISK_2" => MEMORY_AND_DISK_2
case "MEMORY_AND_DISK_SER" => MEMORY_AND_DISK_SER
case "MEMORY_AND_DISK_SER_2" => MEMORY_AND_DISK_SER_2
case "OFF_HEAP" => OFF_HEAP
case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: $s")
}
/**
* :: DeveloperApi ::
* Create a new StorageLevel object.
*/
@DeveloperApi
def apply(
useDisk: Boolean,
useMemory: Boolean,
useOffHeap: Boolean,
deserialized: Boolean,
replication: Int): StorageLevel = {
getCachedStorageLevel(
new StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication))
}
/**
* :: DeveloperApi ::
* Create a new StorageLevel object without setting useOffHeap.
*/
@DeveloperApi
def apply(
useDisk: Boolean,
useMemory: Boolean,
deserialized: Boolean,
replication: Int = 1): StorageLevel = {
getCachedStorageLevel(new StorageLevel(useDisk, useMemory, false, deserialized, replication))
}
As you can see, the StorageLevel parameters are:
| Parameter (default) | Meaning |
| --- | --- |
| useDisk: Boolean | Whether to persist to disk |
| useMemory: Boolean | Whether to persist in memory |
| useOffHeap: Boolean | Whether to use off-heap memory (outside the JVM heap) |
| deserialized: Boolean | Whether the data is stored deserialized (true: as objects; false: in serialized form) |
| replication: Int (1) | Number of replicas, for fault tolerance |
From this we can work out what each level means:
- NONE: no persistence (the storage level of an RDD that has not been persisted)
- DISK_ONLY: cached only on disk
- DISK_ONLY_2: cached only on disk, with 2 replicas
- MEMORY_ONLY: cached only in memory, as deserialized objects
- MEMORY_ONLY_2: cached only in memory, with 2 replicas
- MEMORY_ONLY_SER: cached only in memory, in serialized form
- MEMORY_ONLY_SER_2: cached only in memory, in serialized form, with 2 replicas
- MEMORY_AND_DISK: cached in memory; partitions that do not fit are cached on disk
- MEMORY_AND_DISK_2: cached in memory; partitions that do not fit are cached on disk; 2 replicas
- MEMORY_AND_DISK_SER: cached in memory in serialized form; partitions that do not fit are cached on disk
- MEMORY_AND_DISK_SER_2: cached in memory in serialized form; partitions that do not fit are cached on disk; 2 replicas
- OFF_HEAP: cached in off-heap memory, in serialized form
Serialization works somewhat like compression: it saves storage space but adds computation cost, because the data has to be serialized when it is stored and deserialized every time it is used. The default number of replicas is 1; the _2 levels keep 2 replicas to guard against data loss and improve fault tolerance. OFF_HEAP stores the RDD outside the JVM heap (early Spark versions used Tachyon for this), which lowers garbage collection cost; it is enough to be aware of it. DISK_ONLY needs no further explanation. The rest of this article compares MEMORY_ONLY and MEMORY_AND_DISK.
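To make the mapping from constructor flags to behavior concrete, here is a small plain-Scala sketch. The case class `Level` and the function `describe` are hypothetical names made up for this example and are not Spark's `StorageLevel`. The flag values are copied from the source listing above, and only a subset of the levels is included.

```scala
// A plain-Scala model of the five StorageLevel flags; not Spark's own class.
object StorageLevelModel {
  // Hypothetical mirror of the constructor arguments shown in the source listing.
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int = 1)

  // Flag values copied from the StorageLevel source quoted in this article.
  val levels: Map[String, Level] = Map(
    "NONE" -> Level(false, false, false, false),
    "DISK_ONLY" -> Level(true, false, false, false),
    "MEMORY_ONLY" -> Level(false, true, false, true),
    "MEMORY_ONLY_SER" -> Level(false, true, false, false),
    "MEMORY_AND_DISK" -> Level(true, true, false, true),
    "MEMORY_AND_DISK_SER_2" -> Level(true, true, false, false, 2),
    "OFF_HEAP" -> Level(true, true, true, false, 1)
  )

  // Turn the flags into a short human-readable description.
  def describe(level: Level): String =
    if (!level.useDisk && !level.useMemory) "not persisted"
    else {
      val where =
        if (level.useMemory && level.useDisk) "memory, overflow to disk"
        else if (level.useMemory) "memory only"
        else "disk only"
      val heap = if (level.useOffHeap) ", off-heap" else ""
      val form = if (level.deserialized) "deserialized objects" else "serialized bytes"
      s"$where$heap, $form, replication=${level.replication}"
    }

  def main(args: Array[String]): Unit =
    levels.toSeq.sortBy(_._1).foreach { case (name, level) =>
      println(s"$name: ${describe(level)}")
    }
}
```

Reading the flags this way shows that the level names are only shorthand: each one is a combination of where the data lives, whether it is serialized, and how many replicas are kept.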
MEMORY_ONLY and MEMORY_AND_DISK
MEMORY_ONLY: the RDD is cached in memory only. Partitions that do not fit in memory are not cached; they are recomputed from the lineage (rereading the source data if necessary) each time they are needed.
MEMORY_AND_DISK: Spark tries to keep partitions in memory, and partitions that do not fit are saved to disk, which avoids recomputing them.
Intuitively, MEMORY_ONLY looks less efficient because it may have to recompute partitions. In practice, recomputing in memory often takes much less time than disk I/O, so MEMORY_ONLY is the usual default. MEMORY_AND_DISK becomes the better choice when the intermediate computation is particularly expensive.
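The trade-off can be written down as a rough back-of-the-envelope model. Everything in this sketch is an assumption made for illustration: the object `LevelChooser`, its parameter names, and the cost formulas are hypothetical, and the millisecond figures would have to come from your own measurements. It only counts the extra cost of the partitions that did not fit in memory.

```scala
// A rough cost model for picking between the two levels; not a Spark API.
object LevelChooser {
  // MEMORY_ONLY: each partition that did not fit is recomputed on every reuse.
  def memoryOnlyCost(evicted: Int, recomputeMs: Double, reuses: Int): Double =
    evicted * recomputeMs * reuses

  // MEMORY_AND_DISK: each such partition is written once, then read on every reuse.
  def memoryAndDiskCost(
      evicted: Int,
      diskWriteMs: Double,
      diskReadMs: Double,
      reuses: Int): Double =
    evicted * (diskWriteMs + diskReadMs * reuses)

  // Pick the level with the lower estimated cost (ties go to MEMORY_ONLY).
  def choose(
      evicted: Int,
      recomputeMs: Double,
      diskWriteMs: Double,
      diskReadMs: Double,
      reuses: Int): String = {
    val memOnly = memoryOnlyCost(evicted, recomputeMs, reuses)
    val memDisk = memoryAndDiskCost(evicted, diskWriteMs, diskReadMs, reuses)
    if (memOnly <= memDisk) "MEMORY_ONLY" else "MEMORY_AND_DISK"
  }

  def main(args: Array[String]): Unit = {
    // Cheap recomputation: 5 ms per partition against 20 ms write and 10 ms read.
    println(choose(evicted = 10, recomputeMs = 5.0, diskWriteMs = 20.0, diskReadMs = 10.0, reuses = 3))
    // Expensive recomputation: 200 ms per partition with the same disk costs.
    println(choose(evicted = 10, recomputeMs = 200.0, diskWriteMs = 20.0, diskReadMs = 10.0, reuses = 3))
  }
}
```

The model matches the reasoning above: when recomputation is cheap relative to disk I/O it favors MEMORY_ONLY, and once the per-partition computation becomes expensive it switches to MEMORY_AND_DISK.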
Summary
