Fundamentals of crawlers - basic principles of multithreading and multiprocessing
2022-07-19 07:15:00 【W_ chuanqi】
Personal profile
About the author: Hello everyone, I am W_chuanqi, a programming enthusiast.
Personal home page: W_chaunqi
A thought to share: "If you are in the mire and your heart is in the mire too, then everything you see is mud; if you are in the mire but your heart longs for the Kunpeng, then you can see ninety thousand li of heaven and earth."

Chapter 1 Crawler basics
1.6 Basic principles of multithreading and multiprocessing
On a computer, we can run several programs at once, for example browsing the web, listening to music, and typing at the same time, and we think nothing of it. But why can a computer run so many programs at the same time? The answer involves two computing terms: multiprocessing and multithreading.
Similarly, when writing crawlers, we may run multiple crawl tasks at the same time to improve crawling efficiency, which also involves multiprocessing and multithreading.
1. The meaning of multithreading
To explain multithreading, we first have to explain what a thread is, and to explain threads, we first have to explain what a process is.
A process can be understood as a program unit that can run independently. Opening a browser starts a browser process; opening a text editor starts a text editor process. A single process can handle many things at once. In a browser process, for example, you can open multiple pages in multiple tabs: some play music, some play video, and some play animations. These tasks run at the same time without interfering with each other. Why can so many tasks run at the same time? This brings us to the concept of a thread: in this picture, each task corresponds to a thread.
A process is a collection of threads: it is made up of one or more threads. A thread is the smallest unit that the operating system schedules for execution, and the smallest unit of execution within a process. In the browser process above, playing music is one thread and playing video is another. Many other threads also run in the browser process, and because these threads execute concurrently or in parallel, the browser as a whole can run multiple tasks at the same time.
Once the concept of a thread is clear, multithreading is easy to understand: multithreading means multiple threads executing at the same time within one process. The browser process above is a typical multithreaded program.
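As a rough illustration of one process running several threads, here is a minimal sketch using Python's standard `threading` module. The `play` function and the task names "music" and "video" are made up for this example and only mirror the browser analogy above.

```python
import threading

def play(task_name, log):
    # Each task records which thread it ran in.
    log.append(f"{task_name} running in {threading.current_thread().name}")

log = []
threads = [
    threading.Thread(target=play, args=(name, log), name=f"{name}-thread")
    for name in ("music", "video")
]

for t in threads:
    t.start()   # begin executing the task in its own thread
for t in threads:
    t.join()    # wait for every thread to finish

for line in sorted(log):
    print(line)
```

Both threads live in the same process and write to the same `log` list, which is what makes threads convenient for cooperating tasks.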
2. Concurrency and parallelism
When discussing multiprocessing and multithreading, two more terms need introducing: concurrency and parallelism. As we know, when a program runs on a computer, the underlying work is done by the processor executing instructions.
A single processor core works on only one thread's instructions at any given moment. Concurrency means that the instructions of multiple threads are executed in rapid rotation. For example, a processor executes thread A for a while, then thread B for a while, then switches back to thread A for a while. Because the processor executes instructions and switches threads so quickly, people do not notice the context switches between threads, so the threads appear to run at the same time. At the micro level, however, the processor keeps switching among the threads, and each thread occupies one time slice of the processor while it runs, so only one thread is actually executing at any given moment.
Parallelism means that multiple instructions are executed on multiple processors at the same moment, so parallelism necessarily depends on having multiple processors. Whether viewed at the macro or the micro level, the threads really are executing at the same moment.
Parallelism can exist only in multiprocessor (or multi-core) systems, so a computer whose processor has only one core cannot achieve parallelism. Concurrency can exist in both single-processor and multiprocessor systems, because a single core is enough to achieve concurrency.
Suppose, for example, that the system needs to run multiple threads at the same time. If the processor has only one core, it can only run these threads concurrently. If the processor has multiple cores, then while one core executes one thread, another core can execute another thread, so the two threads run in parallel. Of course, a thread may also share a core with other threads, in which case those threads run concurrently with each other. How the threads are actually executed depends on how the operating system schedules them.
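Whether parallelism is even possible on a given machine depends on how many cores it has. A small sketch for checking this from Python with the standard `os.cpu_count()` function (the printed messages are just illustrative wording):

```python
import os

# Number of logical CPUs, or None if it cannot be determined.
cores = os.cpu_count()

if cores is None:
    print("core count unknown")
elif cores == 1:
    print("1 core: threads can only run concurrently")
else:
    print(f"{cores} cores: parallel execution is possible, subject to OS scheduling")
```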
3. Scenarios where multithreading is suitable
While a program runs, some operations are time-consuming or involve waiting, such as waiting for a database query to return or for a web page to respond. With a single thread, the processor has to wait for such an operation to finish before doing anything else, even though it could clearly be doing other work in the meantime. With multithreading, the processor can execute other threads while one thread is waiting, which improves overall execution efficiency.
Many situations look like this, with threads that have to wait during execution. A web crawler is a typical example: after the crawler sends a request to the server, it has to wait some time for the server to return a response. Tasks like this are I/O-bound. For such tasks, if we enable multithreading, the processor can work on other threads while one thread is waiting, which improves overall crawling efficiency.
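The effect on an I/O-bound task can be sketched with a thread pool. To keep the example self-contained, no real network request is made: `fake_fetch` is a hypothetical stand-in that just sleeps for 0.2 seconds to simulate waiting for a server, and the `example.com` URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.2):
    """Stand-in for a network request: sleep to simulate waiting for the server."""
    time.sleep(delay)
    return f"response from {url}"

def crawl(urls, max_workers=5):
    # Each URL is handled by a worker thread; while one thread waits,
    # the others can proceed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))

urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
pages = crawl(urls)
elapsed = time.perf_counter() - start

print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Fetching the five pages one after another would take at least 1 second of waiting in total. With five threads the waits overlap, so the elapsed time should be much closer to a single 0.2 second wait, although the exact figure depends on the machine.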
But not all tasks are I/O-bound. Another kind is the compute-intensive task, also called a CPU-bound task. As the name suggests, such a task needs the processor the whole time. Suppose we enable multithreading and the processor switches from one compute-intensive task to another: the processor never idles and is always busy computing, so no overall time is saved, because the total amount of computation is unchanged. If there are too many threads, even more time is spent on thread switching, which lowers overall efficiency.
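One way to see this for yourself is to time the same pure computation run sequentially and run in a thread pool. The sketch below does that in Python; `busy_sum` and the job sizes are arbitrary choices for illustration, and no timings are quoted because they vary from machine to machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy_sum(n):
    """Pure computation with no waiting: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_sequential(jobs):
    return [busy_sum(n) for n in jobs]

def run_threaded(jobs, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(busy_sum, jobs))

def timed(func, jobs):
    # Return the function's result together with its wall-clock time.
    start = time.perf_counter()
    result = func(jobs)
    return result, time.perf_counter() - start

jobs = [300_000] * 4
seq_result, seq_time = timed(run_sequential, jobs)
thr_result, thr_time = timed(run_threaded, jobs)

print(f"sequential: {seq_time:.3f}s  threaded: {thr_time:.3f}s")
print("same results:", seq_result == thr_result)
```

Both versions do the same total amount of computation, so the threaded run is not expected to finish meaningfully faster here, unlike the waiting-dominated case.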
In summary, if the tasks are not all compute-intensive, multithreading can improve a program's overall execution efficiency. For I/O-bound tasks such as web crawling in particular, multithreading can greatly improve overall crawling efficiency.
4. The meaning of multiprocessing
We have already covered the basic idea of a process. More formally, a process is one running instance of a program, with some independent function, operating on a particular data set; it is an independent unit of resource allocation and scheduling in the system.
As the name suggests, multiprocessing means running multiple processes at the same time. Because a process is a collection of threads and consists of one or more threads, multiprocessing means that the number of threads running at the same time is greater than or equal to the number of processes.
5. Multithreading and multiprocessing in Python
In Python (more precisely, in the standard CPython interpreter), the GIL means that, whether on a single core or on multiple cores, only one thread in a process can run at any given moment, so Python multithreading cannot take advantage of multi-core parallelism.
GIL stands for Global Interpreter Lock. It was originally designed for data safety.
In Python multithreading, each thread executes in the following three steps.
- Acquire the GIL.
- Execute the code of the corresponding thread.
- Release the GIL.
So a thread must hold the GIL in order to execute. We can think of the GIL as a pass, and each Python process has only one. A thread that cannot get the pass is not allowed to execute. As a result, even on multiple cores, only one of the threads in a Python process can execute at any given moment.
With multiple processes, each process has its own GIL, so on a multi-core processor, running multiple processes is not affected by the GIL. In other words, multiprocessing can make better use of multiple cores.
For I/O-bound tasks such as crawling, however, multithreading and multiprocessing perform about the same. For compute-intensive tasks, because of the GIL, Python multithreading on multiple cores may even run slower overall than on a single core, whereas Python multiprocessing can run several times faster on multiple cores than on a single core.
Overall, Python multiprocessing has more advantages than multithreading, so if conditions permit, prefer multiprocessing.
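For a compute-intensive job, spreading the work over processes might look like the following sketch, which uses the standard `concurrent.futures.ProcessPoolExecutor`. The `count_primes` function and the chosen limits are illustrative assumptions, not part of the original text. The `if __name__ == "__main__":` guard is needed because worker processes may re-import the script.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [50_000, 60_000, 70_000, 80_000]
    # Each limit is handled by a separate worker process with its own GIL,
    # so the jobs can run on different cores.
    with ProcessPoolExecutor() as pool:
        for limit, total in zip(limits, pool.map(count_primes, limits)):
            print(f"primes below {limit}: {total}")
```

By default the pool picks a worker count based on the number of available processors, so how much faster this runs than a sequential loop depends on the machine.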
Note that because a process is an independent unit of resource allocation and scheduling, processes do not share data directly. For example, multiple processes cannot share a global variable; sharing data between processes requires a separate mechanism.