Fundamentals of crawlers - basic principles of multithreading and multiprocessing
2022-07-19 07:15:00 【W_ chuanqi】
Personal profile
About the author: Hello everyone, I am W_chuanqi, a programming enthusiast.
Personal home page :W_chaunqi
A thought to share: "If you are in the mire and your heart is in the mire too, then everything you see is mud; if you are in the mire but your heart longs for Kunpeng, then you can see ninety thousand miles of heaven and earth."

Chapter 1 Crawler basics
1.6 Basic principles of multithreading and multiprocessing
On a computer, we can run several programs at the same time: browsing the web, listening to music, and typing, for example. This seems perfectly normal, but why can a computer run so many programs at once? The answer involves two terms: multiprocessing and multithreading.
Likewise, when writing crawlers, we may run multiple crawling tasks at the same time to improve crawling efficiency, which also involves multiprocessing and multithreading.
1. What multithreading means
To talk about multithreading, we first have to explain what a thread is, and to talk about threads, we first have to explain what a process is.
A process can be understood as a program unit that can run independently. For example, opening a browser starts a browser process, and opening a text editor starts a text editor process. A single process can handle many things at once: in a browser process, you can open several pages in several tabs, some playing music, some playing video, and some playing animations. These tasks run at the same time without interfering with one another. Why can so many tasks run simultaneously? This leads to the concept of a thread: each task in fact corresponds to a thread.
A process is a collection of threads: it is made up of one or more threads. A thread is the smallest unit that the operating system schedules for execution and the smallest running unit within a process. In the browser process above, playing music is one thread and playing video is another. Many other threads run in the browser process as well, and these threads execute concurrently or in parallel so that the whole browser can run multiple tasks at the same time.
Once the concept of a thread is clear, multithreading is easy to understand: it is the simultaneous execution of multiple threads within one process. The browser process above is a typical multithreaded program.
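As a rough illustration of several threads running inside one process, here is a minimal sketch using Python's standard `threading` module. The task names and the helper functions `play` and `run_tasks` are made up for this example, and `time.sleep` merely stands in for real work.

```python
import threading
import time


def play(task_name, results):
    """Pretend to do some work for one task, then record which thread ran it."""
    time.sleep(0.1)  # stand-in for real work such as decoding audio or video
    results[task_name] = threading.current_thread().name


def run_tasks(task_names):
    """Start one thread per task and wait for all of them to finish."""
    results = {}
    threads = [
        threading.Thread(target=play, args=(name, results), name=f"worker-{name}")
        for name in task_names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for task, thread_name in run_tasks(["music", "video", "animation"]).items():
        print(f"{task} ran in thread {thread_name}")
```

All three threads belong to the same Python process, which is why they can write into the same `results` dictionary.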
2. Concurrency and parallelism
When discussing multiprocessing and multithreading, two more terms need to be introduced: concurrency and parallelism. As we know, when a program runs on a computer, it is the processor executing instructions underneath.
A processor (more precisely, a single core) can execute only one instruction at a time. Concurrency means that the instructions of multiple threads are executed in rapid rotation. For example, a processor executes thread A for a while, then thread B for a while, then switches back to thread A for a while. Because the processor executes instructions and switches threads very quickly, people do not notice that the computer is switching between the contexts of multiple threads, so the threads appear to run at the same time. At the micro level, however, the processor keeps switching among the threads, and each thread occupies a time slice of the processor whenever it runs, so only one thread is actually executing at any given moment.
Parallelism means that multiple instructions are executed on multiple processors at the same moment, so parallelism necessarily depends on having multiple processors. Whether viewed at the macro or the micro level, the threads really are executing at the same time.
Parallelism can exist only in multiprocessor systems, so a computer whose processor has a single core cannot achieve parallelism. Concurrency can exist in both single-processor and multiprocessor systems, because a single core is enough to achieve concurrency.
For example, suppose the system needs to run multiple threads at the same time. If the processor has only one core, it can run these threads only concurrently. If the processor has multiple cores, then while one core executes one thread, another core can execute another thread, and the two threads run in parallel. Of course, other threads may share a core with one of these threads, in which case the threads on that core execute concurrently. Exactly how this happens depends on how the operating system schedules the threads.
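The rotation described above can be mimicked with a toy model. The sketch below is not how an operating system schedules real threads; it only assumes that each "thread" is a generator yielding one unit of work per time slice, and that a hypothetical `round_robin` function plays the role of a single core.

```python
from collections import deque


def thread_steps(name, n_steps):
    """A toy 'thread': a generator that yields after each unit of work."""
    for step in range(1, n_steps + 1):
        yield f"{name}{step}"


def round_robin(tasks):
    """Run the toy threads on one simulated core, one time slice each in turn."""
    queue = deque(tasks)
    timeline = []
    while queue:
        task = queue.popleft()
        try:
            timeline.append(next(task))  # run one time slice
            queue.append(task)           # then go to the back of the queue
        except StopIteration:
            pass                         # this toy thread has finished
    return timeline


if __name__ == "__main__":
    # prints ['A1', 'B1', 'A2', 'B2', 'A3']
    print(round_robin([thread_steps("A", 3), thread_steps("B", 2)]))
```

The timeline interleaves A and B, yet at every position only one of them is running, which is the defining property of concurrency on a single core.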
3. Scenarios suited to multithreading
While a program runs, some operations are time-consuming or involve waiting, such as waiting for a database query to return or waiting for a web page to respond. With a single thread, the processor must wait for these operations to finish before doing anything else, even though it could clearly be doing other work during the wait. With multithreading, the processor can execute other threads while one thread is waiting, which improves overall execution efficiency.
Many situations resemble this one, in which a thread has to wait during execution. A web crawler is a typical example: after the crawler sends a request to the server, it has to wait some time for the server to return a response. Such a task is an IO-intensive task. For tasks like this, if we enable multithreading, the processor can handle other threads while one thread is waiting, which improves overall crawling efficiency.
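To make the waiting scenario concrete, here is a small sketch that compares fetching pages one by one with fetching them through a thread pool. It makes no real network requests: `fake_fetch` is a hypothetical stand-in that just sleeps, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.2):
    """Stand-in for an HTTP request: sleep to imitate waiting for the server."""
    time.sleep(delay)
    return f"response from {url}"


def crawl_sequential(urls):
    """Fetch the pages one after another in a single thread."""
    return [fake_fetch(url) for url in urls]


def crawl_threaded(urls, max_workers=5):
    """Fetch the pages in a thread pool; while one thread waits, others proceed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    _, t_seq = timed(crawl_sequential, urls)
    _, t_thr = timed(crawl_threaded, urls)
    print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

With five simulated requests of 0.2 seconds each, the sequential version should take roughly one second, while the threaded version should finish in about the time of a single request. Exact timings vary by machine.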
Not all tasks are IO-intensive, however. Another kind is the compute-intensive task, also called a CPU-intensive task. As the name suggests, such a task needs the processor the whole time it runs. Suppose we enable multithreading and the processor switches from one compute-intensive task to another: the processor never rests and is always busy computing, so no time is saved overall, because the total amount of work is unchanged. If there are too many threads, even more time is spent on thread switching, and overall efficiency drops.
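For contrast, the sketch below runs a purely computational job both sequentially and in threads. The prime-counting function and the input sizes are arbitrary choices for illustration. No timings are quoted because they depend on the machine and interpreter, but with standard Python threads the threaded run is not expected to be noticeably faster.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Pure computation: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    """Run every job one after another in the current thread."""
    return [count_primes(limit) for limit in limits]


def run_threaded(limits):
    """Run one thread per job; the total amount of computation is unchanged."""
    with ThreadPoolExecutor(max_workers=len(limits)) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    limits = [20000] * 4
    for runner in (run_sequential, run_threaded):
        start = time.perf_counter()
        runner(limits)
        print(f"{runner.__name__}: {time.perf_counter() - start:.2f}s")
```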
In summary, if the tasks are not all compute-intensive, multithreading can improve the overall execution efficiency of a program. For IO-intensive tasks such as web crawling in particular, multithreading can greatly improve overall crawling efficiency.
4. What multiprocessing means
We have already covered the basic concept of a process. A process is one running activity of a program with certain independent functions on a certain data set, and it is an independent unit of resource allocation and scheduling in the system.
As the name suggests, multiprocessing means running multiple processes at the same time. Because a process is a collection of one or more threads, running multiple processes means that at least as many threads as processes are running at the same time.
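A minimal sketch of starting several processes with the standard `multiprocessing` module follows. The helper names `describe_worker` and `worker` and the task labels are invented for this example. Each child should report a process ID different from that of the main process.

```python
import multiprocessing
import os


def describe_worker(label):
    """Return a short description of the process that runs this function."""
    return f"{label} runs in process {os.getpid()}"


def worker(label):
    """Entry point executed inside each child process."""
    print(describe_worker(label))


if __name__ == "__main__":
    print(describe_worker("main"))
    processes = [
        multiprocessing.Process(target=worker, args=(f"task-{i}",))
        for i in range(3)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

The `if __name__ == "__main__":` guard matters here: on platforms that start child processes by re-importing the script, it keeps the children from spawning further processes.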
5. Multithreading and multiprocessing in Python
In Python, the GIL means that, whether on a single core or on multiple cores, only one thread in a process can run at any moment, so Python multithreading cannot take advantage of multi-core parallelism.
GIL stands for Global Interpreter Lock. It was designed for data safety.
In Python multithreading, each thread executes in the following three steps.
- Obtain the GIL.
- Execute the code of the corresponding thread.
- Release the GIL.
As this shows, a thread must obtain the GIL before it can execute. We can think of the GIL as a pass, and there is only one GIL in a Python process. A thread that cannot get the pass is not allowed to execute. As a result, even on multiple cores, the threads in a Python process can execute only one at a time.
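The "pass" analogy can be modelled with an ordinary lock. The sketch below is only an analogy and not the interpreter's actual GIL implementation: `pass_token` is a hypothetical stand-in for the GIL, and the counters exist solely to check that no two threads ever hold the pass at once.

```python
import threading

pass_token = threading.Lock()  # hypothetical stand-in for the GIL
holders = 0                    # how many threads hold the pass right now
max_holders = 0                # highest value of holders ever observed


def run_slice():
    """One 'obtain, execute, release' cycle of a thread."""
    global holders, max_holders
    pass_token.acquire()       # 1. obtain the pass
    try:
        holders += 1           # 2. execute this thread's code
        max_holders = max(max_holders, holders)
        holders -= 1
    finally:
        pass_token.release()   # 3. release the pass


def worker(n_slices):
    """A thread that repeatedly asks for the pass."""
    for _ in range(n_slices):
        run_slice()


def simulate(n_threads=4, n_slices=1000):
    """Run several threads and report the most pass holders seen at once."""
    threads = [
        threading.Thread(target=worker, args=(n_slices,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_holders


if __name__ == "__main__":
    print("most threads holding the pass at once:", simulate())
```

However many threads are started, the reported maximum stays at 1, mirroring the statement that only one thread in a Python process executes at a time.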
With multiple processes, each process has its own GIL, so on a multi-core processor, running multiple processes is not affected by the GIL. In other words, multiprocessing can make better use of multiple cores.
For IO-intensive tasks such as crawling, however, multithreading and multiprocessing make little difference. For compute-intensive tasks, because of the GIL, Python multithreading on multiple cores may even run less efficiently overall than on a single core, whereas Python multiprocessing on multiple cores can run several times more efficiently than on a single core.
Overall, Python multiprocessing has more advantages than multithreading, so use multiple processes where conditions permit.
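For compute-intensive work, one common way to use several processes is a process pool from the standard `concurrent.futures` module. The sketch below assumes an arbitrary CPU-bound function, `sum_of_squares`, chosen only for illustration. How much faster it runs than a single process depends on the number of cores and the size of each job.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """A small CPU-bound job: add up i * i for all i below n."""
    return sum(i * i for i in range(n))


def run_in_processes(inputs, max_workers=None):
    """Distribute the jobs over a pool of worker processes."""
    # max_workers=None lets the pool pick a default based on the machine.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sum_of_squares, inputs))


if __name__ == "__main__":
    print(run_in_processes([10, 100, 1000]))
```

Because every worker is a separate process with its own interpreter, the jobs can run on different cores at the same time.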
Note that because a process is an independent unit of resource allocation and scheduling, data cannot be shared directly between processes. For example, multiple processes cannot share a global variable, and sharing data between processes requires a separate mechanism.