Crawler Fundamentals: Basic Principles of Proxies
2022-07-19 07:15:00 【W_ chuanqi】
About the author
Author: Hello everyone, I'm W_chuanqi, a programming enthusiast.
Home page: W_chaunqi
A few words to share: "If you are in the mire and your heart is in the mire too, then everything you see is muddy; if you are in the mire but your heart longs for the Kunpeng, then you can see ninety thousand li of heaven and earth."

Table of contents
Chapter 1 Crawler Basics
1.5 Basic Principles of Proxies
When writing a crawler, you often run into the following situation: at first the crawler runs normally and captures data normally, and everything looks fine. Then, after about the time it takes to drink a cup of tea, errors start to appear, for example 403 Forbidden. If you open the web page at that point, you may see a message such as "Your IP access frequency is too high". This happens because the website has taken anti-crawler measures. For example, the server counts the number of requests from each IP per unit of time; if the count exceeds a set threshold, it refuses service outright and returns an error message. This is commonly called IP blocking.
Since the server counts requests per IP per unit of time, could we disguise our IP in some way, so that the server cannot tell that the requests come from our own machine, and thereby avoid being blocked?
One effective way to do this is to use a proxy. How to use proxies is described in detail later. Before that, you need to understand the basic principle of a proxy: how does it disguise an IP?
1. Basic principle
A proxy here means a proxy server. Its job is to fetch network information on behalf of network users; put more vividly, it is a transit station for network information. When a client requests a website normally, it sends the request to the Web server, and the Web server sends the response back to the client. Setting up a proxy server builds a bridge between the client and the server. The client no longer sends its request directly to the Web server. Instead, it sends the request to the proxy server, the proxy server forwards the request to the Web server, and the response returned by the Web server is forwarded back to the client by the proxy server. The client can still access the web page normally, but the IP that the Web server sees is no longer the client's real IP but the proxy's, so the client's IP is disguised. This is the basic principle of a proxy.
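The forwarding flow described above can be tried out entirely on one machine. The following is a minimal sketch, not a production proxy: `OriginHandler`, `ForwardingProxyHandler`, `start`, `fetch`, and `demo` are hypothetical names invented for this illustration, only plain HTTP GET is handled, and the `Via` value `1.1 demo-proxy` is made up. The "Web server" reports the peer port it sees and any `Via` header, so you can observe that a proxied request reaches it over the proxy's connection rather than the client's.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


class OriginHandler(BaseHTTPRequestHandler):
    """Plays the Web server: replies with the peer port it sees and any Via header."""

    def do_GET(self):
        body = f"{self.client_address[1]} {self.headers.get('Via', 'none')}".encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


class ForwardingProxyHandler(BaseHTTPRequestHandler):
    """Plays the proxy: re-sends the client's request over its own connection."""

    def do_GET(self):
        target = urlsplit(self.path)  # a proxied request line carries the absolute URL
        upstream = http.client.HTTPConnection(target.hostname, target.port or 80, timeout=5)
        upstream.request("GET", target.path or "/", headers={"Via": "1.1 demo-proxy"})
        response = upstream.getresponse()
        body = response.read()
        upstream.close()
        self.send_response(response.status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass


def start(handler):
    """Start a server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url, proxy=None):
    """GET url directly or via proxy=(host, port); return (own_port, seen_port, via)."""
    target = urlsplit(url)
    if proxy:
        conn = http.client.HTTPConnection(*proxy, timeout=5)
        conn.request("GET", url)  # the absolute URL tells the proxy where to go
    else:
        conn = http.client.HTTPConnection(target.hostname, target.port, timeout=5)
        conn.request("GET", target.path or "/")
    own_port = conn.sock.getsockname()[1]  # local port of the client's own socket
    seen_port, via = conn.getresponse().read().decode().split(" ", 1)
    conn.close()
    return own_port, int(seen_port), via


def demo():
    """Fetch the same URL directly and through the proxy, then stop both servers."""
    origin, proxy = start(OriginHandler), start(ForwardingProxyHandler)
    url = f"http://127.0.0.1:{origin.server_address[1]}/"
    direct = fetch(url)
    proxied = fetch(url, proxy=proxy.server_address)
    for server in (origin, proxy):
        server.shutdown()
        server.server_close()
    return direct, proxied
```

Calling `demo()` returns one tuple for the direct request and one for the proxied request. In the direct case the port the server sees is the client's own port and there is no `Via` header. In the proxied case the server is talking to the proxy's socket and receives the header the proxy added. On a real network the same thing happens at the IP level: the Web server's peer address is the proxy's IP.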
2. What proxies are used for
What can a proxy do? The main uses are listed below.
- Get around access restrictions placed on your own IP, so that you can visit sites you normally cannot reach.
- Access the internal resources of certain organizations or groups. For example, with a free proxy server inside the address range of the education network, you can upload to and download from the FTP servers open to the education network, and look up and share all kinds of materials.
- Improve access speed. A proxy server usually sets up a large disk buffer. When information from outside passes through it, the information is also saved in the buffer; when other users request the same information, it is served directly from the buffer, which speeds up access.
- Hide the real IP. Internet users can hide their own IP behind a proxy to protect themselves from attack. For a crawler, the point of using a proxy is to hide its own IP and keep that IP from being blocked.
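The buffering idea in the list above can be modeled in a few lines. This is only a toy sketch: `CachingProxy` and `fetch_from_origin` are hypothetical names, the cache is an in-memory dictionary rather than a disk buffer, and a real caching proxy would also have to handle expiry and cache-control rules.

```python
class CachingProxy:
    """Toy model of a caching proxy: remembers response bodies keyed by URL."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is any callable(url) -> body that does the real fetch.
        self.fetch_from_origin = fetch_from_origin
        self.cache = {}
        self.origin_hits = 0  # how many times the origin was actually contacted

    def get(self, url):
        if url not in self.cache:
            # First request for this URL: go to the origin and keep a copy.
            self.origin_hits += 1
            self.cache[url] = self.fetch_from_origin(url)
        # Later requests for the same URL are served from the cache.
        return self.cache[url]
```

If two users ask for the same URL through the same `CachingProxy`, the origin is contacted once and the second answer comes from the stored copy.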
3. Proxies for crawlers
A crawler fetches pages very quickly, so the same IP may end up visiting a site too frequently. When that happens, the website may ask for a verification code, require a login, or block the IP outright, all of which make crawling much harder.
A proxy hides the real IP and makes the server believe that the proxy server is the one making the request. If the crawler keeps switching proxies while it runs, it can avoid having its IP blocked and crawl effectively.
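Switching proxies during a crawl can be as simple as cycling through a list. The sketch below is one possible approach, not the only one: `rotating_proxies` is a hypothetical helper, the addresses come from a documentation-only IP range and are placeholders, and the mapping it yields follows the shape accepted by the `proxies` argument of the `requests` library.

```python
import itertools


def rotating_proxies(proxy_urls):
    """Yield one proxies mapping per request, cycling through proxy_urls forever."""
    for proxy_url in itertools.cycle(proxy_urls):
        # The same proxy is used for both plain and encrypted traffic.
        yield {"http": proxy_url, "https": proxy_url}


# Placeholder addresses (documentation range); replace with real proxies.
rotation = rotating_proxies([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
])

# With the requests library installed, each call would take the next proxy:
#     requests.get(url, proxies=next(rotation), timeout=10)
```

Round-robin rotation is the simplest policy. A crawler could just as well pick proxies at random or drop a proxy after it starts returning errors.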
4. Proxy classification
Proxies can be classified either by protocol or by degree of anonymity. Both classifications are summarized below.
• By protocol
By protocol, proxies fall into the following categories.
- FTP proxy: mainly used to access FTP servers. It usually supports upload, download, and caching. Typical ports are 21 and 2121.
- HTTP proxy: mainly used to access web pages. It usually supports content filtering and caching. Typical ports are 80, 8080, and 3128.
- SSL/TLS proxy: mainly used to access encrypted websites. It usually supports SSL or TLS encryption (up to 128-bit encryption strength). The typical port is 443.
- RTSP proxy: mainly used by RealPlayer to access Real streaming media servers. It usually supports caching. The typical port is 554.
- Telnet proxy: mainly used for Telnet remote control (hackers often use it to hide their identity when breaking into computers). The typical port is 23.
- POP3/SMTP proxy: mainly used to send and receive e-mail over POP3/SMTP. It usually supports caching. The typical ports are 110 and 25.
- SOCKS proxy: simply passes packets along without caring about the specific protocol or usage, so it is much faster. It usually supports caching. The typical port is 1080. The SOCKS protocol comes in two versions, SOCKS4 and SOCKS5. SOCKS4 supports only TCP, while SOCKS5 supports both TCP and UDP, along with various authentication mechanisms, server-side domain name resolution, and more. In short, whatever SOCKS4 can do, SOCKS5 can do too, but not everything SOCKS5 can do is possible with SOCKS4.
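In crawler code, the protocol of a proxy usually shows up as the scheme of its URL. The sketch below illustrates that under some assumptions: `parse_proxy` and `DEFAULT_PORTS` are hypothetical names, only a few of the protocols listed above are covered, the fallback ports are picked from the typical ports mentioned in the list (8080 for HTTP is an arbitrary choice among them), and the addresses are placeholders.

```python
from urllib.parse import urlsplit

# Fallback ports taken from the typical ports listed above (an assumption;
# a real proxy may listen on any port).
DEFAULT_PORTS = {"http": 8080, "https": 443, "socks4": 1080, "socks5": 1080}


def parse_proxy(proxy_url):
    """Split a proxy URL into protocol, host, and port, and note UDP support."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported proxy protocol: {parts.scheme!r}")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        # Per the list above, only SOCKS5 can carry UDP.
        "supports_udp": parts.scheme == "socks5",
    }
```

For example, `parse_proxy("socks5://203.0.113.10")` falls back to port 1080 and reports UDP support, while an explicit port in the URL always takes precedence over the fallback.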
• By degree of anonymity
By degree of anonymity, proxies fall into the following categories.
- High-anonymity proxy: forwards packets unchanged, so to the server it looks as if an ordinary client is accessing it. The IP the server records is the proxy server's IP.
- Ordinary anonymous proxy: makes some changes to the packets, so the server may discover that a proxy server is accessing it, and there is some chance that it can trace the client's real IP. The HTTP headers a proxy server usually adds are HTTP_VIA and HTTP_X_FORWARDED_FOR.
- Transparent proxy: not only changes the packets but also tells the server the client's real IP. Apart from using caching to speed up browsing and content filtering to improve security, it has no other significant effect. The most common example is a hardware firewall on an intranet.
- Spy proxy: a proxy server set up by an organization or individual to record the data users transmit, in order to study or monitor it.
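The first three anonymity levels can be told apart roughly by what the target server receives. The sketch below is a heuristic under stated assumptions, not a standard test: `classify_anonymity` is a hypothetical function, it looks only at the `Via` and `X-Forwarded-For` request headers (which server-side code exposes under the names HTTP_VIA and HTTP_X_FORWARDED_FOR), and the IPs in the example are placeholders. A spy proxy cannot be detected this way, because recording traffic leaves no trace in the headers.

```python
def classify_anonymity(headers, real_ip):
    """Guess a proxy's anonymity level from the headers the target server saw.

    headers: mapping of request header names to values, as received by the server.
    real_ip: the client's real IP, known to whoever is running the test.
    """
    seen = {name.lower(): value for name, value in headers.items()}
    forwarded_for = seen.get("x-forwarded-for", "")
    forwarded_ips = [ip.strip() for ip in forwarded_for.split(",")]

    if real_ip in forwarded_ips:
        return "transparent"      # the proxy disclosed the real IP
    if "via" in seen or forwarded_for:
        return "anonymous"        # proxy use is visible, real IP is not
    return "high-anonymity"       # nothing reveals that a proxy was used


# Example: the server saw a Via header but no trace of the real IP.
level = classify_anonymity({"Via": "1.1 some-proxy"}, real_ip="198.51.100.7")
```

In practice you would send a request through the proxy to a server you control (or to an echo service), collect the headers it received, and pass them to a function like this one.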
5. Common ways to obtain proxies
The common options are as follows.
- Free proxies found online. High-anonymity proxies are the best choice. Crawl a list of proxies, filter out the ones that actually work before using them, and optionally go further and maintain a proxy pool.
- Paid proxy services. Many providers on the Internet sell proxies, and their quality is much better than that of free proxies.
- ADSL dial-up. Each redial gives you a new IP. This is highly stable and a relatively effective way to avoid blocking.
- Cellular proxies, that is, proxies built from 4G or 5G network cards and similar hardware. Because cellular networks are rarely used as proxies, the overall chance of being blocked is lower, but building cellular proxies is expensive.
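Filtering the usable proxies out of a crawled list, as suggested for free proxies above, might look like the following. This is a sketch under assumptions: `filter_working` and `http_check` are hypothetical helpers, the test URL `http://example.com/` and the timeout are arbitrary choices, and a real proxy pool would re-check its proxies periodically and store the results somewhere persistent.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_check(proxy_url, test_url="http://example.com/", timeout=5):
    """Return True if test_url can be fetched through proxy_url (plain HTTP only)."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(test_url, timeout=timeout) as response:
        return response.status == 200


def filter_working(proxy_urls, check=http_check, max_workers=8):
    """Return the proxies for which check(proxy_url) succeeds, in the original order."""
    proxy_urls = list(proxy_urls)

    def safe_check(proxy_url):
        try:
            return bool(check(proxy_url))
        except Exception:
            # Timeouts, refused connections, and so on all count as "not usable".
            return False

    # Checks are network-bound, so run them concurrently in threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_check, proxy_urls))
    return [url for url, ok in zip(proxy_urls, results) if ok]
```

The `check` parameter is injectable so that the filtering logic can be exercised without network access, and so that a stricter check (for example one that also verifies anonymity) can be swapped in later.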