Crawler -> TpImgspider
2022-07-26 07:29:00 【Kun Li】
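Web crawling is widely used by people who work on e-commerce algorithms, and as a technique in its own right it is very practical. I think an algorithm engineer should know how to write a crawler, or at least how to scrape basic web pages. Crawling itself is not hard, and I rarely scrape very complex sites myself. The work falls into two broad categories: scraping static pages and scraping dynamic AJAX pages. I wrote a bit of code for a few sites that I often scrape for visual-creative work.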
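GitHub - leeguandong/TpImgspider: an image-scraping tool
https://github.com/leeguandong/TpImgspider

On the technical side, it mostly comes down to requests and XPath. The usual first step is to open the browser's Network panel and look at the XHR entries, which is where asynchronous AJAX requests show up. Most stock-asset sites have moved to AJAX by now. If you only need thumbnails, the static page is sometimes enough: with a cookie added you can scrape all of them, which is generally sufficient for training data. Among the XHR entries, look for the link that returns JSON. The Preview tab usually has some parameters tucked away in it, and joining those parameters onto the main XHR link gives you the link that returns the JSON. Front-end and back-end frameworks generally exchange data as JSON, but being able to reach it this way is the ideal case.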
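The AJAX route described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the endpoint `https://example.com/api/search`, the parameter names `keyword` and `page`, and the JSON layout (a `data` list whose items carry a `thumb` field) are all hypothetical placeholders. On a real site you would read the actual values from the XHR entries in the Network panel.

```python
from urllib.parse import urlencode


def build_xhr_url(base_url, params):
    """Join the main XHR link with the parameters found in the Preview tab."""
    return f"{base_url}?{urlencode(params)}"


def extract_image_urls(payload, list_key="data", url_key="thumb"):
    """Pull image links out of a decoded JSON response.

    list_key and url_key are assumptions about the JSON layout;
    items without url_key are skipped.
    """
    return [item[url_key] for item in payload.get(list_key, []) if url_key in item]


def fetch_json(url, cookie=None):
    """Request the XHR link and decode the JSON body."""
    # Third-party dependency, imported here so that the two helpers
    # above work with the standard library alone.
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}
    if cookie:
        headers["Cookie"] = cookie  # cookie string copied from the browser
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Hypothetical endpoint and parameters, for illustration only.
    url = build_xhr_url("https://example.com/api/search", {"keyword": "poster", "page": 1})
    print(url)
    sample = {"data": [{"thumb": "https://example.com/a.jpg"}, {"id": 2}]}
    print(extract_image_urls(sample))
```

Paging through results then usually amounts to calling `build_xhr_url` with an increasing `page` value and passing each URL to `fetch_json`, assuming the site really does page that way.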

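Of course, many sites today show no obvious pattern for how these links are assembled, so the usual fallback is to render the page with selenium and scrape the result: after rendering, get the element links with find_elements_by_xpath, then download them with requests. selenium no longer supports PhantomJS, and headless Chrome also needs a matching webdriver. I could never get the Chrome webdriver version to match, so I use Firefox. That is not complicated either: just pass the path of the webdriver executable directly.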
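from selenium import webdriver
# executable_path points at the local geckodriver binary (the Firefox webdriver)
driver = webdriver.Firefox(executable_path=r'F:\Dataset\qiantu\geckodriver-v0.31.0-win64\geckodriver.exe')
driver.get(self.url)

In addition, it is generally best to attach a cookie when crawling a site, otherwise crawling will be restricted. As for storage, I usually just save the images and their links.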
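Putting the Selenium route, the cookie, and the storage step together might look like the sketch below. It is an illustration under stated assumptions rather than the repository's code: the function names, the `//img` XPath, the `links.csv` file name, and the fallback file name are all made up for the example. It also uses the Selenium 4 style (`Service` and `find_elements(By.XPATH, ...)`), because recent Selenium 4 releases dropped the `find_elements_by_xpath` helpers and the `executable_path` argument; on an older Selenium 3 install the calls shown in this post still apply.

```python
import csv
import os
from urllib.parse import urlparse


def collect_image_links(page_url, driver_path, xpath="//img"):
    """Render the page in Firefox and return the src of every matched element."""
    # Third-party imports are local so the storage helpers below
    # can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.service import Service

    driver = webdriver.Firefox(service=Service(executable_path=driver_path))
    try:
        driver.get(page_url)
        elements = driver.find_elements(By.XPATH, xpath)
        srcs = (element.get_attribute("src") for element in elements)
        return [src for src in srcs if src]
    finally:
        driver.quit()  # always close the browser, even on errors


def filename_from_url(url, index):
    """Derive a local file name from the URL path, falling back to the index."""
    name = os.path.basename(urlparse(url).path)
    return name or f"image_{index}.jpg"


def save_image(content, url, index, out_dir):
    """Write image bytes into out_dir and return the saved path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename_from_url(url, index))
    with open(path, "wb") as f:
        f.write(content)
    return path


def download_all(links, cookie, out_dir):
    """Download each link with the cookie attached and log url -> file in a CSV."""
    import requests  # third-party

    headers = {"User-Agent": "Mozilla/5.0", "Cookie": cookie}
    os.makedirs(out_dir, exist_ok=True)
    csv_path = os.path.join(out_dir, "links.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "file"])
        for index, url in enumerate(links):
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            writer.writerow([url, save_image(resp.content, url, index, out_dir)])
```

The cookie string is the one copied from the browser's request headers after logging in. Keeping the link list in a CSV next to the images makes it possible to re-download or deduplicate later without rendering the pages again.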