一些爬虫相关的签名、验证码破解

Overview

cracking4crawling

一些爬虫相关的签名、验证码破解,目前已有脚本:

说明:

脚本按目标网站、App命名归档,每个脚本一般都是可以单独导入使用(除非调用了额外的用于加解密的js文件),使用方法可阅读文档或参考其中的test函数。

使用方法:

小红书

小红书App接口签名(shield)

shield是小红书App接口主要的签名,由path、params、xy_common_params、xy_platform_info、data拼接并加密生成。原始加密在libshield.so中,已用python复现。

from urllib import parse

from xiaohongshu.shield import get_sign

# 对接口路径、url参数、header中的xy-common-params、xy-platform-info、请求的data进行签名
path = '/api/sns/v4/note/user/posted'

params = parse.urlencode({'user_id': '5eeb209d000000000101d84a'})

xy_common_params = parse.urlencode({})
    
xy_platform_info = parse.urlencode({})

data = parse.urlencode({})

# 生成签名
sign = get_sign(path=path, 
                params=params, 
                xy_common_params=xy_common_params, 
                xy_platform_info=xy_platform_info,
                data=data)
print(sign)

小红书滑块(数美)验证破解

小红书使用数美滑块验证码,验证过程(获取验证码配置>获取验证码>提交验证)在数美的服务器(数美使用organization来识别被验证的网站、App)上进行,完成后将通过的rid提交到小红书的接口。

具体实现细节:

  • 协议更新:数美会定期自动更新js和接口参数字段(接口里所有两个字母组成的字段名都会在更新修改),通过"/ca/v1/conf"接口返回的js路径可以判断协议版本(如"/pr/auto-build/v1.0.1-33/captcha-sdk.min.js",表示协议版本号为33),脚本会加载js,并通过匹配确认字段名,用于后续的接口请求。
  • 验证参数:验证主要需要三个参数:位移比率、时间、轨迹,使用opencv中的matchTemplate函数计算距离,并随机生成相应的轨迹。
  • 调用加密:提交验证的主要参数都需要加密,使用DES加密。
  • 加密过程:"/ca/v1/register"接口会返回一个参数k,使用"sshummei"作为key对它解密,结果为加密参数所需的key,再对参数进行加密。

注:当前的验证参数全部按照小红书App调整,用于其他验证(如小红书Web或其他网站、App),可能需要调整其中参数。

from xiaohongshu.shumei_slide_captcha import get_verify

# 表示小红书
organization = 'eR46sBuqF0fdw7KWFLYa'

# rid是验证过程中响应的标示,r是最后提交验证返回的响应
rid, r = get_verify(organization)

print(rid, r)

# riskLevel为PASS说明验证通过
if r['riskLevel'] == 'PASS':
    # 这里需要向小红书提交rid
    # 具体可抓包查看,接口:/api/sns/v1/system_service/slide_captcha_check
    pass

海南航空

海南航空App接口签名(hnairSign)

签名对象主要是请求的data,取common、data下的全部参数,按字典序排序进行拼接(list、dict类型不参与拼接),结尾加上slat,进行HMAC_SHA1加密生成。

注:"/user/"下的接口加签时,会在拼接的内容前加上token,同时HMAC_SHA1加密会使用服务器返回的secret

from hnair.hna_signature

# 对请求的data进行签名
data = {
    'common': {
        # common的内容
    },
    'data': {
        'adultCount': 1,
        'cabins': ['*'],
        'childCount': 0,
        'depDate': '2020-12-09',
        'dstCode': 'PEK',
        'infantCount': 0,
        'orgCode': 'YYZ',
        'tripType': 1,
        'type': 3
    }
}

# /user/ 路径下的接口需要登录,同时加签要传入token、secret(都由服务器返回)
# token = ''
# secret = ''

# 生成签名
sign = get_sign(data=data)
print(sign)
Owner
XNFA
XNFA
An automated, headless YouTube Watcher and Scraper

Searches YouTube, queries recommended videos and watches them. All fully automated and anonymised through the Tor network. The project consists of two independently usable components, the YouTube aut

44 Oct 18, 2022
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021
Extract embedded metadata from HTML markup

extruct extruct is a library for extracting embedded metadata from HTML markup. Currently, extruct supports: W3C's HTML Microdata embedded JSON-LD Mic

Scrapinghub 725 Jan 03, 2023
This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fething some items information on Amazon. - Scrapy Framework in Python is

Khaled Tofailieh 4 Feb 13, 2022
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

Lakhdar Belkharroubi 5 Oct 17, 2022
Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and written in Python

499 Dec 09, 2022
Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye

Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye, you can search with various keywords and usernames on Twitter.

Jolanda de Koff 19 Dec 12, 2022
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
A dead simple crawler to get books information from Douban.

Introduction A dead simple crawler to get books information from Douban. Pre-requesites Python 3 Install dependencies from requirements.txt (Optional)

Yun Wang 1 Jan 10, 2022
Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

COVID19-WEB-SCRAPER Open Source Tech Lab - Project [SEMESTER IV] OSTL Assignments OSTL Assignments - 1 OSTL Assignments - 2 Project COVID19 India Data

AMEY THAKUR 8 Apr 28, 2022
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
京东茅台抢购 2021年4月最新版

Jd_Seckill 特别声明: 本仓库发布的jd_seckill项目中涉及的任何脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合法性,准确性,完整性和有效性,请根据情况自行判断。 本项目内所有资源文件,禁止任何公众号、自媒体进行任何形式的转载、发布。 huanghyw 对任何脚本问题概不

45 Dec 14, 2022
Google Scholar Web Scraping

Google Scholar Web Scraping This is a python script that asks for a user to input the url for a google scholar profile, and then it writes publication

Suzan M 1 Dec 12, 2021
Demonstration on how to use async python to control multiple playwright browsers for web-scraping

Playwright Browser Pool This example illustrates how it's possible to use a pool of browsers to retrieve page urls in a single asynchronous process. i

Bernardas Ališauskas 8 Oct 27, 2022