LSpider: a front-end crawler tailored for passive scanners

Overview

What is LSpider?

A front-end crawler built for passive scanners.

It consists of five parts: Chrome Headless, the LSpider controller, a MySQL database, RabbitMQ, and a passive scanner.

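(1) Built on Chrome Headless. Its core principle is simulating clicks and triggering events, and it exports traffic to the passive scanner through a configured proxy.

(2) Crawls divergently using built-in tasks plus a subdomain API, aiming to trigger as much traffic on the target domain as possible.

(3) Manages tasks through RabbitMQ and supports a large number of threads working on tasks concurrently.

(4) Fills in and submits forms intelligently.

(5) Detects login forms heuristically and reports them to the user, who can then complete the login by adding a cookie.

(6) Provides a custom Webhook interface so that statistics can be pushed to WeChat via Webhook.

(7) Ships with built-in HackerOne and Bugcrowd crawlers. Given an account, they can fetch the full scope of a target in one step.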
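
As a rough illustration of point (1), the sketch below assembles Chrome command-line switches that run the browser headless and send its traffic through a proxy. The helper name `build_chrome_args` is hypothetical and not taken from LSpider's code, and the address `127.0.0.1:7777` is only an assumed example of where a passive scanner might listen. `--ignore-certificate-errors` is included on the assumption that the scanner intercepts HTTPS with its own certificate.

```python
def build_chrome_args(proxy="127.0.0.1:7777"):
    """Return Chrome switches for a headless browser behind a proxy.

    `proxy` is an assumed host:port of the passive scanner's listener.
    """
    return [
        "--headless",                      # run without a visible window
        f"--proxy-server=http://{proxy}",  # route all traffic through the scanner
        "--ignore-certificate-errors",     # tolerate the scanner's HTTPS interception
    ]


if __name__ == "__main__":
    for arg in build_chrome_args():
        print(arg)
```

With Selenium, each switch could be passed to the browser through `ChromeOptions.add_argument`.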

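Why choose LSpider?

LSpider is a crawler built specifically for passive scanners, and many of its features exist to serve them.

The RabbitMQ-based task management system is quite stable and can keep a divergent crawl running unattended for long periods.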
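
To make the idea of a queue-driven divergent crawl concrete, here is a minimal sketch in which every discovered link becomes a new task for a pool of worker threads. It is not LSpider's implementation: Python's in-process `queue.Queue` stands in for RabbitMQ, and `FAKE_LINKS` is a made-up link graph that replaces real page fetching.

```python
import queue
import threading

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_LINKS = {
    "http://a.example/": ["http://a.example/login", "http://b.example/"],
    "http://a.example/login": ["http://a.example/"],
    "http://b.example/": ["http://b.example/api"],
    "http://b.example/api": [],
}


def crawl(seeds, worker_count=5):
    """Visit every URL reachable from the seeds using a pool of worker threads."""
    tasks = queue.Queue()        # stand-in for a RabbitMQ queue
    seen = set(seeds)            # URLs already queued, so nothing is crawled twice
    seen_lock = threading.Lock()
    for url in seeds:
        tasks.put(url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:      # sentinel: shut this worker down
                tasks.task_done()
                return
            try:
                for link in FAKE_LINKS.get(url, []):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    tasks.put(link)  # a newly found link becomes a new task
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    tasks.join()                 # block until every queued URL has been processed
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(sorted(crawl(["http://a.example/"])))
```

New links are queued before the current task is marked done, so `tasks.join()` cannot return while work is still being discovered.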

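What are the best practices for LSpider?

Server 1 (2 cores / 4 GB RAM or more): Nginx + MySQL + a MySQL admin interface (phpMyAdmin)

Set the passive scanner's output location to a directory under the web path, so that both results and tasks can be managed through the web.

Deploy LSpider with 5 or more threads and point its proxy at the passive scanner (the passive scanner can be given a dedicated vulnerability-scanning proxy).

Server 2 (optional, but if RabbitMQ is deployed on Server 1, that server needs a better configuration): RabbitMQ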

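Any remaining issues?

LSpider was designed from the start to work with passive scanners such as xray. Unfortunately, as the tool evolved, we came to realize that the crawler cannot really be separated from the passive scanner.

Forcing features that belong in the passive scanner into the crawler puts the cart before the horse, so we started a separate passive scanner project. If the opportunity arises, we will open-source it as well.

Design ideas?

Building a crawler tailored for passive scanners: LSpider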

Usage

Installation & usage

You can test whether the installation succeeded with the following command

python3 manage.py SpiderCoreBackendStart --test

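Note that the following scripts may depend on the project path. Adjust the relevant configuration before using them.

Start the LSpider webhook (default port 2062)

./lspider_webhook.sh

Start LSpider

./lspider_start.sh

Shut down LSpider completely

./lspider_stop.sh

Start the passive scanner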

./xray.sh

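Some key configuration

Configuration notes

How to configure scan tasks and other settings

This covers how to configure scan tasks, authentication information, and the webhook.

Note that the Cookie configuration mentioned in that document uses the format of a Cookie header copied directly from a browser request.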
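
For reference, a Cookie header copied from a browser request looks like `name1=value1; name2=value2`. The sketch below shows one tolerant way to split such a string into name/value pairs. The function name `parse_cookie_header` is hypothetical and this is not LSpider's own parser. It splits on the first `=` only and skips fragments without one, since a naive `split('=')[1]` raises `IndexError` on such input.

```python
def parse_cookie_header(raw):
    """Parse 'a=1; b=2' (optionally prefixed with 'Cookie:') into name/value dicts."""
    if raw.lower().startswith("cookie:"):
        raw = raw[len("cookie:"):]
    cookies = []
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue                      # empty fragment, e.g. a trailing ';'
        name, sep, value = part.partition("=")
        if not sep or not name.strip():
            continue                      # malformed fragment without '=' or a name
        cookies.append({"name": name.strip(), "value": value.strip()})
    return cookies


if __name__ == "__main__":
    print(parse_cookie_header("Cookie: sid=abc; token=a=b=; ; broken"))
```

Values that themselves contain `=` (for example base64 padding) are preserved because only the first `=` separates the name from the value.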

Using the built-in HackerOne and Bugcrowd crawlers to fetch targets

To use the HackerOne crawler, you first need to configure a HackerOne account

python3 manage.py HackeroneSpider {appname}

Likewise, for Bugcrowd use

python3 manage.py BugcrowdSpider {appname}

404StarLink

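LSpider is part of the 404Team StarLink Project (星链计划). If you have questions about LSpider or would like to find others to discuss it with, see the StarLink Project's instructions for joining the group.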

Comments
  • Ran into a problem while using it

    [WARNING] [Thread-5] [00:33:08] [LReq.py:115] [LReq] something error, Traceback (most recent call last):
      File "/home/ubuntuvm/LSpider/utils/LReq.py", line 75, in get
        return method(url, args)
      File "/home/ubuntuvm/LSpider/utils/LReq.py", line 179, in getRespByChrome
        return self.cs.get_resp(url, cookies)
      File "/home/ubuntuvm/LSpider/core/chromeheadless.py", line 134, in get_resp
        self.add_cookie(cookies)
      File "/home/ubuntuvm/LSpider/core/chromeheadless.py", line 192, in add_cookie
        value = cookie.split('=')[1].strip()
    IndexError: list index out of range

    [WARNING] [Thread-5] [00:33:08] [htmlparser.py:86] [AST] something error, Traceback (most recent call last):
      File "/home/ubuntuvm/LSpider/core/htmlparser.py", line 42, in html_parser
        soup = BeautifulSoup(content, "html.parser")
      File "/usr/local/lib/python3.8/dist-packages/bs4/__init__.py", line 310, in __init__
        elif len(markup) <= 256 and (
    TypeError: object of type 'bool' has no len()

    I get these errors and don't know how to fix them.

    opened by 294517102 3
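  • pika.exceptions.AMQPConnectionError error

    Running lspider_start.sh reports pika.exceptions.AMQPConnectionError.

    Environment: Ubuntu 20, Python 3.8, RabbitMQ 3.9.10, Erlang 24.1.7. http://ip:2062 and http://ip:15672 are both reachable, and a new virtual host named lyspider has been created. LSpider and RabbitMQ run on the same machine, with RabbitMQ in Docker, started with:

    docker run -d --hostname rabbit --name some-rabbit -p 15672:15672 rabbitmq:3-management

    The configuration and the error were attached as screenshots.

    Even with a deliberately wrong username and password, docker logs rabbit-log shows no related errors. I suspect an IP/port problem, but nothing looks wrong.

    I have no experience with RabbitMQ or the related modules, and a day of searching Baidu and Google turned up nothing, so I'm asking here. Thanks for any reply!

    opened by KagamigawaMeguri 2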
  • AttributeError: 'ChromeDriver' object has no attribute 'driver'

    The first run works fine, but every run after that fails with:

    [email protected]:/home/tomato/LSpider-1.0.0.1# python3 manage.py SpiderCoreBackendStart --test
    [INFO] [MainThread] [08:48:14] [SpiderCoreBackendStart.py:35] [Spider] start test spider.
    [INFO] [MainThread] [08:48:14] [rabbitmqhandler.py:39] [Monitor][INIT][Rabbitmq] New Rabbitmq link to 127.0.0.1
    [INFO] [MainThread] [08:48:14] [rabbitmqhandler.py:36] [Monitor][INIT] Rabbitmq init success...
    [INFO] [MainThread] [08:48:14] [chromeheadless.py:100] [Chrome Headless] Proxy 127.0.0.1:7777 init
    [ERROR] [MainThread] [08:48:15] [chromeheadless.py:45] [Chrome Headless] ChromeDriver load error.
    [ERROR] [MainThread] [08:48:15] [SpiderCoreBackendStart.py:47] [Spider] something error, Traceback (most recent call last):
      File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 38, in __init__
        self.init_object()
      File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 119, in init_object
        desired_capabilities=desired_capabilities)
      File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
        desired_capabilities=desired_capabilities)
      File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
        self.start_session(capabilities, browser_profile)
      File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
        response = self.execute(Command.NEW_SESSION, parameters)
      File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
      (unknown error: DevToolsActivePort file doesn't exist)
      (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/tomato/LSpider-1.0.0.1/web/spider/management/commands/SpiderCoreBackendStart.py", line 40, in handle
        spidercore = SpiderCore(test_target_list)
      File "/home/tomato/LSpider-1.0.0.1/web/spider/controller/spider.py", line 239, in __init__
        self.req = LReq(is_chrome=True)
      File "/home/tomato/LSpider-1.0.0.1/utils/LReq.py", line 37, in __init__
        self.cs = ChromeDriver()
      File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 46, in __init__
        exit(0)
      File "/usr/lib/python3.6/_sitebuiltins.py", line 26, in __call__
        raise SystemExit(code)
    SystemExit: 0

    Exception ignored in: <bound method ChromeDriver.__del__ of <core.chromeheadless.ChromeDriver object at 0x7f1bb6c546d8>>
    Traceback (most recent call last):
      File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 591, in __del__
        self.close_driver()
      File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 586, in close_driver
        self.driver.quit()
    AttributeError: 'ChromeDriver' object has no attribute 'driver'

    opened by LuckyT0mat0 2
  • Passing environment variables to the Docker rabbitmq image is deprecated

    rabbitmq keeps erroring out and restarting. docker-compose reports:

    rabbitmq | error: RABBITMQ_DEFAULT_PASS is set but deprecated
    rabbitmq | error: RABBITMQ_DEFAULT_USER is set but deprecated
    rabbitmq | error: RABBITMQ_DEFAULT_VHOST is set but deprecated
    rabbitmq | error: deprecated environment variables detected

    According to the official image repository description, this feature was indeed discontinued starting with 3.9.

    I pinned the version to 3.8 in docker-compose.yml, which seems to solve the problem. Alternatively, the author could switch to the configuration-file approach recommended for newer versions.

    rabbitmq:
      image: rabbitmq:3.8
      container_name: rabbitmq
      hostname: rabbitmq
      restart: always

    opened by go1f 0
  • After setting up with Docker, running the command below inside the LSpider container fails with the following error. Could someone explain the cause?

    /opt/LSpider # python3 manage.py SpiderCoreBackendStart --test
    [INFO] [MainThread] [03:55:17] [SpiderCoreBackendStart.py:35] [Spider] start test spider.
    [INFO] [MainThread] [03:55:17] [rabbitmqhandler.py:39] [Monitor][INIT][Rabbitmq] New Rabbitmq link to rabbitmq
    [INFO] [MainThread] [03:55:17] [rabbitmqhandler.py:36] [Monitor][INIT] Rabbitmq init success...
    [INFO] [MainThread] [03:55:17] [chromeheadless.py:100] [Chrome Headless] Proxy 127.0.0.1:7777 init
    [ERROR] [MainThread] [03:55:17] [chromeheadless.py:45] [Chrome Headless] ChromeDriver load error.
    [ERROR] [MainThread] [03:55:17] [SpiderCoreBackendStart.py:47] [Spider] something error, Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 76, in start
        stdin=PIPE)
      File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
        restore_signals, start_new_session)
      File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: '/opt/LSpider/bin/chromedriver': '/opt/LSpider/bin/chromedriver'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/LSpider/core/chromeheadless.py", line 38, in __init__
        self.init_object()
      File "/opt/LSpider/core/chromeheadless.py", line 119, in init_object
        desired_capabilities=desired_capabilities)
      File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
        self.service.start()
      File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
        os.path.basename(self.path), self.start_error_message)
    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/LSpider/web/spider/management/commands/SpiderCoreBackendStart.py", line 40, in handle
        spidercore = SpiderCore(test_target_list)
      File "/opt/LSpider/web/spider/controller/spider.py", line 239, in __init__
        self.req = LReq(is_chrome=True)
      File "/opt/LSpider/utils/LReq.py", line 37, in __init__
        self.cs = ChromeDriver()
      File "/opt/LSpider/core/chromeheadless.py", line 46, in __init__
        exit(0)
      File "/usr/local/lib/python3.7/_sitebuiltins.py", line 26, in __call__
        raise SystemExit(code)
    SystemExit: 0
    
    Exception ignored in: <function ChromeDriver.__del__ at 0x7f91f2b63680>
    Traceback (most recent call last):
      File "/opt/LSpider/core/chromeheadless.py", line 591, in __del__
        self.close_driver()
      File "/opt/LSpider/core/chromeheadless.py", line 586, in close_driver
        self.driver.quit()
    AttributeError: 'ChromeDriver' object has no attribute 'driver'
    
    opened by uunnsec 3
Releases(1.0.2)
Owner
Knownsec, Inc.