GoogleSpider

Crawls Google search results for a given keyword.

Config

DataBase

Data is currently stored in MongoDB. The database configuration is on lines 15-19 of the setting.py file and can be modified as needed.

# MONGODB
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""
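
As a rough illustration of how these settings could be turned into a connection string, here is a minimal sketch. `build_mongo_uri` is a hypothetical helper, not part of this project, and the source does not say which MongoDB driver the spider uses.

```python
from urllib.parse import quote_plus

# Values copied from the configuration above.
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""


def build_mongo_uri(ip, port, user="", password=""):
    """Hypothetical helper: build a MongoDB connection URI from the settings."""
    if user:
        # Credentials must be percent-encoded inside a MongoDB URI.
        return "mongodb://%s:%s@%s:%d" % (
            quote_plus(user), quote_plus(password), ip, port)
    # No credentials configured: connect without authentication.
    return "mongodb://%s:%d" % (ip, port)


uri = build_mongo_uri(MONGO_IP, MONGO_PORT, MONGO_USER_NAME, MONGO_USER_PASS)
print(uri)
```

If the driver is pymongo (an assumption), the resulting string could be passed to `pymongo.MongoClient(uri)` and the database selected with `client[MONGO_DB]`.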

Log

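import os

LOG_NAME = os.path.basename(os.getcwd())  # defaults to the current directory name
LOG_PATH = "log/%s.log" % LOG_NAME  # log file path
LOG_LEVEL = "DEBUG"
LOG_COLOR = True  # colorize log output
LOG_IS_WRITE_TO_CONSOLE = True  # print logs to the console
LOG_IS_WRITE_TO_FILE = True  # write logs to a file
LOG_MODE = "w"  # file open mode
LOG_MAX_BYTES = 10 * 1024 * 1024  # maximum size of each log file, in bytes
LOG_BACKUP_COUNT = 20  # number of log files to keep
LOG_ENCODING = "utf8"  # log file encoding
OTHERS_LOG_LEVAL = "ERROR"  # log level for other (e.g. third-party) loggers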
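
To show what these options plausibly control, the sketch below maps them onto Python's standard `logging` module. `build_logger` is a hypothetical helper written for illustration. The project's real logging setup may differ, and `LOG_COLOR` and `OTHERS_LOG_LEVAL` are not covered here.

```python
import logging
import os
import sys
from logging.handlers import RotatingFileHandler


def build_logger(name, path, level="DEBUG", to_console=True, to_file=True,
                 mode="w", max_bytes=10 * 1024 * 1024, backup_count=20,
                 encoding="utf8"):
    """Hypothetical helper mirroring the LOG_* settings above."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    if to_file:
        # Create the log directory if it does not exist yet.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        # RotatingFileHandler always appends when maxBytes > 0, so `mode`
        # only takes effect when rotation is disabled.
        file_handler = RotatingFileHandler(
            path, mode=mode, maxBytes=max_bytes,
            backupCount=backup_count, encoding=encoding)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    if to_console:
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

With the values above, a call such as `build_logger(LOG_NAME, LOG_PATH)` would write to both the console and a rotating file under `log/`.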

Spider

Download interval
```
SPIDER_SLEEP_TIME = [0, 1]
```
Maximum number of retries (100 by default)
```
SPIDER_MAX_RETRY_TIMES = 100
```
Note

If an illegal interface is encountered during crawling, a 'user agent -- illegal interface' exception is thrown. The crawler then retries the task until the data is crawled successfully or the retry limit (100 by default) is exceeded.
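
The retry behaviour described in the note can be sketched as follows. This is not the project's actual implementation. `fetch_with_retry` is a hypothetical helper, and reading `SPIDER_SLEEP_TIME` as a `[min, max]` range in seconds is an assumption.

```python
import random
import time

SPIDER_SLEEP_TIME = [0, 1]      # assumed to be a [min, max] range in seconds
SPIDER_MAX_RETRY_TIMES = 100


def fetch_with_retry(fetch, max_retry_times=SPIDER_MAX_RETRY_TIMES,
                     sleep_range=SPIDER_SLEEP_TIME):
    """Call fetch() until it succeeds or the retry limit is exceeded."""
    last_error = None
    # One initial attempt plus up to max_retry_times retries.
    for attempt in range(max_retry_times + 1):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            if attempt < max_retry_times:
                # Wait a random interval within sleep_range before retrying.
                time.sleep(random.uniform(sleep_range[0], sleep_range[1]))
    raise RuntimeError(
        "gave up after %d retries" % max_retry_times) from last_error
```

Here `fetch` stands in for whatever callable performs a single request. Any exception it raises, such as the 'user agent -- illegal interface' error, triggers another attempt.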

Data structure

| key | value type | example |
| --- | --- | --- |
| title | str | "Donald Trump - Wikipedia" |
| keyword | str | "Trump" |
| url | str | "https://en.wikipedia.org/wiki/Donald_Trump" |
| text | str | "Donald Trump - Wikipedia 1 hour ago · Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States ... Vice President: Mike Pence In office January 20, 2017 – January 20, 2021: In office; January 20, 2017 – January 20, 2021 Occupation: Politician; businessman; television presenter Parents: Fred Trump; Mary Anne MacLeod" |
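
For readers who consume the stored records in code, the structure above could be expressed as a typed dictionary. This is only a sketch. `SearchResult` and `is_valid_result` are hypothetical names that do not come from this project.

```python
from typing import TypedDict


class SearchResult(TypedDict):
    """One crawled search result, following the table above."""
    title: str
    keyword: str
    url: str
    text: str


def is_valid_result(doc):
    """Check that doc contains the four documented keys, all as strings.

    Extra keys (for example a database-generated id) are allowed.
    """
    expected = SearchResult.__annotations__
    return all(isinstance(doc.get(key), str) for key in expected)


example = SearchResult(
    title="Donald Trump - Wikipedia",
    keyword="Trump",
    url="https://en.wikipedia.org/wiki/Donald_Trump",
    text="Donald Trump - Wikipedia 1 hour ago · Donald John Trump "
         "(born June 14, 1946) is an American politician, media personality, "
         "and businessman who served as the 45th president of the United States ...",
)
print(is_valid_result(example))
```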

Quick start

Crawl 3 pages of results for the keyword 'Trump':

from spiders.google_curl import GoogleCurl

spider = GoogleCurl('Trump', 3)
spider.start()

The first parameter is the search keyword, and the second parameter is the number of pages to crawl.
