Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.

Related tags

Web Crawlingcrawley
Overview

Pythonic Crawling / Scraping Framework Built on Eventlet


Build Status Code Climate Stories in Ready

Features

  • High Speed WebCrawler built on Eventlet.
  • Supports relational databases engines like Postgre, Mysql, Oracle, Sqlite.
  • Supports NoSQL databased like Mongodb and Couchdb. New!
  • Export your data into Json, XML or CSV formats. New!
  • Command line tools.
  • Extract data using your favourite tool. XPath or Pyquery (A Jquery-like library for python).
  • Cookie Handlers.
  • Very easy to use (see the example).

Documentation

http://packages.python.org/crawley/

Project WebSite

http://project.crawley-cloud.com/


To install crawley run

~$ python setup.py install

or from pip

~$ pip install crawley

To start a new project run

~$ crawley startproject [project_name]
~$ cd [project_name]

Write your Models

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    
    #add your table fields here
    updated = Field(Unicode(255))    
    package = Field(Unicode(255))
    description = Field(Unicode(255))

Write your Scrapers

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    
    #specify the urls that can be scraped by this class
    matching_urls = ["%"]
    
    def scrape(self, response):
                        
        #getting the current document's url.
        current_url = response.url        
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        
        #for rows 1 to n-1
        for tr in table[1:-1]:
                        
            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            
            #storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):
    
    #add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]
    
    #add your scraper classes here    
    scrapers = [pypiScraper]
    
    #specify you maximum crawling depth level    
    max_depth = 0
    
    #select your favourite HTML parsing tool
    extractor = XPathExtractor

Configure your settings

""" settings.py """

import os 
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this if you don't have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'     
DATABASE_NAME = 'pypi'  
DATABASE_USER = ''             
DATABASE_PASSWORD = ''         
DATABASE_HOST = ''             
DATABASE_PORT = ''     

SHOW_DEBUG_INFO = True

Finally, just run the crawler

~$ crawley run
Owner
Juan Manuel Garcia
Pasionate Python Developer
Juan Manuel Garcia
12306抢票脚本

12306抢票脚本

罐子里的茶 457 Jan 05, 2023
script to scrape direct download links (ddls) from google drive index.

bhadoo Google Personal/Shared Drive Index scraper. A small script to scrape direct download links (ddls) of downloadable files from bhadoo google driv

sαɴᴊɪᴛ sɪɴʜα 53 Dec 16, 2022
A low-code tool that generates python crawler code based on curl or url

KKBA Intruoduction A low-code tool that generates python crawler code based on curl or url Requirement Python = 3.6 Install pip install kkba Usage Co

8 Sep 20, 2021
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
Automated Linkedin bot that will improve your visibility and increase your network.

LinkedinSpider LinkedinSpider is a small project using browser automating to increase your visibility and network of connections on Linkedin. DISCLAIM

Frederik 2 Nov 26, 2021
Simply scrape / download all the media from an fansly account.

Simply scrape / download all the media from an fansly account. Providing updates as long as its continuously gaining popularity, so hit the ⭐ button!

Mika C. 334 Jan 01, 2023
A web crawler for recording posts in "sina weibo"

Web Crawler for "sina weibo" A web crawler for recording posts in "sina weibo" Introduction This script helps collect attributes of posts in "sina wei

4 Aug 20, 2022
Incredibly fast crawler designed for OSINT.

Photon Incredibly fast crawler designed for OSINT. Photon Wiki • How To Use • Compatibility • Photon Library • Contribution • Roadmap Key Features Dat

Somdev Sangwan 9.3k Jan 02, 2023
Twitter Claimer / Swapper / Turbo - Proxyless - Multithreading

Twitter Turbo / Auto Claimer / Swapper Version: 1.0 Last Update: 01/26/2022 Use this at your own descretion. I've only used this on test accounts and

Underscores 6 May 02, 2022
A simple proxy scraper that utilizes the requests module in python.

Proxy Scraper A simple proxy scraper that utilizes the requests module in python. Usage Depending on your python installation your commands may vary.

3 Sep 08, 2021
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
Web scrapping

Project Setup Table of Contents Project Setup Table of Contents Run project locally Install Requirements Run script Run project locally Install Requir

Charles 3 Feb 04, 2022
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 04, 2023
An automated, headless YouTube Watcher and Scraper

Searches YouTube, queries recommended videos and watches them. All fully automated and anonymised through the Tor network. The project consists of two independently usable components, the YouTube aut

44 Oct 18, 2022
Minecraft Item Scraper

Minecraft Item Scraper To run, first ensure you have the BeautifulSoup module: pip install bs4 Then run, python minecraft_items.py folder-to-save-ima

Jaedan Calder 1 Dec 29, 2021
Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup.

WebScrapperRoBot Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup. Mark your Star ⭐ ⭐ What is Web Scraping ? Web s

Nuhman Pk 53 Dec 21, 2022
Searching info from Google using Python Scrapy

Python-Search-Engine-Scrapy || Python-爬虫-索引/利用爬虫获取谷歌信息**/ Searching info from Google using Python Scrapy /* 利用 PYTHON 爬虫获取天气信息,以及城市信息和资料**/ translatio

HONGVVENG 1 Jan 06, 2022
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 07, 2022
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021