Snowflake Stage Exporter

Overview

Snowflake database loading utility with Scrapy integration.
Meant for streaming ingestion of JSON serializable objects into Snowflake stages and tables.

Installation

> pip install git+https://github.com/hermit-crab/snowflake-stage-exporter.git

Basic example

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
) as exporter:
    exporter.export_item({'name': 'Jack', 'salary': 100}, item_type_name='employee')
    exporter.export_item({'name': 'Sal', 'salary': 90, 'extra_info': {'age': 20}}, item_type_name='employee')
    exporter.export_item({'title': 'Steel Grill', 'price': 5.5}, item_type_name='product')
    exporter.finish_export()  # flushes all stage buffers, creates tables and populates them with data inside stages

After you call finish_export(), 2 tables will be created: EMPLOYEE (2 rows, 3 columns) and PRODUCT (1 row, 2 columns), located in the database MY_DATABASE under the schema PUBLIC (Snowflake's default schema).

The same thing can be achieved via the Scrapy integration, sketched below.
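
As a rough sketch of what the Scrapy side looks like (the SNOWFLAKE_* setting names follow the convention from the Configuration (Scrapy) section below; the extension path used to enable the integration is a placeholder, not a confirmed import path):

# settings.py -- a sketch only; the extension path below is hypothetical,
# check the project source for the actual class to enable.
EXTENSIONS = {
    'snowflake_stage_exporter.scrapy_extension.SnowflakeStageExporter': 100,  # assumed path
}

# Exporter parameters are exposed as SNOWFLAKE_-prefixed settings.
SNOWFLAKE_USER = '...'
SNOWFLAKE_PASSWORD = '...'
SNOWFLAKE_ACCOUNT = '...'
SNOWFLAKE_TABLE_PATH = 'MY_DATABASE.PUBLIC.{item_type_name}'

With this in place, items yielded by your spiders are exported the same way as in the basic example, with item_type_name derived from type(item).__name__.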

How this works

Each object fed into the exporter is written into a local buffer (a temporary JSON file). Once the configurable maximum buffer size is reached, the file is uploaded to a Snowflake internal stage via a PUT statement. At the end of execution the exporter creates all specified tables, then instructs Snowflake to populate each table from every staged JSON file via COPY INTO statements.

  • If you output to multiple tables, a separate buffer is maintained for each.
  • Alternatively, you can create / populate tables as soon as the buffers are flushed via the *_on parameters described below.
  • The *_on parameters also allow you to disable table creation / population entirely and deal with the stages yourself (a sketch of this follows the list).
  • For table creation the exporter tries to figure out column types dynamically during execution; alternatively, you can pass them explicitly via the predefined_column_types parameter.
Why "Stages"?

The use of local buffers and stages, as opposed to typical SQL INSERT statements, is motivated largely by Snowflake's performance characteristics and billing model (see https://community.snowflake.com/s/question/0D50Z00008JpBymSAF/implications-of-multiple-insert-statements-vs-copy-into).

An illustrative example is a long-running Scrapy / ScrapyCloud job that outputs data continuously. If the job kept a connection constantly executing INSERTs, Snowflake would keep the warehouse running and consuming credits for the entire duration of the job.

A secondary consideration was allowing the user to work purely with the stages, much like one would work with S3 or similar blob storage. This covers cases where the user needs their own, more sophisticated table management approach and simply wants a convenient place to store raw data.

IMPORTANT NOTE: this library won't make much sense if your data already lives in S3 or similar storage (including the local machine) in one of the Snowflake-supported file formats. Snowflake has built-in support for ingesting from several third-party blob storages, and local files can be uploaded directly via PUT statements.

Configuration

All configuration is done via arguments to the main exporter class SnowflakeStageExporter (a combined sketch follows the list below).

  • user/password/account - Snowflake account credentials, passed as is to snowflake.connector.connect.
  • connection_kwargs - any additional parameters to snowflake.connector.connect.
  • table_path - table path to use.
    • If you specify database / database schema in connection_kwargs you won't need to specify them in the table path.
    • The path can include template variables which are expanded when you feed an item to the exporter. By default only the item variable is passed (e.g. 'MY_DB.PUBLIC.TABLE_{item[entity_type]}', which assumes all of your items have an "entity_type" field).
    • You can pass any additional variables yourself as keyword arguments when calling exporter.export_item().
    • Additionally, in the Scrapy integration the following variables are passed:
      • spider - spider instance.
      • item_type_name - type(item).__name__. In the basic example above you passed this explicitly yourself.
  • stage - which internal stage to use. By default the user stage ("@~") is used.
  • stage_path - naming for the files being uploaded to the stage.
    • By default it's "{table_path}/{instance_ms}_{batch_n}.jl", where table_path is the table path with all variables resolved, instance_ms is the epoch milliseconds at which the exporter was instantiated, and batch_n is the sequential number of the buffer.
    • In the Scrapy integration the default is "{table_path}/{job}/{instance_ms}_{batch_n}.jl", where job is the key of the ScrapyCloud job, or "local" if the spider ran locally.
  • max_file_size - maximum buffer size in bytes. 1GiB by default.
  • predefined_column_types - dictionary mapping table_path to Snowflake column types for table creation.
    • e.g. {"MY_DB.PUBLIC.PRODUCT": {"title": "STRING", "price": "NUMBER"}, "MY_DB.PUBLIC.EMPLOYEE": {"name": "STRING", "salary": "NUMBER", "extra_info": "OBJECT"}}.
  • ignore_unexpected_fields - ignore fields not passed in predefined_column_types during table creation / population.
    • True by default, but it only takes effect when the table does have predefined column types.
    • The data is still exported in full to the staged files.
  • allow_varying_value_types - if False, columns that had multiple value types are skipped during table creation / population.
    • True by default; the VARIANT type is assigned to such columns.
    • An error is logged when this is False and such a column is encountered.
    • Takes effect only when the exporter needs to figure out the column type itself.
    • The data is still exported in full to the staged files.
  • create_tables_on - one of "finish" / "flush" / "never". "finish" by default; "flush" means each time a file is staged.
  • populate_tables_on - ditto.
  • clear_stage_on - same as above, but "never" is the default. When enabled, each file is removed from the stage individually.
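
Putting several of these parameters together (a sketch; values are illustrative and only the parameters listed above are used):

from snowflake_stage_exporter import SnowflakeStageExporter

# A sketch combining several of the parameters described above.
with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    connection_kwargs={'warehouse': 'MY_WH'},  # extra kwargs passed to snowflake.connector.connect
    table_path='MY_DB.PUBLIC.TABLE_{item[entity_type]}',  # expanded per exported item
    stage='@~',  # user stage (the default)
    max_file_size=256 * 1024 ** 2,  # flush buffers at 256 MiB instead of the default 1 GiB
    predefined_column_types={
        'MY_DB.PUBLIC.TABLE_PRODUCT': {'title': 'STRING', 'price': 'NUMBER'},
    },
    create_tables_on='flush',    # create tables as soon as a buffer is staged
    populate_tables_on='flush',  # populate them at the same time
    clear_stage_on='never',      # leave staged files in place (the default)
) as exporter:
    exporter.export_item({'entity_type': 'product', 'title': 'Steel Grill', 'price': 5.5})
    exporter.finish_export()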

Configuration (Scrapy)

All of the exporter instance parameters are exposed as Scrapy settings prefixed with SNOWFLAKE_ (e.g. SNOWFLAKE_MAX_FILE_SIZE).

Once a Scrapy job ends, all remaining buffers are flushed. If the job outcome is not "finished" (i.e. something went wrong), then no table creation, table population, or stage clearing takes place.

TODO

  • Unit tests >_>.
  • Test on Windows?