Explore scraping with BeautifulSoup!

Last update: Oct 05, 2022

beautifulsoup-scrape

Part One: Start from Shakespeare

My professor is a poet (yes, and he also teaches me data and databases), so he loves to give us assignments related to literature.

The starter project with BeautifulSoup scrapes the first act of William Shakespeare's The Tempest.

My notebook is shakespeare-scrape.ipynb.

The code includes:

cook a soup doc, that is, download the HTML text of a webpage and parse it
search for certain elements such as div/p/ul, or for certain attributes such as class
locate a nearby element with .parent or .find_next_sibling()
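
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline HTML string is a hypothetical stand-in for the downloaded page, and the class names `scene`, `direction` and `speech` are assumptions made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup will differ.
html = """
<html><body>
  <h3>ACT I</h3>
  <div class="scene">
    <p class="direction">Enter a Master and a Boatswain</p>
    <p class="speech">Boatswain!</p>
    <p class="speech">Here, master: what cheer?</p>
  </div>
  <ul><li>Alonso</li><li>Prospero</li></ul>
</body></html>
"""

# Step 1: cook the soup doc with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 2: search by element name, then by element name plus class attribute.
first_p = soup.find("p")
speeches = soup.find_all("p", class_="speech")
speech_texts = [p.get_text(strip=True) for p in speeches]
characters = [li.get_text() for li in soup.find("ul").find_all("li")]

# Step 3: move around the tree from an element you already found.
container = first_p.parent                  # the enclosing <div class="scene">
next_line = first_p.find_next_sibling("p")  # the next <p> at the same level

print(speech_texts)
print(container.name, next_line.get_text())
```

For a live page, the `html` string would instead come from whatever download step the notebook uses.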

Part Two: Develop with Supreme Court Decisions

In this part, I scrape the Supreme Court's 2020 decisions.

The notebook is guardian-and-supreme-court.ipynb.

The code includes:

use a for loop to print each element in a list
find the link hidden in an attribute
save the output in a list of lists, or even a three-level nested list
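
A rough sketch of these ideas, assuming a simplified listing page. The markup, the case names and the paths below are invented placeholders, not the court's real page or data.

```python
from bs4 import BeautifulSoup

# Placeholder markup: every name, date and path here is made up for illustration.
html = """
<ul>
  <li><a href="/opinions/case-a.pdf">Case A v. Example</a> <span class="date">6/1/20</span></li>
  <li><a href="/opinions/case-b.pdf">Case B v. Example</a> <span class="date">6/8/20</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []  # will become a list of lists: one inner list per decision
for item in soup.find_all("li"):
    link = item.find("a")
    date = item.find("span", class_="date")
    # The link itself lives in the href attribute, not in the visible text.
    rows.append([link.get_text(strip=True), date.get_text(strip=True), link["href"]])

for row in rows:
    print(row)
```

Wrapping several such `rows` lists in one outer list (for example, one per page or per month) is one way to end up with three levels of nesting.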

Part Three: More practice with The Guardian

The webpage I scrape is the Best Non-Fiction Books of All Time listed by The Guardian.

The notebook is the same one used in Part Two!

You will find a surprise when you get the soup doc of that page: an advertisement hidden in the HTML!

The code is similar to the last project, but there is more:

list comprehension
list of liiiissssst
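
As an illustration, a list comprehension can build the list of lists in one expression and skip an advertisement block along the way. The markup and the class names `book`, `ad` and `author` are assumptions for this sketch, and the titles are placeholders rather than entries from The Guardian's list.

```python
from bs4 import BeautifulSoup

# Placeholder markup with an ad block sitting between two book entries.
html = """
<div class="book"><h2>Title One</h2><p class="author">Author A</p></div>
<div class="ad">Sponsored: subscribe today</div>
<div class="book"><h2>Title Two</h2><p class="author">Author B</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# One inner [title, author] list per book; filtering on class_="book"
# leaves the ad block out of the result.
books = [
    [div.h2.get_text(strip=True), div.find("p", class_="author").get_text(strip=True)]
    for div in soup.find_all("div", class_="book")
]

print(books)
```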

Bonus: More Real Shakespeare

In this case, I try to pull out the first 100 lines of Twelfth Night, available here.

The notebook is the same one used in Part Two!

My professor really does love Shakespeare.

I had trouble with this project for a long time because each output row had to contain:

a code in act.scene.line form, along with whether the row is a stage direction
the speaker, or the last person who spoke before the stage direction
the line or stage direction itself

I figured it out in a very complex way and I believe there is a better way to do it!
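
One possibly simpler approach is to walk the elements in document order while remembering the current speaker and line number. The sketch below is not the notebook's solution. It assumes a hypothetical, tidy markup (a `div` per scene with `data-act` and `data-scene` attributes, and `p` tags classed `speaker`, `line` or `sd`) that the real page almost certainly does not match, and the function name `extract_rows` is made up. It also assumes a stage direction reuses the number of the most recent spoken line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder text; the real play page will differ.
sample_html = """
<div class="scene" data-act="1" data-scene="1">
  <p class="sd">Enter SPEAKER A and SPEAKER B</p>
  <p class="speaker">SPEAKER A</p>
  <p class="line">First spoken line</p>
  <p class="line">Second spoken line</p>
  <p class="sd">Exit SPEAKER A</p>
  <p class="speaker">SPEAKER B</p>
  <p class="line">Third spoken line</p>
</div>
"""


def extract_rows(html, limit=100):
    """Return up to `limit` rows of [code, is_direction, speaker, text]."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for scene in soup.find_all("div", class_="scene"):
        act, scene_no = scene["data-act"], scene["data-scene"]
        speaker = None  # last person who spoke in this scene, if any
        line_no = 0     # number of the most recent spoken line
        for tag in scene.find_all("p"):
            classes = tag.get("class", [])
            text = tag.get_text(strip=True)
            if "speaker" in classes:
                speaker = text  # remember the speaker, emit no row
                continue
            is_direction = "sd" in classes
            if not is_direction:
                line_no += 1    # only spoken lines advance the counter
            rows.append([f"{act}.{scene_no}.{line_no}", is_direction, speaker, text])
            if len(rows) == limit:
                return rows
    return rows


rows = extract_rows(sample_html)
for row in rows:
    print(row)
```

Note that `limit` here counts every emitted row, stage directions included; counting only spoken lines would need a separate counter.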

Owner

Chuqin
