robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Last update: Dec 27, 2022

Overview

RoboBrowser: Your friendly neighborhood web scraper

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...
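The regex filter in the last step can be explored without any network access. The sketch below uses only the standard library and assumes, as the BeautifulSoup documentation describes, that a compiled pattern is applied to each class name with `search()`. The helper `matching_classes` and the candidate class names are hypothetical and exist only for illustration.

```python
import re

# Same pattern as in the example above.
pattern = re.compile(r'\blyrics\b')


def matching_classes(class_names):
    """Return the class names in which the pattern is found (search semantics)."""
    return [name for name in class_names if pattern.search(name)]


# Hypothetical class names, chosen to show where the word boundaries fall.
candidates = ['lyrics', 'song-lyrics', 'lyrics_body', 'lyricsbox', 'Lyrics']
print(matching_classes(candidates))  # ['lyrics', 'song-lyrics']
```

A hyphen counts as a word boundary but an underscore does not, so `song-lyrics` matches while `lyrics_body` does not. Matching is also case-sensitive unless a flag such as `re.I` is passed.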

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be get and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']
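The value-or-label behavior shown above can be pictured with a small stand-alone model. This is a toy sketch, not RoboBrowser's actual implementation: the class `CheckboxModel` and its whitespace-stripping of labels are assumptions made for illustration, and it mirrors only what the example documents (values settable by option or by label, `ValueError` otherwise).

```python
class CheckboxModel:
    """Toy model of a checkbox group; hypothetical, not RoboBrowser's code."""

    def __init__(self, options, labels):
        self.options = list(options)
        # Assumption: labels are compared with surrounding whitespace stripped.
        self.labels = [label.strip() for label in labels]
        self._value = []

    @property
    def value(self):
        # Return a copy so callers cannot mutate internal state.
        return list(self._value)

    @value.setter
    def value(self, selected):
        resolved = []
        for item in selected:
            if item in self.options:
                resolved.append(item)
            elif item.strip() in self.labels:
                # Map a label back to the option at the same position.
                resolved.append(self.options[self.labels.index(item.strip())])
            else:
                raise ValueError("Option {!r} not found in field".format(item))
        self._value = resolved


field = CheckboxModel(["Bike", "Car"], ["I have a bike", "I have a car \r\n"])
field.value = ["I have a bike"]
print(field.value)  # ['Bike']
```

In this model a failed assignment leaves the previous selection untouched, because the new list is built in full before it replaces the old one.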

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'rb')  # binary mode is safest for uploads

# Submit
browser.submit_form(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Owner

Joshua Carp

A dead simple crawler to get books information from Douban.

Scrape and display grades onto the console

Comment Webpage Screenshot is a GitHub Action that captures screenshots of web pages and HTML files located in the repository

A helper library to scrape data from TikTok in one line, using the Influencer Hunters APIs.

A low-code tool that generates python crawler code based on curl or url

This script is intended to crawl license information of repositories through the GitHub API.

AssistScraper - a program for /r/nba that lists every player a given player assisted and how many assists each one received

Current Antarctic large iceberg positions derived from ASCAT and OSCAT-2

Scrapy uses Request and Response objects for crawling web sites.

Incredibly fast crawler designed for OSINT.

New World Market Scraper

Xuexi Qiangguo (学习强国) automation: 100% correct, instant answers, worth 45 points

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

API to parse tibia.com content into python objects.

Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

Scrape all the media from an OnlyFans account - Updated regularly

Explore scraping with BeautifulSoup!

Scrape the 42 Intranet's elearning videos in a single click

A webdriver-based script for reserving Tsinghua badminton courts.

Dude is a very simple framework for writing web scrapers using Python decorators
