robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Overview

RoboBrowser: Your friendly neighborhood web scraper

https://badge.fury.io/py/robobrowser.png https://travis-ci.org/jmcarp/robobrowser.png?branch=master https://coveralls.io/repos/jmcarp/robobrowser/badge.png?branch=master

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be get and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'r')

# Submit
browser.submit(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

  • Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Owner
Joshua Carp
Joshua Carp
A web crawler script that crawls the target website and lists its links

A web crawler script that crawls the target website and lists its links || A web crawler script that lists links by scanning the target website.

2 Apr 29, 2022
Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Scrapy Cluster This Scrapy project uses Redis and Kafka to create a distributed

Hanh Pham Van 0 Jan 06, 2022
Get-web-images - A python code that get images from any site

image retrieval This is a python code to retrieve an image from the internet, a

CODE 1 Dec 30, 2021
Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

Albert Marrero 1 Jan 12, 2022
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 48 Jan 03, 2023
Scrape Twitter for Tweets

Backers Thank you to all our backers! 🙏 [Become a backer] Sponsors Support this project by becoming a sponsor. Your logo will show up here with a lin

Ahmet Taspinar 2.2k Jan 05, 2023
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
河南工业大学 完美校园 自动校外打卡

HAUT-checkin 河南工业大学自动校外打卡 由于github actions存在明显延迟,建议直接使用腾讯云函数 特点 多人打卡 使用简单,仅需账号密码以及用于微信推送的uid 自动获取上一次打卡信息用于打卡 向所有成员微信单独推送打卡状态 完美校园服务器繁忙时造成打卡失败会自动重新打卡

36 Oct 27, 2022
Automated Linkedin bot that will improve your visibility and increase your network.

LinkedinSpider LinkedinSpider is a small project using browser automating to increase your visibility and network of connections on Linkedin. DISCLAIM

Frederik 2 Nov 26, 2021
Google Scholar Web Scraping

Google Scholar Web Scraping This is a python script that asks for a user to input the url for a google scholar profile, and then it writes publication

Suzan M 1 Dec 12, 2021
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
Lovely Scrapper

Lovely Scrapper

Tushar Gadhe 2 Jan 01, 2022
基于Github Action的定时HITsz疫情上报脚本,开箱即用

HITsz Daily Report 基于 GitHub Actions 的「HITsz 疫情系统」访问入口 定时自动上报脚本,开箱即用。 感谢 @JellyBeanXiewh 提供原始脚本和 idea。 感谢 @bugstop 对脚本进行重构并新增 Easy Connect 校内代理访问。

Ter 56 Nov 27, 2022
A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

appelsiensam 4 Jul 26, 2022
A Python module to bypass Cloudflare's anti-bot page.

cloudflare-scrape A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Reque

3k Jan 04, 2023
Get paper names from dblp.org

scraper-dblp Get paper names from dblp.org and store them in a .txt file Useful for a related literature :) Install libraries pip3 install -r requirem

Daisy Lab 1 Dec 07, 2021
Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

Guilherme Silva Uchoa 3 Oct 04, 2022
Dex-scrapper - Hobby project for scrapping dex data on VeChain

Folders /zumo_abis # abi extracted from zumo repo /zumo_pools # runtime e

3 Jan 20, 2022
👨🏼‍⚖️ reddit bot that turns comment chains into ace attorney scenes

Ace Attorney reddit bot 👨🏼‍⚖️ Reddit bot that turns comment chains into ace attorney scenes. You'll need to sign up for streamable and reddit and se

763 Nov 17, 2022