A Python library for automating interaction with websites.

Overview

MechanicalSoup. A Python library for automating website interaction.

Home page

https://mechanicalsoup.readthedocs.io/

Overview

A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript.

MechanicalSoup was created by M Hickford, who was a fond user of the Mechanize library. Unfortunately, Mechanize was incompatible with Python 3 until 2019 and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation). Since 2017 it has been actively maintained by a small team including @hemberger and @moy.

Installation

PyPy3 is also supported (and tested against).

Download and install the latest released version from PyPI:

pip install MechanicalSoup

Download and install the development version from GitHub:

pip install git+https://github.com/MechanicalSoup/MechanicalSoup

Installing from source (installs the version in the current working directory):

python setup.py install

(In all cases, add --user to the install command to install in the current user's home directory.)

Documentation

The full documentation is available on https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the automatically generated API documentation.

Example

From examples/expl_qwant.py, code to get the results from a Qwant search:

"""Example usage of MechanicalSoup to get the results from the Qwant
search engine.
"""

import re
import mechanicalsoup
import html
import urllib.parse

# Connect to Qwant
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open("https://lite.qwant.com/")

# Fill-in the search form
browser.select_form('#search-form')
browser["q"] = "MechanicalSoup"
browser.submit_selected()

# Display the results
for link in browser.page.select('.result a'):
    # Qwant shows redirection links, not the actual URL, so extract
    # the actual URL from the redirect link:
    href = link.attrs['href']
    m = re.match(r"^/redirect/[^/]*/(.*)$", href)
    if m:
        href = urllib.parse.unquote(m.group(1))
    print(link.text, '->', href)

More examples are available in examples/.

For an example with a more complex form (checkboxes, radio buttons and textareas), read tests/test_browser.py and tests/test_form.py.

Development

Instructions for building, testing and contributing to MechanicalSoup: see CONTRIBUTING.rst.

Common problems

Read the FAQ.

Comments
  • Submit an empty file when leaving a file input blank

    This is in regards to issue #250

    For the tests, I followed @moy's train of thought:

    • they are basically a copy+paste without the creation of a temp file
    • assert value["doc"] == "" checks that the response contains an empty file

    I thought a different test definition was necessary. Was I right to assume so?

    In browser.py, I changed the continue around line 179 to something similar to what has been done in test__request_file here

    There are two "Add no file input submit test" commits: the second one is simply a cleanup of some commented code. Will avoid it next time!

    I was unable to run test_browser.py due to an import error on modules that are installed, so I'm submitting this pull request somewhat blindly. Does it matter that I say I'm confident in the changes though?

    opened by senabIsShort 27
  • MechanicalSoup logo

    In the Roadmap, some artwork is requested. I asked an artistic friend to try to interpret this request, and this is what they came up with. I would love to use this as our logo (in both the README, as per the roadmap, and perhaps also as our organization icon). Before I make a PR, I just wanted to see if this was what you were going for.

    opened by hemberger 20
  • Tests randomly hanging on Travis-CI

    Every couple of Travis builds, I see one of the sub-builds hang. It happens frequently enough that I feel like I have to babysit Travis, which is not a good situation to be in. From what I can tell, this occurs under two conditions:

    1. httpbin.org is under heavy load (this occurs infrequently, but can occur for extended periods of time)
    2. flake8 hangs for some unknown reason (seems arbitrary, and rerunning almost always fixes it)

    I really want to understand 2), because for 1) we could simply rely on httpbin.org a bit less if necessary.

    opened by hemberger 18
  • Remove `name` attribute from all unused buttons on form submit

    I ran into a site with forms including buttons of type "button" with name attributes. Because Form.choose_submit() was only removing name from buttons of type "submit", the values for the "button" buttons were being erroneously sent on POST, thereby breaking my submission. This patch fixes the issue, even when a submit button isn't explicitly chosen.

    Note that all buttons that aren't of type "button" or "reset" function as "submit" in all major browsers and should therefore be choosable.
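
The rule in the note above can be written as a small predicate. This is only a sketch of the rule as stated here, not MechanicalSoup's implementation; the function name `acts_as_submit` is hypothetical, and it covers `<button>` elements only.

```python
def acts_as_submit(button_type):
    """Return True if a <button> with this type attribute submits its form.

    A missing type attribute (None) is treated like "submit"; the comparison
    ignores case and surrounding whitespace.
    """
    if button_type is None:
        return True
    return button_type.strip().lower() not in ("button", "reset")


# Only "button" and "reset" are excluded; anything else behaves as submit.
results = {t: acts_as_submit(t) for t in ("submit", "button", "reset", "Bogus")}
print(results)
```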

    opened by blackwind 16
  • Do not submit disabled <input> elements

    https://www.w3.org/TR/html52/sec-forms.html#element-attrdef-disabledformelements-disabled

    The disabled attribute is used to make the control non-interactive and to prevent its value from being submitted.

    MechanicalSoup ignores disabled attributes which should be fixed.

    Some additional notes: (from https://www.wufoo.com/html5/disabled-attribute/)

    • If the disabled attribute is set on a <fieldset>, the descendent form controls are disabled.
    • A disabled field can’t be modified, tabbed to, highlighted, or have its contents copied. Its value is also ignored when the form goes thru constraint validation.
    • The disabled value is Boolean, and therefore doesn’t need a value. But, if you must, you can include disabled="disabled".
    • Setting the value of the disabled attribute to null does not remove the effects of the attribute. Instead use removeAttribute('disabled').
    • You can target elements that are disabled with the :disabled pseudo-class. Or, if you want to specifically target the presence of the attribute, you can use input[disabled]. Similarly, you can use :enabled and input:not([disabled]) to target elements that are not disabled.
    • You do not need to include aria-disabled="true" when including the disabled attribute because disabled is already well supported. However, if you are programmatically disabling an element that is not a form control and therefore the disabled attribute does not apply, include aria-disabled="true".
    • The disabled attribute is valid for all form controls including all <input> types, <textarea>, <button>, <select>, <fieldset>, and <keygen>.
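
As a rough illustration of the behaviour requested above, the sketch below collects the name/value pairs that would be submitted, skipping any `<input>` that is disabled or sits inside a disabled `<fieldset>`. It uses only the standard library's `html.parser`, not MechanicalSoup's own code. The class name `SubmittableInputs` is hypothetical, only `<input>` elements are handled, and the HTML exception for controls inside a fieldset's first `<legend>` is ignored.

```python
from html.parser import HTMLParser


class SubmittableInputs(HTMLParser):
    """Collect (name, value) pairs of <input> elements that are not disabled."""

    def __init__(self):
        super().__init__()
        self.fields = []
        # One boolean per open <fieldset>: True if that fieldset is disabled.
        self.fieldsets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "fieldset":
            self.fieldsets.append("disabled" in attrs)
        elif tag == "input" and "name" in attrs:
            # A boolean attribute counts as set even when it has no value.
            if "disabled" in attrs or any(self.fieldsets):
                return
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "fieldset" and self.fieldsets:
            self.fieldsets.pop()


FORM = """
<form>
  <input name="a" value="1">
  <input name="b" value="2" disabled>
  <fieldset disabled><input name="c" value="3"></fieldset>
  <input name="d" value="4" disabled="disabled">
  <input name="e" value="5">
</form>
"""

parser = SubmittableInputs()
parser.feed(FORM)
submitted = parser.fields
print(submitted)
```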
    opened by 5j9 14
  • browser.follow_link() has no way to pass kwargs to requests

    As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with

        kwargs = { "verify": False }
        # ...
        r = br.submit_selected(**kwargs)
    

    which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

    So instead of

        r = br.follow_link(text='Link anchor', **kwargs)
    

    I end up with

        link = br.find_link(text='Link anchor')
        r = br.open_relative(link['href'], **kwargs)
    

    I am not sure how to fix this. Some thoughts:

    1. If nothing changes, add some more clarity to browser.follow_link()'s documentation explaining how to work around this situation.
    2. Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
    3. Send the same **kwargs parameter to both

    Maybe there's a better way. I guess in my case I could set this state in requests' Session object, ~which I think would be browser.session.merge_environment_settings(...)~ no, that's not right, I'm not sure how to accomplish it actually.

    opened by johnhawkinson 13
  • Replace httpbin.org with pytest-httpbin in tests

    The pytest-httpbin module provides pytest support for the httpbin module (which is the code that runs the remote server http://httpbin.org). This locally spins up an internal webserver when tests are run.

    With this change, MechanicalSoup tests can be run without an internet connection. As a result, the tests run much faster.

    You may need the python{,3}-dev package on your system to pip install the pytest-httpbin module.

    deferred 
    opened by hemberger 13
  • No parser was explicitly specified

    /usr/local/lib/python3.4/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    To get rid of this warning, change this:

    BeautifulSoup([your markup])

    to this:

    BeautifulSoup([your markup], "lxml")

    markup_type=markup_type))

    Need to use add_soup method or what?

    opened by stdex 12
  • Add get_request_kwargs to check before requesting

    When we use mechanicalsoup, we sometimes want to verify a request before submitting it.

    If you merge this pull request, the package will be able to provide a way for the package's users to review the request.

    This is my first pull request for this project. Please let me know if I'm missing anything.

    opened by kumarstack55 11
  • Set up LGTM and fix warnings

    LGTM.com finds one issue in our code, and it seems legitimate to me (although I'm guilty for introducing it):

    https://lgtm.com/projects/g/hickford/MechanicalSoup/

    We should fix this, and configure lgtm so that it checks pull-requests.

    opened by moy 11
  • Problems calling "follow_link" with "url_regex"

    Dear Dan and Matthieu,

    first things first: Thanks for conceiving and maintaining this great library. We also switched one of our implementations over from "mechanize" as per https://github.com/ip-tools/ip-navigator/commit/a26c3a8a and it worked really well.

    When doing so, we encountered a minor problem when trying to call the follow_link method with the url_regex keyword argument like

    response = self.browser.follow_link(url_regex='register/PAT_.*VIEW=pdf', headers={'Referer': result.url})
    

    This raises the exception

    TypeError: links() got multiple values for keyword argument 'url_regex'
    

    I am currently a bit short on time, otherwise I would have submitted a pull request without further ado. Thanks a bunch for looking into this issue.

    With kind regards, Andreas.

    opened by amotl 11
  • browser.links() should return an empty list if self.page is None

    I was writing a fuzzer for a cybersecurity assignment, and it crashed when it tried to find the links on a PDF file. I think it would make more sense to return that there are no links, if the page fails to parse. This seems relatively straightforward to implement.
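
Until the library behaves this way, a caller-side guard along these lines might do. It is a sketch that assumes the browser exposes the `page` property (described in the v1.0.0 release notes) and that `page` is `None` when the last response was not HTML; the helper name `safe_links` is hypothetical.

```python
def safe_links(browser):
    """Return the links on the current page, or [] if no page was parsed."""
    if browser.page is None:
        # For example, the last response was a PDF, so there is no soup.
        return []
    return browser.links()
```

A fuzzer could then call `safe_links(browser)` after every `browser.open(...)` without special-casing non-HTML responses.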

    opened by npetrangelo 1
  • Typing annotations and typechecking with mypy or pyright?

    We already have basic static analysis with flake8 (and the underlying pyflakes), but using typing annotations and a static typechecker may 1) find more bugs, 2) help our users by providing completion and other smart features in their IDE.

    mypy is the historical typechecker, pyright is a more recent one which in my (very limited) experience works better (it's also the tool behind the new Python mode of VSCode). So I'd suggest pyright if we don't have arguments to choose mypy.

    For now, neither tool can typecheck the project without error, so a first step would be to add the necessary annotations to get an error-free pyright check.

    easy? 
    opened by moy 3
  • Can you build it without lxml?

    MechanicalSoup is a really nice package that I have used, but it still requires a C compiler to build lxml on *nix systems.

    This can be a problem when porting to platforms without a C compiler, such as Android or some minimal Linux systems.

    Currently I use a script to build MechanicalSoup without lxml:

    #!/bin/sh
    
    # Remove lxml in requirements.txt
    sed -i '/lxml/d' requirements.txt
    
    # Use `html.parser` instead of `lxml`
    sed -i "s@{'features': 'lxml'}@{'features': 'html.parser'}@g" mechanicalsoup/*.py
    
    # Fix examples and tests
    sed -i "s@\\(BeautifulSoup(.\\{1,\\}\\)'lxml'\\(.*)\\)@\1'html.parser'\2@g" examples/*.py tests/*.py
    

    It works well, so I think it is not a big problem...

    opened by urain39 2
  • Selecting a form that only has a class attribute

    I'm trying to get a form but it only has a class attribute and I'm continuously getting a "LinkNotFoundError". I've inspected the page and I know that I have the correct class name but it doesn't work at all and I don't see any real reference to this type of issue in the docs. I would try to get the form with BS4 but then there wouldn't be a way to select the form.

    I can attempt to get the form with BS4 then maybe add an id attribute to it then try selecting it with an id attribute?

    I'd really appreciate any help, thank you!

    question 
    opened by SilverStrings024 6
  • add_soup(): Don't match Content-type with `in`

    Don't use Python's in operator to match Content-Types, since that is a simple substring match.

    It's obviously not correct since a Content-Type string can be relatively complicated, like

    Content-Type: application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"

    Although that's rather contrived, the prior test "text/html" in response.headers.get("Content-Type", "") would return True here, incorrectly.

    Also, the existence of subtypes with +'s means that using the prior test for "application/xhtml" would match the above example when it probably shouldn't.

    Instead, leverage requests's code, which comes from the Python Standard Library's cgi.py.

    Clarify that we implement neither MIME sniffing nor X-Content-Type-Options: nosniff; instead, we do our own thing.
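
To make the difference concrete, the sketch below compares the substring test with a real header parse on the contrived header above. It deliberately uses the standard library's `email.message.Message` rather than the requests helper or `cgi.py` mentioned in this pull request, so it illustrates the idea and is not the patch itself; the helper name `parse_content_type` is hypothetical.

```python
from email.message import Message


def parse_content_type(header_value):
    """Return (media_type, charset) parsed from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


header = (
    'application/xhtml+xml; charset="utf-8"; '
    'boundary="This is not text/html"'
)

# Substring test: matches because "text/html" appears inside a parameter.
substring_match = "text/html" in header

# Parsed test: only the media type itself is compared.
media_type, charset = parse_content_type(header)
parsed_match = media_type == "text/html"

print(substring_match, parsed_match, media_type, charset)
```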


    I was looking at this code because of #373.

    I've marked this as a draft, because I'm not quite sure this is the way to go, both because of the long discursive comment and because of the use of a _ function from requests (versus cgi.py's parse_header()).

    Also, I'm kind of perplexed what's going on here:

                http_encoding = (
                    response.encoding
                    if 'charset' in parameters
                    else None
                )
    

    Like…why does the presence of charset=utf-8 in the Content-Type header mean that we should trust requests's encoding field? Oh, I see, it's because sometimes requests does some sniffing-ish-stuff and sometimes it doesn't (in which case it parses the Content-Type) and we need to know which, and we're backing out a conclusion about its heuristics? Probably seems like maybe we should parse it ourselves if so. idk.

    Maybe we should be doing more formal mime sniffing. And maybe we should be honoring X-Content-Type-Options: nosniff. And… … …

    I'm also not sure what kind of test coverage is really appropriate here, if anything additional. Seems like the answer shouldn't be "zero," so…

    opened by johnhawkinson 2
Releases(v1.2.0)
  • v1.2.0(Sep 17, 2022)

    Main changes

    • Added support for Python 3.10.

    • Added support for HTML form-associated elements (i.e. input elements that are associated with a form by a form attribute, but are not a child element of the form). [#380]

    Bug fixes

    • When uploading a file, only the filename is now submitted to the server. Previously, the full file path was being submitted, which exposed more local information than users may have been expecting. [#375]
  • v1.1.0(May 29, 2021)

    Main changes

    • Dropped support for EOL Python versions: 2.7 and 3.5.

    • Increased minimum version requirement for requests from 2.0 to 2.22.0 and beautifulsoup4 from 4.4 to 4.7.

    • Use encoding from the HTTP request when no HTML encoding is specified. [#355]

    • Added the put method to the Browser class. This is a light wrapper around requests.Session.put. [#359]

    • Don't override Referer headers passed in by the user. [#364]

    • StatefulBrowser methods follow_link and download_link now support passing a dictionary of keyword arguments to requests, via requests_kwargs. For symmetry, they also support passing Beautiful Soup args in as bs4_kwargs, although any excess **kwargs are sent to Beautiful Soup as well, just as they were previously. [#368]
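
A call using the new argument might look like the sketch below. The wrapper name `follow_link_unverified` is hypothetical, and `verify=False` is only the debugging scenario (an SSL-intercepting proxy) from the issue that motivated the change, not a recommended default.

```python
def follow_link_unverified(browser, url_pattern):
    """Follow the first link whose URL matches url_pattern, skipping TLS checks.

    `browser` is expected to be a mechanicalsoup.StatefulBrowser (v1.1.0 or
    later) that already has a page open.
    """
    return browser.follow_link(
        url_regex=url_pattern,              # used to find the <a> tag
        requests_kwargs={"verify": False},  # forwarded to requests
    )
```

With a real browser this would be called after `browser.open(...)`, for example as `follow_link_unverified(browser, r"/download")`.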

    Many thanks to the contributors who made this release possible!

  • v1.0.0(Jan 5, 2021)

    This is the last release that will support Python 2.7. Thanks to the many contributors that made this release possible!

    Main changes:

    • Added support for Python 3.8 and 3.9.

    • StatefulBrowser has new properties page, form, and url, which can be used in place of the methods get_current_page, get_current_form and get_url respectively (e.g. the new x.page is equivalent to x.get_current_page()). These methods may be deprecated in a future release. [#175]

    • StatefulBrowser.form will raise an AttributeError instead of returning None if no form has been selected yet. Note that StatefulBrowser.get_current_form() still returns None for backward compatibility.

    Bug fixes

    • Decompose <select> elements with the same name when adding a new input element to a form. [#297]

    • The params and data kwargs passed to submit will now properly be forwarded to the underlying request for GET methods (whereas previously params was being overwritten by data). [#343]

  • v0.12.0(Aug 27, 2019)

    Main changes:

    • Changes in official python version support: added 3.7 and dropped 3.4.

    • Added ability to submit a form without updating StatefulBrowser internal state: submit_selected(..., update_state=False). This means you get a response from the form submission, but your browser stays on the same page. Useful for handling forms that result in a file download or open a new tab.

    Bug fixes

    • Improve handling of form enctype to behave like a real browser. [#242]

    • HTML type attributes are no longer required to be lowercase. [#245]

    • Form controls with the disabled attribute will no longer be submitted to improve compliance with the HTML standard. If you were relying on this bug to submit disabled elements, you can still achieve this by deleting the disabled attribute from the element in the Form object directly. [#248]

    • When a form containing a file input field is submitted without choosing a file, an empty filename & content will be sent just like in a real browser. [#250]

    • <option> tags without a value attribute will now use their text as the value. [#252]

    • The optional url_regex argument to follow_link and download_link was fixed so that it is no longer ignored. [#256]

    • Allow duplicate submit elements instead of raising a LinkNotFoundError. [#264]

    Our thanks to the many new contributors in this release!

  • v0.11.0(Sep 11, 2018)

    This release focuses on fixing bugs related to uncommon HTTP/HTML scenarios and on improving the documentation.

    Bug fixes

    • Constructing a Form instance from a bs4.element.Tag whose tag name is not form will now emit a warning, and may be deprecated in the future. [#228]

    • Breaking Change: LinkNotFoundError now derives from Exception instead of BaseException. While this will bring the behavior in line with most people's expectations, it may affect the behavior of your code if you were heavily relying on this implementation detail in your exception handling. [#203]

    • Improve handling of button submit elements. Will now correctly ignore buttons of type button and reset during form submission, since they are not considered to be submit elements. [#199]

    • Do a better job of inferring the content type of a response if the Content-Type header is not provided. [#195]

    • Improve consistency of query string construction between MechanicalSoup and web browsers in edge cases where form elements have duplicate name attributes. This prevents errors in valid use cases, and also makes MechanicalSoup more tolerant of invalid HTML. [#158]

  • v0.10.0(Feb 4, 2018)

    Main changes:

    • Added StatefulBrowser.refresh() to reload the current page with the same request. [#188]

    • StatefulBrowser.follow_link, StatefulBrowser.submit_selected() and the new StatefulBrowser.download_link now set the Referer: HTTP header to the page from which the link is followed. [#179]

    • Added method StatefulBrowser.download_link, which will download the contents of a link to a file without changing the state of the browser. [#170]

    • The selector argument of Browser.select_form can now be a bs4.element.Tag in addition to a CSS selector. [#169]

    • Browser.submit and StatefulBrowser.submit_selected accept a larger number of keyword arguments. Arguments are forwarded to requests.Session.request. [#166]

    Internal changes:

    • StatefulBrowser.choose_submit will now ignore input elements that are missing a name-attribute instead of raising a KeyError. [#180]

    • Private methods Browser._build_request and Browser._prepare_request have been replaced by a single method Browser._request. [#166]

  • v0.9.0(Nov 2, 2017)

    Main changes:

    • We do not rely on BeautifulSoup's default choice of HTML parser. Instead, we now specify lxml as default. As a consequence, the default setting requires lxml as a dependency.

    • Python 2.6 and 3.3 are no longer supported.

    • The GitHub URL moved from https://github.com/hickford/MechanicalSoup/ to https://github.com/MechanicalSoup/MechanicalSoup. @moy and @hemberger are now officially administrators of the project in addition to @hickford, the original author.

    • We now have a documentation site: https://mechanicalsoup.readthedocs.io/. The API is now fully documented, and we have included a tutorial, several more code examples, and a FAQ.

    • StatefulBrowser.select_form can now be called without argument, and defaults to "form" in this case. It also has a new argument, nr (defaults to 0), which can be used to specify the index of the form to select if multiple forms match the selection criteria.

    • We now use requirement files. You can install the dependencies of MechanicalSoup with e.g.:

      pip install -r requirements.txt -r tests/requirements.txt

    • The Form class was restructured and has a new API. The behavior of existing code is unchanged, but a new collection of methods has been added for clarity and consistency with the set method:

      • set_input deprecates input
      • set_textarea deprecates textarea
      • set_select is new
      • set_checkbox and set_radio together deprecate check (checkboxes are handled differently by default)
    • A new Form.print_summary method allows you to write browser.get_current_form().print_summary() to get a summary of the fields you need to fill-in (and which ones are already filled-in).

    • The Form class now supports selecting multiple options in a <select multiple> element.

    Bug fixes

    • Checking checkboxes with browser["name"] = ("val1", "val2") now unchecks all checkboxes except the ones explicitly specified.

    • StatefulBrowser.submit_selected and StatefulBrowser.open now reset __current_page to None when the result is not an HTML page. This fixes a bug where __current_page was still the previous page.

    • We don't error out anymore when trying to uncheck a box which doesn't have a checkbox attribute.

    • Form.new_control now correctly overrides existing elements.

    Internal changes

    • The testsuite has been further improved and reached 100% coverage.

    • Tests are now run against the local version of MechanicalSoup, not against the installed version.

    • Browser.add_soup will now always attach a soup-attribute. If the response is not text/html, then soup is set to None.

    • Form.set(force=True) creates an <input type=text ...> element instead of an <input type=input ...>.

  • v0.8.0(Oct 1, 2017)

    Main changes:

    • Browser and StatefulBrowser can now be configured to raise a LinkNotFoundError exception when encountering a 404 Not Found error. This is activated by passing raise_on_404=True to the constructor. It is disabled by default for backward compatibility, but is highly recommended.

    • Browser now has a __del__ method that closes the current session when the object is deleted.

    • A Link object can now be passed to follow_link.

    • The user agent can now be customized. The default includes MechanicalSoup and its version.

    • There is now a direct interface to the cookiejar in *Browser classes ((set|get)_cookiejar methods).

    • This is the last MechanicalSoup version supporting Python 2.6 and 3.3.

    Bug fixes:

    • We used to crash on forms without action="..." fields.

    • The choose_submit method has been fixed, and the btnName argument of StatefulBrowser.submit_selected is now a shortcut for using choose_submit.

    • Arguments to open_relative were not properly forwarded.

    Internal changes:

    • The testsuite has been greatly improved. It now uses the pytest API (not only the pytest launcher) for more concise code.

    • The coverage of the testsuite is now measured with codecov.io. The results can be viewed on: https://codecov.io/gh/hickford/MechanicalSoup

    • We now have a requires.io badge to help us track issues with dependencies. The report can be viewed on: https://requires.io/github/hickford/MechanicalSoup/requirements/

    • The version number now appears in a single place in the source code.

  • v0.7.0(May 7, 2017)

    Summary of changes:

    • New class StatefulBrowser, that keeps track of the currently visited page to make the calling code more concise.

    • A new launch_browser method in Browser and StatefulBrowser, that allows launching a browser on the currently visited page for easier debugging.

    • Many bug fixes.

    Release on PyPI: https://pypi.python.org/pypi/MechanicalSoup/0.7.0

  • v0.4.0(Nov 24, 2015)
