A webmining CLI tool & library for python.

Overview

Minet

minet is a webmining command line tool & library for Python (>= 3.6) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud, etc.

It adopts a very simple approach to various webmining problems by letting you perform a variety of actions from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet exposes its high-level programmatic interface as a Python library so you can tweak its behavior at will.

Shortcuts: Command line documentation, Python library documentation.

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, with Facebook- and YouTube-specific details)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search:
  • Scrape (without requiring special access):
  • Grab & dump cookies from your browser
  • Dump Hyphe data
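
As an illustration of the first item above, URL extraction from free text can be approximated with a regular expression. The snippet below is a minimal standard-library sketch, not minet's implementation: the pattern and the extract_urls helper are assumptions made for this example, and real-world extraction needs more careful heuristics.

```python
import re

# Deliberately simple pattern (an assumption for this sketch): an http(s) URL
# runs until the next whitespace or common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the http(s) URLs found in `text`, in order of appearance."""
    # Trailing punctuation usually belongs to the sentence, not to the URL.
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


sample = "See https://example.com/a?b=1, then (http://example.org/page)."
print(extract_urls(sample))
```

The same helper can be applied to each row of a CSV column to build a table of extracted URLs.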

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling using a comfy DSL.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages using a comfy DSL.
  • URL-related heuristics utilities such as extraction, normalization and matching.
  • Data collection from various APIs such as CrowdTangle.
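
To give an idea of what URL normalization and matching involve, here is a minimal sketch using only the Python standard library. It does not use minet's own utilities: normalize_url and same_page are hypothetical helpers, and the rules they apply (ignore the scheme, a leading "www." and a trailing slash) are simplifying assumptions.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a rough comparison key (illustrative heuristics only)."""
    parts = urlsplit(url.strip())
    # `hostname` is already lowercased by urlsplit.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Treat "/path" and "/path/" as the same page.
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def same_page(a, b):
    """Return True when both URLs share the same normalized key."""
    return normalize_url(a) == normalize_url(b)


print(normalize_url("https://www.Example.com/path/"))
print(same_page("http://example.com/path", "https://www.example.com/path/"))
```

A key of this kind is what makes it possible to join two CSV files on their URL columns even when the same page is written in slightly different ways.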

Installation

minet can be installed as a standalone CLI tool (currently only on macOS >= 10.14, Ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of an HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On Ubuntu & similar, you might need to install curl and unzip before running the installation script if you don't already have them:

sudo apt-get install curl unzip

Otherwise, minet can be installed directly as a Python CLI tool and library using pip:

pip install minet
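
After a pip installation, one way to check that the package is visible to your interpreter is to query its installed version. This is a small sketch under assumptions: installed_version is a hypothetical helper, and importlib.metadata requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if minet is installed in this environment, else None.
print(installed_version("minet"))
```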

If you need more help to install and use minet from scratch, you can check these installation documents.

Finally, if you want to install the standalone binaries yourself (even for Windows), you can find them in each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

How to cite

minet is published on Zenodo (DOI: 10.5281/zenodo.4564399).

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, & Amélie Pellé. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Comments
  • casanova.exceptions.EmptyFileError

    I am trying to run minet in a github action. It fails with the following message:

      minet tw scrape tweets -o tweets.csv "from:@taniki #tutotal2022"
      shell: /usr/bin/bash -e {0}
      env:
        pythonLocation: /opt/hostedtoolcache/Python/3.9.5/x64
        LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.5/x64/lib
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 151, in __init__
        fieldnames = next(self.reader)
    StopIteration
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/__init__.py", line 33, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/scrape.py", line 45, in twitter_scrape_action
        enricher = casanova.enricher(
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/enricher.py", line 31, in __init__
        super().__init__(input_file, no_headers=no_headers, **kwargs)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 157, in __init__
        raise EmptyFileError
    casanova.exceptions.EmptyFileError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Error: Process completed with exit code 1.
    
    opened by taniki 16
  • Get Retweeters

    Hi, thanks for the last release. I'm glad to see there is a Retweeters tool, but I have been running into issues with it for a few days. I may not have understood how it should be used. When I run it, I get an error (screenshot). Could someone who has managed to use it help me?

    Thank you

    opened by jlbreeeez 15
  • Twitter API scraper: acquire guest_token by API

    New method to acquire the guest_token through the activate API. Relates to #384 and #382.

    Method taken from @JustAnotherArchivist in snscrape see: https://github.com/JustAnotherArchivist/snscrape/commit/0336ce13edbd195b3e91487061a0e7a2857f0c68 Thanks for sharing the solution.

    For now this edit is simply a new method to acquire the token. The token is used as a cookie as before but it's not preserved on disk in case of multiple calls.

    opened by paulgirard 11
  • tw scrape fails on some queries due to Over capacity error

    minet tw scrape tweets '#5gcovid' > tweets.csv

    <class 'minet.twitter.exceptions.TwitterPublicAPIInvalidResponseError'>

    {'errors': [{'message': 'Over capacity', 'code': 130}]} 503

    bug 
    opened by Yomguithereal 10
  • [retweeters] KeyError: 'url'

    Hi, when I try to retrieve the list of retweeters from a file containing tweets previously extracted from Twitter with the minet scraper, I get this error after a few tweets from my list have been scanned (after 7, 10 or 30 tweets, depending on the database). Has anyone encountered this error before? Thanks for helping :-) (screenshot)

    opened by tloops329384 8
  • Unable to extract all the tweets of a query

    When I run a query with a keyword plus a user as criteria, the result is very random: sometimes 0 tweets, sometimes 1 tweet, sometimes 20 tweets, sometimes 80 tweets, etc., without ever reaching a complete extraction (even though the total is only about 200 tweets). I have rerun this query many times without ever extracting all of the tweets in question.

    What should I do to achieve this? Thanks

    opened by parisGH 8
  • [twitter] unable to get user tweets

    Hello,

    Thanks for sharing the lib with the community. I am not able to get user tweets; I get this error:

    Traceback (most recent call last):
      File "/home/bafou/.local/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/__main__.py", line 198, in main
        to_close = resolve_arg_dependencies(cli_args, config)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 290, in resolve_arg_dependencies
        setattr(cli_args, name, value.resolve(config))
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 253, in resolve
        return getpath(config, self.key, self.default)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/ebbe/utils.py", line 72, in getpath
        target = target[step]
    TypeError: string indices must be integers
    

    when executing minet tw user-tweets screen_name users.csv > tweets.csv with users.csv

    Regards.

    bug 
    opened by billmetangmo 6
  • GH actions + Minet Scrap Twitter fail.

    Hi,

    I have this GH action to generate a Twitter scrape CSV (written by @taniki):

    name: scrape bfm
    
    on:
      workflow_dispatch:
      schedule:
        - cron:  '0 9 * * *'
    
    jobs:
      scrape_bfm:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/[email protected]
          - uses: actions/[email protected]
            with:
              python-version: '3.x'
          - name: install minet
            run: |
              python -m pip install --upgrade pip
              pip install minet==0.56.2
          - name: scrape @BFMTV tweets
            shell: bash
            run: |
              minet tw scrape tweets "from:@BFMTV since:2021-09-01" > bfmtv-tweets.csv
          - name: commit
            uses: ./.github/actions/commit
            with:
              message: lol @bfmtv
    

    Sometimes it works without a problem. Sometimes GH returns this error log:

    Run minet tw scrape tweets "from:@CNEWS since:2021-09-01" > cnews-tweets.csv
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                            
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                   
    Searching for "from:@CNEWS since:2021-09-01"
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.10.1/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/__init__.py", line 31, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/scrape.py", line 69, in twitter_scrape_action
        for tweet, meta in iterator:
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 370, in search
        new_cursor, tweets = retryer(self.request_search, query, cursor, refs=refs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
        do = self.iter(retry_state=retry_state)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 349, in iter
        return fut.result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 438, in result
        return self.__get_result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 407, in __call__
        result = fn(*args, **kwargs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 72, in wrapped
        self.acquire_guest_token()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 261, in acquire_guest_token
        raise TwitterGuestTokenError
    minet.twitter.exceptions.TwitterGuestTokenError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]
    Error: Process completed with exit code 1.
    

    I don't understand. Has anyone had the same problem? Does Twitter sometimes ban GH?

    Thanks for Minet, great tool!

    opened by stefw 6
  • Access denied

    Foreword: sorry, I am new to GitHub and not sure this is the appropriate place to post my question. Is it?

    Hi. First, thank you for the tool, which will help me a lot in my research! I have a problem that I think is not that complicated: when I run Minet to get the "friends" of the twitter_users contained in the data_users.csv file, I cannot access the file and get "Permission Denied" (screenshot). I tried opening the CMD as an Administrator, but it didn't solve the problem. Can you help me?

    opened by jlbreeeez 6
  • error in installing pip install mineit

    Installing mineit via pip does not work. It says: "Collecting mineit Could not install packages due to an EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/mineit/"

    Is this issue already solved?

    opened by moonisali 6
  • Twitter scrape: systematic TwitterGuestTokenError with v0.56.2 or v0.56.1

    As in #382, I experience systematic TwitterGuestTokenError exceptions. This was not the case a few weeks ago. I didn't test versions other than 0.56.1 and 0.56.2.

    Looks like we need to review the twitter scrape heuristic. I will try to have a look later today or tomorrow.

    bug 
    opened by paulgirard 5
  • instagram

    • [ ] get comments from a post id: https://www.instagram.com/api/v1/media/POST_ID/comments/?can_support_threading=true&permalink_enabled=false
    • [x] get user info from username: https://i.instagram.com/api/v1/users/web_profile_info/?username=USERNAME
    • [ ] other route for posts associated with hashtag (more info but don't know how to change page): https://www.instagram.com/api/v1/tags/web_info/?tag_name=HASHTAG
    • [ ] get post info from post id: https://www.instagram.com/api/v1/media/POST_ID/info/
    • [ ] get post likers from post id (it seems that we can only have access to a limited number of them): https://www.instagram.com/api/v1/media/POST_ID/likers/

    Need 'cookie' and 'x-ig-app-id'

    enhancement 
    opened by MiguelLaura 0
Releases(0.66.1)
Owner
médialab Sciences Po
SciencesPo's médialab is an interdisciplinary research laboratory gathering engineers, designers & social science researchers.