A Python package that can be used to download post and comment data from Reddit.

Overview

Reddit Data Collector

Reddit Data Collector is a Python package that allows a user to collect post and comment data from Reddit. It is built on top of the Python module PRAW, which stands for "The Python Reddit API Wrapper". It aims to make it very simple for a user to collect data from Reddit for further analysis (e.g. Natural Language Processing), without having to learn the inner workings of PRAW or the Reddit API.

The main functionalities provided by the package currently include:

  1. Ability to collect a sample of post data and comment data from Reddit by simply providing the subreddit names that you wish to collect data from.

  2. Ability to convert that data into a pandas DataFrame in order to inspect it and save it for further use.

  3. Ability to seamlessly update an existing .csv file that contains some sample data collected with the package in the past, with some new sample data that is also collected with the package.

It is currently maintained by Nico Van den Hooff.

Installation

Dependencies

Reddit Data Collector requires Python and:

  • pandas (>=1.3.5)
  • praw (>=7.5.0)
  • tqdm (>=4.62.3)

User installation

The recommended way to install Reddit Data Collector is using pip:

pip install reddit-data-collector

How to Use Reddit Data Collector

Please see the examples directory for step by step instructions on how to use Reddit Data Collector.

Development

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/nicovandenhooff/reddit-data-collector.git

Contributing

To learn more about making a contribution to Reddit Data Collector, please see the contributing file.

Potential Ideas for Contribution

  • Add ability to collect images from Reddit posts that contain them.
  • Add author information to post and comment data, currently the Reddit API is inconsistent with suspended and deleted author data, so this functionality has not been built in yet.

Testing

After installation, you can launch the test suite, which is contained in the tests/tests.py. Note that you will have to have pytest >= 6.2.5 installed. You can launch the test suite by following these steps from the projects root directory:

  1. Open up tests.py with the following command:
open tests/tests.py

Comment out lines 24 to 30. Change the values in DataCollector() in line 32 to your Reddit credentials.

  1. Run the following command:
pytest tests/test.py

Project History

The project was started in January 2022 by Nico Van den Hooff as a side project while he was completing the UBC Master of Data Science Project. Nico wanted to obtain a sample of posts and comments from Reddit, but noticed that while PRAW existed and provided seamless access to Reddit's API, there was no package available that allowed for a simple method to collect this data.

Inspiration

Certain sections of this README file was inspired by the scikit-learn README.

You might also like...
Auto Join: A GitHub action script to automatically invite everyone to the organization who comment at the issue page.

Auto Invite To Org By Issue Comment A GitHub action script to automatically invite everyone to the organization who comment at the issue page. What is

Auto Liker, Auto Reaction, Auto Comment, Auto Follower Tool. RajeLiker Credit Hacker.
Auto Liker, Auto Reaction, Auto Comment, Auto Follower Tool. RajeLiker Credit Hacker.

Auto Liker, Auto Reaction, Auto Comment, Auto Follower Tool. RajeLiker Credit Hacker. Unlimited RajeLiker Credit Hack. Thanks To RajeLiker.

A simple Discord bot that can fetch definitions and post them in chat.
A simple Discord bot that can fetch definitions and post them in chat.

A simple Discord bot that can fetch definitions and post them in chat. If you are connected to a voice channel, the bot will also read out the definition to you.

A simple fun discord bot using discord.py that can post memes

A simple fun discord bot using discord.py * * Commands $commands - to see all commands $meme - for a random meme from the internet $cry - to make the

One version package to rule them all, One version package to find them, One version package to bring them all, and in the darkness bind them.

AwesomeVersion One version package to rule them all, One version package to find them, One version package to bring them all, and in the darkness bind

A simple script & container to pull COVID data from covidlive.com.au and post a summary to a slack channel
A simple script & container to pull COVID data from covidlive.com.au and post a summary to a slack channel

CovidLive AU Summary Slackbot This bot is a very simple slackbot that pulls data, summarises and posts up to date AU COVID stats to a provided slack c

Track live sentiment for stocks from Reddit and Twitter and identify growing stocks
Track live sentiment for stocks from Reddit and Twitter and identify growing stocks

Market Sentiment About This repository can mainly be used for two things. a. Tracking the live sentiment of stocks from Reddit and Twitter b. Tracking

A reddit.com bot that will return reference links from official python documentation site for the standard library.

Python Docs Bot A reddit.com bot that will return documentation links for the library and language reference sections of the python docs website. The

A Python bot that uses the Reddit API to send users inspiring messages.

AnonBot By Edric Antoine A Python bot that uses the Reddit API to send users inspiring messages. When a message includes 'What would Anon do?', the bo

Releases(v1.1.0)
  • v1.1.0(Mar 14, 2022)

    [1.1.0] - 2022-03-13

    Changed

    • Changed get_data in reddit_data_collector.py to return pandas DataFrame by default
    • Updated tests for the above
    Source code(tar.gz)
    Source code(zip)
  • v1.0.2(Jan 15, 2022)

    [1.0.2] - 2022-01-14

    Fixed

    • Updated _check_subreddit_exists in reddit_data_collector.py to check both names as .lower()
    • Updated tests for the above

    Changed

    • Updated README to include instructions on coverage tests
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Jan 12, 2022)

    [1.0.1] - 2022-01-12

    Fixed

    • Spelling error of separate argument in to_pandas function of reddit_data_collector.io.py, previously it was spelt like seperate

    Changed

    • Update example use and move to /examples
    • Update PyPi link in docs to working link
    • Add new potential ideas for contribution
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Jan 7, 2022)

Owner
Nico Van den Hooff
UBC Master of Data Science
Nico Van den Hooff
A Python AWS Lambda Webhook listener that generates a permanent URL when an asset is created in Contentstack.

Webhook Listener A Python Lambda Webhook Listener - Generates a permanent URL on created assets. See doc on Generating a Permanent URL: https://www.co

Contentstack Solutions 1 Nov 04, 2021
一个基于Python3的Bot。目前支持以Docker的方式部署在vps上。支持Aria2、本子下载、网易云音乐下载、Pixiv榜单下载、Youtue-dl支持、搜图。

介绍 一个基于Python3的Bot。目前支持以Docker的方式部署在vps上。 主要功能: 文件管理 修改主界面为 filebrowser,账号为admin,密码为admin,主界面路径:http://ip:port,请自行修改密码 FolderMagic自带的webdav:路径:http://

Ben 650 Jan 08, 2023
Yes, it's true :yellow_heart: This repository has 326 stars.

Yes, it's true! Inspired by a similar repository from @RealPeha, but implemented using a webhook on AWS Lambda and API Gateway, so it's serverless! If

510 Dec 28, 2022
Bot interpretation of the carbon.now.sh site

📒 Source code of the @PicodeBot 🧸 Developer: @hoosnick Run $ git clone https://github.com/hoosnick/picodebot.git $ pip install -r requirements.txt P

Husniddin Murodov 13 Oct 02, 2022
Gdrive-python: A wrapping module in python of gdrive

gdrive-python gdrive-python is a wrapping module in python of gdrive made by @pr

Vittorio Pippi 3 Feb 19, 2022
Irenedao-nft-generator - Original scripts used to generate IreneDAO NFTs

IreneDAO NFT Generator Scripts to generate IreneDAO NFT. Make sure you have Pill

libevm 60 Oct 27, 2022
A lightweight Python wrapper for the IG Markets API

trading_ig A lightweight Python wrapper for the IG Markets API. Simplifies access to the IG REST and Streaming APIs with a live or demo account. What

IG Python 247 Dec 08, 2022
A project in order to analyze user's favorite musics, artists and genre

Spotify-Wrapped This is a project about Spotify Wrapped (which is an extra option for premium accounts, but you don't need to be premium here) This pr

Hossein Mohseni 19 Jan 04, 2023
A stock information collector and parser for Taiwan and US market. Automatically send LINE message if the pre-defined rules are triggered.

agastock 開發動機 就在海運飆漲的2021年7月,差點跪在地上喜迎財富自由的當下,EPS超高好消息不斷的長榮竟然套在202元一去不回,有圖有真相(哭) 忽然體會到追高殺低不是辦法,魯蛇我得靠邏輯分析也能出頭天,經過三個月無數個不出門的周末,產出簡單的爬蟲和分析工具。 上過金融研訓院的量化交易

Gavin Lee 12 Nov 16, 2022
Powerful Url uploader bot With Mongodb support 🔥

Uploader X Bot Telegram RoBot to Upload Links. Features: 👉 Upload YouTube-dl Supported Links to Telegram. 👉 Upload HTTP/HTTPS as File/Video to Teleg

C͡linton Abraꫝam 250 Jan 06, 2023
A Telegram Bot written in Python for mirroring files on the Internet to your Google Drive

No support is going to be provided of any kind, only maintaining this for vps user on request. This is a Telegram Bot written in Python for mirroring

Sunil Kumar 42 Oct 28, 2022
A comand-line utility for taking automated screenshots of websites

shot-scraper A comand-line utility for taking automated screenshots of websites For background on this project see shot-scraper: automated screenshots

Simon Willison 837 Jan 07, 2023
Infrastructure template and Jupyter notebooks for running RoseTTAFold on AWS Batch.

AWS RoseTTAFold Infrastructure template and Jupyter notebooks for running RoseTTAFold on AWS Batch. Overview Proteins are large biomolecules that play

AWS Samples 20 May 10, 2022
SickNerd aims to slowly enumerate Google Dorks via the googlesearch API then requests found pages for metadata

CLI tool for making Google Dorking a passive recon experience. With the ability to fetch and filter dorks from GHDB.

Jake Wnuk 21 Jan 02, 2023
The Sue Gray Alert System was a 5 minute project that just beeps every time a new article is updated or published on Gov.UK's news pages.

The Sue Gray Alert System was a 5 minute project that just beeps every time a new article is updated or published on Gov.UK's news pages.

Dafydd 1 Jan 31, 2022
🖥️ Python - P1 Monitor API Asynchronous Python Client

🖥️ Asynchronous Python client for the P1 Monitor

Klaas Schoute 9 Dec 12, 2022
Balsam Python client API & SDK

balsam No description provided (generated by Openapi Generator https://github.com/openapitools/openapi-generator) This Python package is automatically

Darren Govoni 1 Oct 22, 2021
pylunasvg - Python bindings for lunasvg

pylunasvg - Python bindings for lunasvg Pylunasvg is a simple wrapper around lunasvg that uses pybind11 to create python bindings. All public API of t

Eren 6 Jan 05, 2023
A Telegram Userbot to play Audio and Video songs / files in Telegram Voice Chats

TG-MusicPlayer A Telegram Userbot to play Audio and Video songs / files in Telegram Voice Chats. It's made with PyTgCalls and Pyrogram Requirements Py

4 Jul 30, 2022
This library is for simplified work with the sms-man.com API

SMSMAN Public API Python This is a lightweight library that works as a connector to Sms-Man public API Installation pip install smsman Documentation h

13 Nov 19, 2022