RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can download the images for the officially released annotations, and can also collect additional image-text data from any subreddit over an arbitrary time span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in the recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── <subreddit>_<year>.json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    │   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    │   └── ...
    │
    ├── <subreddit>/
    │   ├── <image_id>.jpg
    │   └── ...
    └── ...

  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
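    # Quote the URL and name the output file, otherwise wget saves it as "redcaps_v1.0_annotations.zip?dl=1".
    wget -O redcaps_v1.0_annotations.zip "https://www.dropbox.com/s/twvv541sbg0qqux/redcaps_v1.0_annotations.zip?dl=1"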
    unzip redcaps_v1.0_annotations.zip
  3. Download images using the redcaps download-imgs command. It processes a single annotation file, so loop over all annotation files (run this from the code directory, not from datasets/redcaps):

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
        # Resizing the shorter edge saves disk space. Set --resize -1 to turn it off.
    done

    Parallelize the download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each of which has its own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.
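
    One possible way to split the work across machines is to give each machine every N-th annotation file. The sketch below is only an illustration: shard_files, MACHINE_ID and NUM_MACHINES are hypothetical names and not part of this tool. The download-imgs call in the usage comment is the same one shown in step 3.

```shell
# Sketch: print every NUM_MACHINES-th file, starting at index MACHINE_ID.
# shard_files is an illustrative helper, not a redcaps command.
shard_files() {
    local machine_id="$1"
    local num_machines="$2"
    shift 2
    local i=0
    local f
    for f in "$@"; do
        # Keep the file only if its index falls in this machine's shard.
        if [ $((i % num_machines)) -eq "$machine_id" ]; then
            printf '%s\n' "$f"
        fi
        i=$((i + 1))
    done
}

# Usage on each machine, with a different MACHINE_ID in 0..NUM_MACHINES-1:
# shard_files "$MACHINE_ID" "$NUM_MACHINES" ./datasets/redcaps/annotations/*.json |
#     while IFS= read -r ann_file; do
#         redcaps download-imgs -a "$ann_file" --save-to path/to/images --resize 512 -j 4
#     done
```

    Sharding by file keeps each subreddit-year on a single machine, so no two machines write images for the same annotation file.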

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. The same steps apply to any subreddit and any set of years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "Your user agent here"
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide Client ID and Client Secret tokens; fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to any string (e.g. your name) and keep it unchanged afterwards.

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models
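
Before collecting data, it can help to confirm that ./credentials.json exists and no longer contains template values. The sketch below is not part of this tool: check_credentials is a hypothetical helper, and it assumes unfilled values still match the "Your ... here" pattern from the template.

```shell
# Sketch: fail if a credentials file is missing or still holds template placeholders.
# check_credentials is an illustrative helper, not a redcaps command.
check_credentials() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "missing: $file" >&2
        return 1
    fi
    # Template values look like "Your client ID here".
    if grep -q '"Your [^"]* here"' "$file"; then
        echo "unfilled placeholders remain in $file" >&2
        return 1
    fi
    return 0
}

# Usage:
# check_credentials ./credentials.json && echo "credentials look filled in"
```

This only checks that the placeholders were replaced; it does not verify that the keys are accepted by Reddit or Imgur.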

Data collection from r/roses (2020)

  1. download-anns: Download annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020 (zero-padded, as in 2020-01):
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in the official release, as some posts may have been deleted over time. After this step, the dataset directory will contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is the same as step 3 in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: The steps above produce a single annotation file (and the downloaded images) similar to the official RedCaps annotations. To double-check this, run the following command; it should print no errors.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json
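
For other subreddits or years, the seven steps above can be chained in one shell function. This is a minimal sketch rather than an official script: collect_subreddit and its optional third argument (the dataset root) are hypothetical, and it assumes monthly files are named like roses_2020-01.json, as in the listing above. It uses only the commands and flags already shown in this section.

```shell
# Sketch: run the collection steps above for one subreddit and one year.
# collect_subreddit is an illustrative wrapper, not a redcaps command.
collect_subreddit() {
    local subreddit="$1"
    local year="$2"
    local root="${3:-./datasets/redcaps}"   # assumed dataset root
    local anns="$root/annotations/${subreddit}_${year}.json"
    local month

    # Step 1: one annotation file per month.
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        redcaps download-anns --subreddit "$subreddit" --month "$year-$month" \
            -o "$root/annotations" || return 1
    done
    # Step 2: merge monthly files into a single yearly file.
    redcaps merge "$root/annotations/${subreddit}_${year}"-* -o "$anns" --delete-old || return 1
    # Step 3: download images and drop annotations without images.
    redcaps download-imgs --annotations "$anns" --resize 512 -j 4 \
        -o "$root/images" --update-annotations || return 1
    # Steps 4-6: filters modify the annotation file in-place.
    redcaps filter-words --annotations "$anns" --images "$root/images" || return 1
    redcaps filter-nsfw --annotations "$anns" --images "$root/images" \
        --model "$root/models/nsfw.299x299.h5" || return 1
    redcaps filter-faces --annotations "$anns" --images "$root/images" || return 1
    # Step 7: final check.
    redcaps validate --annotations "$anns"
}

# Usage:
# collect_subreddit roses 2020
```

The function stops at the first failing command, so a failed download does not lead to filtering or validating a partial dataset.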

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}