Command-line tool for downloading and extending the RedCaps dataset.

Overview

RedCaps Downloader

This repository provides the official command-line tool for downloading and extending the RedCaps dataset. Users can seamlessly download images of officially released annotations as well as download more image-text data from any subreddit over an arbitrary time-span.

Installation

This tool requires Python 3.8 or higher. We recommend using conda for setup. Download Anaconda or Miniconda first. Then follow these steps:

# Clone the repository.
git clone https://github.com/redcaps-dataset/redcaps-downloader
cd redcaps-downloader

# Create a new conda environment.
conda create -n redcaps python=3.8
conda activate redcaps

# Install dependencies along with this code.
pip install -r requirements.txt
python setup.py develop

Basic usage: Download official RedCaps dataset

We expect most users will only require this functionality. Follow these steps to download the official RedCaps annotations and images and arrange all the data in recommended directory structure:

/path/to/redcaps/
├── annotations/
│   ├── abandoned_2017.json
│   ├── abandoned_2017.json
│   ├── ...
│   ├── itookapicture_2019.json
│   ├── itookapicture_2020.json
│   ├── 
   
    _
    
     .json
│   └── ...
│
└── images/
    ├── abandoned/
    │   ├── guli1.jpg
    |   └── ...
    │
    ├── itookapicture/
    │   ├── 1bd79.jpg
    |   └── ...
    │
    ├── 
     
      /
    │   ├── 
      
       .jpg
    │   ├── ...
    └── ...

      
     
    
   
  1. Create an empty directory and symlink it relative to this code directory:

    cd redcaps-downloader
    
    # Edit path here:
    mkdir -p /path/to/redcaps
    ln -s /path/to/redcaps ./datasets/redcaps
  2. Download official RedCaps annotations from Dropbox and unzip them.

    cd datasets/redcaps
    wget https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1
    unzip redcaps_v1.0_annotations.zip
  3. Download images by using redcaps download-imgs command (for a single annotation file).

    for ann_file in ./datasets/redcaps/annotations/*.json; do
        redcaps download-imgs -a $ann_file --save-to path/to/images --resize 512 -j 4
        # Set --resize -1 to turn off resizing shorter edge (saves disk space).
    done

    Parallelize download by changing -j. RedCaps images are sourced from Reddit, Imgur and Flickr, each have their own request limits. This code contains approximate sleep intervals to manage them. Use multiple machines (= different IP addresses) or a cluster to massively parallelize downloading.

That's it, you are all set to use RedCaps!

Advanced usage: Create your own RedCaps-like dataset

Apart from downloading the officially released dataset, this tool supports downloading image-text data from any subreddit – you can reproduce the entire collection pipeline as well as create your own variant of RedCaps! Here, we show how to collect annotations from r/roses (2020) as an example. Follow these steps for any subreddit and years.

Additional one-time setup instructions

RedCaps annotations are extracted from image post metadata, which are served by the Pushshift API and official Reddit API. These APIs are authentication-based, and one must sign up for developer access to obtain API keys (one-time setup):

  1. Copy ./credentials.template.json to ./credentials.json. Its contents are as follows:

    : " }, "imgur": { "client_id": "Your client ID here", "client_secret": "Your client secret here" } } ">
    {
        "reddit": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here",
            "username": "Your Reddit username here",
            "password": "Your Reddit password here",
            "user_agent": "
          
           : 
           "
          
        },
        "imgur": {
            "client_id": "Your client ID here",
            "client_secret": "Your client secret here"
        }
    }
  2. Register a new Reddit app here. Reddit will provide a Client ID and Client Secret tokens - fill them in ./credentials.json. For more details, refer to the Reddit OAuth2 wiki. Enter your Reddit account name and password in ./credentials.json. Set User Agent to anything and keep it unchanged (e.g. your name).

  3. Register a new Imgur App by following instructions here. Fill the provided Client ID and Client Secret in ./credentials.json.

  4. Download pre-trained weights of an NSFW detection model.

    wget https://s3.amazonaws.com/nsfwdetector/nsfw.299x299.h5 -P ./datasets/redcaps/models

Data collection from r/roses (2020)

  1. download-anns: Dowload annotations of image posts made in a single month (e.g. January).

    redcaps download-anns --subreddit roses --month 2020-01 -o ./datasets/redcaps/annotations
    
    # Similarly, download annotations for all months of 2020:
    for ((month = 1; month <= 12; month += 1)); do
        redcaps download-anns --subreddit roses --month 2020-$month -o ./datasets/redcaps/annotations
    done
    • NOTE: You may not get all the annotations present in official release as some of them may have disappeared (deleted) over time. After this step, the dataset directory would contain 12 annotation files:
        ./datasets/redcaps/
        └── annotations/
            ├── roses_2020-01.json
            ├── roses_2020-02.json
            ├── ...
            └── roses_2020-12.json
    
  2. merge: Merge all the monthly annotation files into a single file.

    redcaps merge ./datasets/redcaps/annotations/roses_2020-* \
        -o ./datasets/redcaps/annotations/roses_2020.json --delete-old
    • --delete-old will remove individual files after merging. After this step, the merged file will replace individual monthly files:
        ./datasets/redcaps/
        └── annotations/
            └── roses_2020.json
    
  3. download-imgs: Download all images for this annotation file. This step is same as (3) in basic usage.

    redcaps download-imgs --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --resize 512 -j 4 -o ./datasets/redcaps/images --update-annotations
    • --update-annotations removes annotations whose images were not downloaded.
  4. filter-words: Filter all instances whose captions contain potentially harmful language. Any caption containing one of the 400 blocklisted words will be removed. This command modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-words --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images
  5. filter-nsfw: Remove all instances having images that are flagged by an off-the-shelf NSFW detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-nsfw --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images \
        --model ./datasets/redcaps/models/nsfw.299x299.h5
  6. filter-faces: Remove all instances having images with faces detected by an off-the-shelf face detector. This command also modifies the annotation file in-place and deletes the corresponding images from disk.

    redcaps filter-faces --annotations ./datasets/redcaps/annotations/roses_2020.json \
        --images ./datasets/redcaps/images  # Model weights auto-downloaded
  7. validate: All the above steps create a single annotation file (and downloads images) similar to official RedCaps annotations. To double-check this, run the following command and expect no errors to be printed.

    redcaps validate --annotations ./datasets/redcaps/annotations/roses_2020.json

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021redcaps,
    title={{RedCaps: Web-curated image-text data created by the people, for the people}},
    author={Karan Desai and Gaurav Kaul and Zubin Aysola and Justin Johnson},
    booktitle={NeurIPS Datasets and Benchmarks},
    year={2021}
}
Owner
RedCaps dataset
RedCaps dataset
Search Youtube Video and Get Video info

PyYouTube Get Video Data from YouTube link Installation pip install PyYouTube How to use it ? Get Videos Data from pyyoutube import Data yt = Data("ht

lokaman chendekar 35 Nov 25, 2022
Official implementation of Self-supervised Graph Attention Networks (SuperGAT), ICLR 2021.

SuperGAT Official implementation of Self-supervised Graph Attention Networks (SuperGAT). This model is presented at How to Find Your Friendly Neighbor

Dongkwan Kim 127 Dec 28, 2022
Real-Time Semantic Segmentation in Mobile device

Real-Time Semantic Segmentation in Mobile device This project is an example project of semantic segmentation for mobile real-time app. The architectur

708 Jan 01, 2023
Official code release for "GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis"

GRAF This repository contains official code for the paper GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. You can find detailed usage i

349 Dec 29, 2022
code release for USENIX'22 paper `On the Security Risks of AutoML`

This project is a minimized runnable project cut from trojanzoo, which contains more datasets, models, attacks and defenses. This repo will not be mai

Ren Pang 5 Apr 19, 2022
PASSL包含 SimCLR,MoCo,BYOL,CLIP等基于对比学习的图像自监督算法以及 Vision-Transformer,Swin-Transformer,BEiT,CVT,T2T,MLP_Mixer等视觉Transformer算法

PASSL Introduction PASSL is a Paddle based vision library for state-of-the-art Self-Supervised Learning research with PaddlePaddle. PASSL aims to acce

186 Dec 29, 2022
An NLP library with Awesome pre-trained Transformer models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications.

简体中文 | English News [2021-10-12] PaddleNLP 2.1版本已发布!新增开箱即用的NLP任务能力、Prompt Tuning应用示例与生成任务的高性能推理! 🎉 更多详细升级信息请查看Release Note。 [2021-08-22]《千言:面向事实一致性的生

6.9k Jan 01, 2023
Automatic differentiation with weighted finite-state transducers.

GTN: Automatic Differentiation with WFSTs Quickstart | Installation | Documentation What is GTN? GTN is a framework for automatic differentiation with

100 Dec 29, 2022
Traditional deepdream with VQGAN+CLIP and optical flow. Ready to use in Google Colab

VQGAN-CLIP-Video cat.mp4 policeman.mp4 schoolboy.mp4 forsenBOG.mp4

23 Oct 26, 2022
Short and long time series classification using convolutional neural networks

time-series-classification Short and long time series classification via convolutional neural networks In this project, we present a novel framework f

35 Oct 22, 2022
Implementation of Vaswani, Ashish, et al. "Attention is all you need."

Attention Is All You Need Paper Implementation This is my from-scratch implementation of the original transformer architecture from the following pape

Brando Koch 195 Dec 30, 2022
A PyTorch Implementation of "SINE: Scalable Incomplete Network Embedding" (ICDM 2018).

Scalable Incomplete Network Embedding ⠀⠀ A PyTorch implementation of Scalable Incomplete Network Embedding (ICDM 2018). Abstract Attributed network em

Benedek Rozemberczki 69 Sep 22, 2022
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals, CVPR2021

End-to-End Object Detection with Learnable Proposal, CVPR2021

Peize Sun 1.2k Dec 27, 2022
Performance Analysis of Multi-user NOMA Wireless-Powered mMTC Networks: A Stochastic Geometry Approach

Performance Analysis of Multi-user NOMA Wireless-Powered mMTC Networks: A Stochastic Geometry Approach Thanh Luan Nguyen, Tri Nhu Do, Georges Kaddoum

Thanh Luan Nguyen 2 Oct 10, 2022
Visualization toolkit for neural networks in PyTorch! Demo -->

FlashTorch A Python visualization toolkit, built with PyTorch, for neural networks in PyTorch. Neural networks are often described as "black box". The

Misa Ogura 692 Dec 29, 2022
Python binding for Khiva library.

Khiva-Python Build Documentation Build Linux and Mac OS Build Windows Code Coverage README This is the Khiva Python binding, it allows the usage of Kh

Shapelets 46 Oct 16, 2022
Incomplete easy-to-use math solver and PDF generator.

Math Expert Let me do your work Preview preview.mp4 Introduction Math Expert is our (@salastro, @younis-tarek, @marawn-mogeb) math high school graduat

SalahDin Ahmed 22 Jul 11, 2022
[v1 (ISBI'21) + v2] MedMNIST: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification

MedMNIST Project (Website) | Dataset (Zenodo) | Paper (arXiv) | MedMNIST v1 (ISBI'21) Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bili

683 Dec 28, 2022
Python wrappers to the C++ library SymEngine, a fast C++ symbolic manipulation library.

SymEngine Python Wrappers Python wrappers to the C++ library SymEngine, a fast C++ symbolic manipulation library. Installation Pip See License section

136 Dec 28, 2022
Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion

Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion Preface This directory provides an implementation of the algori

Jean-Samuel Leboeuf 0 Nov 03, 2021