Google-drive-to-sqlite - Create a SQLite database containing metadata from Google Drive

Overview

google-drive-to-sqlite

PyPI Changelog Tests License

Create a SQLite database containing metadata from Google Drive

If you use Google Drive, and especially if you have shared drives with other people there's a good chance you have hundreds or even thousands of files that you may not be fully aware of.

This tool can download metadata about those files - their names, sizes, folders, content types, permissions, creation dates and more - and store them in a SQLite database.

This lets you use SQL to analyze your Google Drive contents, using Datasette or the SQLite command-line tool or any other SQLite database browsing software.

Installation

Install this tool using pip:

$ pip install google-drive-to-sqlite

Authentication

⚠️ This application has not yet been verified by Google - you may find you are unable to authenticate until that verification is complete. #10

First, authenticate with Google Drive using the auth command:

% google-drive-to-sqlite auth
Visit the following URL to authenticate with Google Drive

https://accounts.google.com/o/oauth2/v2/auth?...

Then return here and paste in the resulting code:
Paste code here: 

Follow the link, sign in with Google Drive and then copy and paste the resulting code back into the tool.

This will save an authentication token to the file called auth.json in the current directory.

To specify a different location for that file, use the --auth option:

google-drive-to-sqlite auth --auth ~/google-drive-auth.json

The auth command also provides options for using a different scope, Google client ID and Google client secret. You can use these to create your own custom authentication tokens that can work with other Google APIs, see issue #5 for details.

Full --help:

Usage: google-drive-to-sqlite auth [OPTIONS]

  Authenticate user and save credentials

Options:
  -a, --auth FILE              Path to save token, defaults to auth.json
  --google-client-id TEXT      Custom Google client ID
  --google-client-secret TEXT  Custom Google client secret
  --scope TEXT                 Custom token scope
  --help                       Show this message and exit.

To revoke the token that is stored in auth.json, run the revoke command:

google-drive-to-sqlite revoke

Or if your token is stored in another location:

google-drive-to-sqlite revoke -a ~/google-drive-auth.json

google-drive-to-sqlite files

To retrieve metadata about the files in your Google Drive, or a folder or search within it, use the google-drive-to-sqlite files command.

This will default to writing details about every file in your Google Drive to a SQLite database:

google-drive-to-sqlite files files.db

Files will be written to a files table, which will be created if it does not yet exist.

If a file already exists in that table, based on a matching id, it will be replaced with fresh data.

Instead of writing to SQLite you can use --json to output as JSON, or --nl to output as newline-delimited JSON:

google-drive-to-sqlite files --nl

Use --folder ID to retrieve everything in a specified folder and its sub-folders:

google-drive-to-sqlite files files.db --folder 1E6Zg2X2bjjtPzVfX8YqdXZDCoB3AVA7i

Use --q QUERY to use a custom search query:

google-drive-to-sqlite files files.db -q 'starred = true'

Use --full-text TEXT to search for files where the full text matches a search term:

google-drive-to-sqlite files files.db --full-text 'datasette'

Use --stop-after X to stop after retrieving X files.

Full --help:

Usage: google-drive-to-sqlite files [OPTIONS] [DATABASE]

  Retrieve metadata for files in Google Drive, and write to a SQLite database or
  output as JSON.

      google-drive-to-sqlite files files.db

  Use --json to output JSON, --nl for newline-delimited JSON:

      google-drive-to-sqlite files files.db --json

  Use a folder ID to recursively fetch every file in that folder and its sub-
  folders:

      google-drive-to-sqlite files files.db --folder
      1E6Zg2X2bjjtPzVfX8YqdXZDCoB3AVA7i

Options:
  -a, --auth FILE       Path to auth.json token file
  --folder TEXT         Files in this folder ID and its sub-folders
  -q TEXT               Files matching this query
  --full-text TEXT      Search for files with text match
  --json                Output JSON rather than write to DB
  --nl                  Output newline-delimited JSON rather than write to DB
  --stop-after INTEGER  Stop paginating after X results
  --help                Show this message and exit.

google-drive-to-sqlite download FILE_ID

The download command can be used to download files from Google Drive.

You'll need one or more file IDs, which look something like 0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB.

To download the file, run this:

google-drive-to-sqlite download 0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB

This will detect the content type of the file and use that as the extension - so if this file is a JPEG the file would be downloaded as:

0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB.jpeg

You can pass multiple file IDs to the command at once.

To hide the progress bar and filename output, use -s or --silent.

If you are downloading a single file you can use the -o output to specify a filename and location:

google-drive-to-sqlite download 0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB \
  -o my-image.jpeg

Use -o - to write the file contents to standard output:

google-drive-to-sqlite download 0B32uDVNZfiEKLUtIT1gzYWN2NDI4SzVQYTFWWWxCWUtvVGNB \
  -o - > my-image.jpeg

Full --help:

Usage: google-drive-to-sqlite download [OPTIONS] FILE_IDS...

  Download one or more file IDs to disk

Options:
  -a, --auth FILE    Path to auth.json token file
  -o, --output FILE  File to write to, or - for standard output
  -s, --silent       Hide progress bar and filename
  --help             Show this message and exit.

google-drive-to-sqlite get URL

The get command makes authenticated requests to the specified URL, using credentials derived from the auth.json file.

For example:

% google-drive-to-sqlite get 'https://www.googleapis.com/drive/v3/about?fields=*'
{
    "kind": "drive#about",
    "user": {
        "kind": "drive#user",
        "displayName": "Simon Willison",
# ...

If the resource you are fetching supports pagination you can use --paginate key to paginate through all of the rows in a specified key. For example, the following API has a nextPageToken key and a files list, suggesting it supports pagination:

% google-drive-to-sqlite get https://www.googleapis.com/drive/v3/files
{
    "kind": "drive#fileList",
    "nextPageToken": "~!!~AI9...wogHHYlc=",
    "incompleteSearch": false,
    "files": [
        {
            "kind": "drive#file",
            "id": "1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg",
            "name": "Title of a spreadsheet",
            "mimeType": "application/vnd.google-apps.spreadsheet"
        },

To paginate through everything in the files list you would use --paginate files lyike this:

% google-drive-to-sqlite get https://www.googleapis.com/drive/v3/files --paginate files
[
  {
    "kind": "drive#file",
    "id": "1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg",
    "name": "Title of a spreadsheet",
    "mimeType": "application/vnd.google-apps.spreadsheet"
  },
  # ...

Add --nl to stream paginated data as newline-delimited JSON:

% google-drive-to-sqlite get https://www.googleapis.com/drive/v3/files --paginate files --nl
{"kind": "drive#file", "id": "1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg", "name": "Title of a spreadsheet", "mimeType": "application/vnd.google-apps.spreadsheet"}
{"kind": "drive#file", "id": "1E6Zg2X2bjjtPzVfX8YqdXZDCoB3AVA7i", "name": "Subfolder", "mimeType": "application/vnd.google-apps.folder"}

Add --stop-after 5 to stop after 5 records - useful for testing.

Full --help:

Usage: google-drive-to-sqlite get [OPTIONS] URL

  Make an authenticated HTTP GET to the specified URL

Options:
  -a, --auth FILE       Path to auth.json token file
  --paginate TEXT       Paginate through all results in this key
  --nl                  Output paginated data as newline-delimited JSON
  --stop-after INTEGER  Stop paginating after X results
  --help                Show this message and exit.

Privacy policy

This tool requests access to your Google Drive account in order to retrieve metadata about your files there. It also offers a feature that can download the content of those files.

The credentials used to access your account are stored in the auth.json file on your computer. The metadata and content retrieved from Google Drive is also stored only on your own personal computer.

At no point to the developers of this tool gain access to any of your data.

Development

To contribute to this tool, first checkout the code. Then create a new virtual environment:

cd google-drive-to-sqlite
python -m venv venv
source venv/bin/activate

Or if you are using pipenv:

pipenv shell

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
Issues
  • Submit OAuth app to be verified by Google

    Submit OAuth app to be verified by Google

    So people can start using this tool.

    enhancement 
    opened by simonw 15
  • google-drive-to-sqlite initial release

    google-drive-to-sqlite initial release

    Basic design:

    google-drive-to-sqlite files google.db FOLDER_ID
    

    Looks for auth.json with credentials in the current directory - it wants a refresh_token or an access_token of some sort.

    Features for the initial release:

    • [x] #2
    • [x] #3
    • [x] #7
    • [x] #5
    design 
    opened by simonw 13
  • `google-drive-to-sqlite download FILE_ID` command

    `google-drive-to-sqlite download FILE_ID` command

    Here's the recipe that worked for retrieving the binary contents of a file - the trick is the alt=media parameter:

    def get_binary(file_id):
        return httpx.get(
            "https://www.googleapis.com/drive/v3/files/{}?alt=media".format(file_id),
            headers={
                "Authorization": "Bearer {}".format(access_token)
            }
        ).content
    

    Originally posted by @simonw in https://github.com/simonw/google-drive-to-sqlite/issues/1#issuecomment-1041026849

    enhancement 
    opened by simonw 11
  • `google-drive-to-sqlite auth` command

    `google-drive-to-sqlite auth` command

    Authentication will be tricky. For the moment I'll go with the simplest thing possible, but I may need to build a google-drive-to-sqlite auth command just to get things up and running.

    Originally posted by @simonw in https://github.com/simonw/google-drive-to-sqlite/issues/1#issuecomment-1041023730

    research 
    opened by simonw 9
  • Add `auth` options for setting client ID and secret and scope

    Add `auth` options for setting client ID and secret and scope

    If auth could optionally take an alternative client ID and secret and scope (and write them to auth.json) then the get command could be used by power users (mainly me) to explore other Google APIs.

    • #3
    enhancement 
    opened by simonw 9
  • `google-drive-to-sqlite files` command

    `google-drive-to-sqlite files` command

    I'm going to have this:

    google-drive-to-sqlite files google.db
    

    Retrieve ALL files in the drive.

    google-drive-to-sqlite files google.db --folder FOLDER_ID
    

    Will do just the files in that folder - using ?q= and "folder_id" in parents - but applied recursively to all of the sub-folders.

    google-drive-to-sqlite files google.db --q search_term
    

    Will allow advanced search terms (passed to ?q=).

    Originally posted by @simonw in https://github.com/simonw/google-drive-to-sqlite/issues/1#issuecomment-1041172083

    enhancement 
    opened by simonw 5
  • httpx TimeoutError

    httpx TimeoutError

    Got this exception while testing code from #7:

      File "/Users/simon/Dropbox/Development/google-drive-to-sqlite/google_drive_to_sqlite/utils.py", line 25, in paginate_files
        data = httpx.get(
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_api.py", line 189, in get
        return request(
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_api.py", line 100, in request
        return client.request(
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 802, in request
        return self.send(request, auth=auth, follow_redirects=follow_redirects)
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 889, in send
        response = self._send_handling_auth(
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 917, in _send_handling_auth
        response = self._send_handling_redirects(
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 954, in _send_handling_redirects
        response = self._send_single_request(request)
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_client.py", line 990, in _send_single_request
        response = transport.handle_request(request)
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_transports/default.py", line 217, in handle_request
        with map_httpcore_exceptions():
      File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/contextlib.py", line 153, in __exit__
        self.gen.throw(typ, value, traceback)
      File "/Users/simon/.local/share/virtualenvs/google-drive-to-sqlite-Wr1nXkpK/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
        raise mapped_exc(message) from exc
    httpx.ReadTimeout: The read operation timed out
    
    bug 
    opened by simonw 5
  • Invalid Credentials error if `access_token` expires

    Invalid Credentials error if `access_token` expires

    I left a files job running which took over an hour, and came back to:

    google-drive-to-sqlite big_folder.db --folder 1E6Zg2X2bjjtPzVfX8YqdXZDCoB3AVA7i
    

    google_drive_to_sqlite.utils.FilesError: {'error': {'errors': [{'domain': 'global', 'reason': 'authError', 'message': 'Invalid Credentials', 'locationType': 'header', 'location': 'Authorization'}], 'code': 401, 'message': 'Invalid Credentials'}}

    It looks like it took longer than an hour and the access_token expired!

    Since we have a refresh_token it would be feasible to catch this and try again with a new token.

    bug 
    opened by simonw 5
  • `google-drive-to-sqlite revoke` for revoking token stored in `auth.json`

    `google-drive-to-sqlite revoke` for revoking token stored in `auth.json`

    If a user needs to revoke their refresh_token from auth.json how can they do that?

    enhancement research 
    opened by simonw 5
  • `google-drive-to-sqlite get` command

    `google-drive-to-sqlite get` command

    As seen in https://github.com/dogsheep/github-to-sqlite#making-authenticated-api-calls

    enhancement 
    opened by simonw 3
  • Fetch just files modified since last fetch

    Fetch just files modified since last fetch

    To avoid having to pull everything, it would be useful to be able to fetch just metadata that is likely to have changed since the last time files was run.

    https://developers.google.com/drive/api/v3/search-files says:

    modifiedTime > '2012-06-04T12:00:00' // default time zone is UTC
    

    I can do this in SQL:

    select max(modifiedTime) from files
    

    Which currently returns 2022-02-18T00:00:16.995Z.

    research 
    opened by simonw 1
  • Wait for Google to verify (or not verify) my client ID

    Wait for Google to verify (or not verify) my client ID

    Submitted that here:

    • #10

    Now I wait.

    blocked 
    opened by simonw 1
  • Do something interesting with thumbnails

    Do something interesting with thumbnails

    https://drive.google.com/thumbnail?sz=w800-h800&id=0B32uDVNZfiEKOUliZ3NMNC0yV3RiYkZabjhlSHhXZ2UxOEZn redirects to the correct thumbnail for the image, if the user is signed in and has permission to view it.

    That ID is the file ID. You can use any width and height you like.

    The thumbnailLink we are fetching right now is useless because it expires too quickly.

    research 
    opened by simonw 0
  • Redesign schema with more than one table

    Redesign schema with more than one table

    Right now everything goes in one files table.

    enhancement 
    opened by simonw 2
Releases(0.1a2)
  • 0.1a2(Feb 17, 2022)

  • 0.1a1(Feb 17, 2022)

  • 0.1a0(Feb 16, 2022)

    This initial alpha release may not work for anyone other than me, as Google still need to verify my OAuth application.

    • google-drive-to-sqlite auth command for authenticating with Google and storing the resulting credentials. #2
    • google-drive-to-sqlite files command for fetching file metadata from Google Drive and storing it in a SQLite database file, or outputting it as JSON. #7
    • google-drive-to-sqlite get command for executing other authenticated API calls against Google APIs. #3
    Source code(tar.gz)
    Source code(zip)
ICRA 2021 "Towards Precise and Efficient Image Guided Depth Completion"

PENet: Precise and Efficient Depth Completion This repo is the PyTorch implementation of our paper to appear in ICRA2021 on "Towards Precise and Effic

137 Jan 12, 2022
Deep Learning Models for Causal Inference

Extensive tutorials for learning how to build deep learning models for causal inference using selection on observables in Tensorflow 2.

Bernard J Koch 62 Jan 17, 2022
An implementation for `Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction`

Text2Event An implementation for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction Please contact Yaojie Lu (@

Roger 71 Jan 12, 2022
WORD: Revisiting Organs Segmentation in the Whole Abdominal Region

WORD: Revisiting Organs Segmentation in the Whole Abdominal Region. This repository provides the codebase and dataset for our work WORD: Revisiting Or

Healthcare Intelligence Laboratory 22 Jan 19, 2022
A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

Jun-Yan Zhu 26 May 08, 2021
Campsite Reservation Finder

yellowstone-camping UPDATE: yellowstone-camping is being expanded and renamed to camply. The updated tool now interfaces with the Recreation.gov API a

Justin Flannery 160 Jan 27, 2022
This repository contains a set of codes to run (i.e., train, perform inference with, evaluate) a diarization method called EEND-vector-clustering.

EEND-vector clustering The EEND-vector clustering (End-to-End-Neural-Diarization-vector clustering) is a speaker diarization framework that integrates

30 Jan 13, 2022
MvtecAD unsupervised Anomaly Detection

MvtecAD unsupervised Anomaly Detection This respository is the unofficial implementations of DFR: Deep Feature Reconstruction for Unsupervised Anomaly

2 Jan 02, 2022
Аналитика доходности инвестиционного портфеля в Тинькофф брокере

Аналитика доходности инвестиционного портфеля Тиньков Видео на YouTube Для работы скрипта нужно установить три переменных окружения: export TINKOFF_TO

Alexey Goloburdin 56 Dec 26, 2021
Multi-Person Extreme Motion Prediction

Multi-Person Extreme Motion Prediction Implementation for paper Wen Guo, Xiaoyu Bie, Xavier Alameda-Pineda, Francesc Moreno-Noguer, Multi-Person Extre

GUO-W 7 Jan 12, 2022
Python-kafka-reset-consumergroup-offset-example - Python Kafka reset consumergroup offset example

Python Kafka reset consumergroup offset example This is a simple example of how

Willi Carlsen 1 Feb 15, 2022
Perception-aware multi-sensor fusion for 3D LiDAR semantic segmentation (ICCV 2021)

Perception-Aware Multi-Sensor Fusion for 3D LiDAR Semantic Segmentation (ICCV 2021) [中文|EN] 概述 本工作主要探索一种高效的多传感器(激光雷达和摄像头)融合点云语义分割方法。现有的多传感器融合方法主要将点云投影

ICE 54 Jan 23, 2022
The openspoor package is intended to allow easy transformation between different geographical and topological systems commonly used in Dutch Railway

Openspoor The openspoor package is intended to allow easy transformation between different geographical and topological systems commonly used in Dutch

3 Jan 27, 2022
DLL: Direct Lidar Localization

DLL: Direct Lidar Localization Summary This package presents DLL, a direct map-based localization technique using 3D LIDAR for its application to aeri

Service Robotics Lab 73 Jan 27, 2022
Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF shows significant improvements over baseline fine-tuning without data filtration.

Information Gain Filtration Information Gain Filtration (IGF) is a method for filtering domain-specific data during language model finetuning. IGF sho

3 Sep 25, 2021
Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

CLIN-X (CLIN-X-ES) & (CLIN-X-EN) This repository holds the companion code for the system reported in the paper: "CLIN-X: pre-trained language models a

Bosch Research 3 Dec 29, 2021
PyTorch implementation of the implicit Q-learning algorithm (IQL)

Implicit-Q-Learning (IQL) PyTorch implementation of the implicit Q-learning algorithm IQL (Paper) Currently only implemented for online learning. Offl

Sebastian Dittert 14 Dec 16, 2021
Ranking Models in Unlabeled New Environments (iccv21)

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

14 Dec 16, 2021
Boston House Prediction Valuation Tool

Boston-House-Prediction-Valuation-Tool From Below Anlaysis The Valuation Tool is Designed Correlation Matrix Regrssion Analysis Between Target Vs Pred

1 Nov 26, 2021
Implementation of "Debiasing Item-to-Item Recommendations With Small Annotated Datasets" (RecSys '20)

Debiasing Item-to-Item Recommendations With Small Annotated Datasets This is the code for our RecSys '20 paper. Other materials can be found here: Ful

Microsoft 30 Dec 09, 2021