Tools for analyzing Git history using SQLite

Last update: Jan 02, 2023

git-history

Installation

Install this tool using pip:

$ pip install git-history

Usage

This tool can be run against a Git repository that holds a file that contains JSON, CSV/TSV or some other format and which has multiple versions tracked in the Git history. See Git scraping to understand how you might create such a repository.

The file command analyzes the history of an individual file within the repository, and generates a SQLite database table that represents the different versions of that file over time.

The file is assumed to contain multiple objects - for example, the results of scraping an electricity outage map or a CSV file full of records.

Assume you have a file called incidents.json that is a JSON array of objects, with multiple versions of that file recorded in a repository.

Change directory into the Git repository in question and run the following:

git-history file incidents.db incidents.json

This will create a new SQLite database in the incidents.db file with two tables:

commits containing a row for every commit, with a hash column and the commit_at date.
items containing a row for every item in every version of the incidents.json file - with an extra commit column that is a foreign key back to the commits table.

If you have 10 historic versions of the incidents.json file and each one contains 30 incidents, you will end up with 10 * 30 = 300 rows in your items table.
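To make the shape of that database concrete, here is a minimal sketch using Python's built-in sqlite3 module. It does not read a database produced by git-history; it builds a throwaway in-memory stand-in that imitates the two tables described above. The name column, the fake hashes and the dates are assumptions made purely for illustration. Note that commit is an SQL keyword, so the column name has to be quoted in queries.

```python
import sqlite3

# In-memory stand-in for incidents.db (the real file is written by the tool).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE commits (hash TEXT PRIMARY KEY, commit_at TEXT)")
# "commit" is an SQL keyword, so the column name must be quoted.
db.execute(
    'CREATE TABLE items (name TEXT, "commit" TEXT REFERENCES commits(hash))'
)

# Simulate 10 historic versions of the file, each containing 30 incidents.
for version in range(10):
    commit_hash = f"fakehash{version:02d}"  # hypothetical commit hash
    db.execute(
        "INSERT INTO commits VALUES (?, ?)",
        (commit_hash, f"2021-11-{version + 1:02d}T00:00:00"),
    )
    db.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(f"Incident {n}", commit_hash) for n in range(30)],
    )

total_items = db.execute("SELECT count(*) FROM items").fetchone()[0]
items_per_commit = db.execute(
    'SELECT "commit", count(*) FROM items GROUP BY "commit" ORDER BY "commit"'
).fetchall()
print(total_items)  # 300
```

Every version of the file contributes its full set of rows, which is why the table grows as versions multiplied by items.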

De-duplicating items using IDs

If your objects have a unique identifier - or multiple columns that together form a unique identifier - you can use the --id option to de-duplicate and track changes to each of those items over time.

If there is a unique identifier column called IncidentID you could run the following:

git-history file incidents.db incidents.json --id IncidentID

This will create three tables - commits, items and item_versions.

The items table will contain just the most recent version of each row, de-duplicated by ID.

The item_versions table will contain a row for each captured differing version of that item, plus the following columns:

item as a foreign key to the items table
commit as a foreign key to the commits table
version as the numeric version number, starting at 1 and incrementing for each captured version
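The de-duplication described above can be sketched in plain Python. This is an illustration of the idea, not the tool's actual implementation: the track_versions function and its in-memory data structures are hypothetical, and the sketch ignores reserved-column renaming.

```python
def track_versions(history, id_column):
    """history: list of (commit_hash, list_of_item_dicts), oldest first."""
    items = {}            # item ID -> most recent version of that item
    item_versions = []    # one row per captured differing version
    version_numbers = {}  # item ID -> last version number issued
    for commit_hash, records in history:
        for record in records:
            item_id = record[id_column]
            if items.get(item_id) == record:
                continue  # unchanged since the last captured version
            items[item_id] = record
            version_numbers[item_id] = version_numbers.get(item_id, 0) + 1
            item_versions.append(
                {
                    "item": item_id,
                    "commit": commit_hash,
                    "version": version_numbers[item_id],
                    **record,
                }
            )
    return items, item_versions


history = [
    ("aaa", [{"IncidentID": "552", "engines": 3},
             {"IncidentID": "556", "engines": 1}]),
    ("bbb", [{"IncidentID": "552", "engines": 4},
             {"IncidentID": "556", "engines": 1}]),
]
items, item_versions = track_versions(history, "IncidentID")
```

In this example incident 552 changes in the second commit and gets a version 2 row, while incident 556 is unchanged and keeps a single version.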

If you have already imported history, the command will skip any commits that it has seen already and just process new ones. This means that even though an initial import could be slow, subsequent imports should run a lot faster.
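One way to picture the skipping of already-seen commits is a lookup against the hash column of the commits table. The sketch below is an assumption about how such a check could work, not a copy of the tool's code; the unseen_commits helper and the sample hashes are hypothetical.

```python
import sqlite3


def unseen_commits(db, repo_hashes):
    """Return the hashes from repo_hashes that are not yet in the commits table."""
    seen = {row[0] for row in db.execute("SELECT hash FROM commits")}
    return [commit_hash for commit_hash in repo_hashes if commit_hash not in seen]


# Stand-in database that already holds two imported commits.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE commits (hash TEXT PRIMARY KEY, commit_at TEXT)")
db.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("aaa", "2021-11-01"), ("bbb", "2021-11-02")],
)

# The repository now has four commits; only the last two need processing.
todo = unseen_commits(db, ["aaa", "bbb", "ccc", "ddd"])
```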

Additional options:

--repo DIRECTORY - the path to the Git repository, if it is not the current working directory.
--branch TEXT - the Git branch to analyze - defaults to main.
--id TEXT - as described above: pass one or more columns that uniquely identify a record, so that changes to that record can be calculated over time.
--ignore TEXT - one or more columns to ignore - they will not be included in the resulting database.
--csv - treat the data as CSV or TSV rather than JSON, and attempt to guess the correct dialect.
--convert TEXT - custom Python code for a conversion, see below.
--import TEXT - Python modules to import for --convert.
--ignore-duplicate-ids - if a single version of a file has the same ID in it more than once, the tool will exit with an error. Use this option to ignore this and instead pick just the first of the two duplicates.
--silent - don't show the progress bar.

Note that id, item, version, commit and rowid are reserved column names that are used by this tool. If your data contains any of these they will be renamed to id_, item_, version_, commit_ or rowid_ to avoid clashing with the reserved columns.

There is one exception: if you have an id column and use --id id without specifying more than one ID column, your id column will be used as the item ID but will not be renamed.
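The renaming rule, including the single id exception, can be expressed as a short function. This is a sketch of the behavior as documented here; the rename_reserved name is hypothetical, and the sketch does not handle data that already contains a column such as id_.

```python
RESERVED = ("id", "item", "version", "commit", "rowid")


def rename_reserved(record, id_columns=()):
    """Append an underscore to reserved column names in a single record."""
    # Exception: --id id on its own keeps the id column under its own name.
    keep_id = tuple(id_columns) == ("id",)
    renamed = {}
    for key, value in record.items():
        if key in RESERVED and not (key == "id" and keep_id):
            key = key + "_"
        renamed[key] = value
    return renamed


plain = rename_reserved({"id": 1, "commit": "x", "name": "n"})
with_id = rename_reserved({"id": 1, "version": 2}, id_columns=["id"])
```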

CSV and TSV data

If the data in your repository is a CSV or TSV file you can process it by adding the --csv option. This will attempt to detect which delimiter is used by the file, so the same option works for both comma- and tab-separated values.

git-history file trees.db trees.csv --csv --id TreeID
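Delimiter detection of this kind can be tried out with csv.Sniffer from the Python standard library. The following is a rough sketch of the idea rather than the tool's exact code: the parse_delimited helper, the 1024 character sample size and the restricted candidate delimiters are all assumptions.

```python
import csv
import io


def parse_delimited(content):
    """Decode bytes, sniff the dialect, and return the rows as dictionaries."""
    text = content.decode("utf-8")
    # Restricting the candidates makes the guess more predictable.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",\t;")
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))


comma_rows = parse_delimited(b"TreeID,Species\n1,Oak\n2,Elm\n3,Ash\n")
tab_rows = parse_delimited(b"TreeID\tSpecies\n1\tOak\n2\tElm\n3\tAsh\n")
```

Sniffing is a heuristic, so it can guess wrong on unusual files; in that case a custom conversion gives full control over parsing.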

Custom conversions using --convert

If your data is not already either CSV/TSV or a flat JSON array, you can reshape it using the --convert option.

The format needed by this tool is an array of dictionaries that looks like this:

[
    {
        "id": "552",
        "name": "Hawthorne Fire",
        "engines": 3
    },
    {
        "id": "556",
        "name": "Merlin Fire",
        "engines": 1
    }
]

If your data does not fit this shape, you can provide a snippet of Python code that converts the on-disk content of each stored file into a Python list of dictionaries.

For example, if your stored files each look like this:

{
    "incidents": [
        {
            "id": "552",
            "name": "Hawthorne Fire",
            "engines": 3
        },
        {
            "id": "556",
            "name": "Merlin Fire",
            "engines": 1
        }
    ]
}

You could use the following Python snippet to convert them to the required format:

json.loads(content)["incidents"]

(The json module is exposed to your custom function by default.)

You would then run the tool like this:

git-history file database.db incidents.json \
  --id id \
  --convert 'json.loads(content)["incidents"]'

The content variable is always a bytes object representing the content of the file at a specific moment in the repository's history.
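Because content is bytes, it can help to try a conversion expression in a Python session before passing it to --convert. The sample bytes below are a hypothetical stand-in for one stored version of incidents.json.

```python
import json

# Hypothetical stand-in for one stored version of the file, as bytes.
content = b'{"incidents": [{"id": "552", "name": "Hawthorne Fire", "engines": 3}]}'

# json.loads() accepts bytes directly, so JSON needs no explicit decode step.
items = json.loads(content)["incidents"]

# Modules that expect text, such as csv, need the bytes decoded first.
text = content.decode("utf-8")
```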

You can import additional modules using --import. This example shows how you could read a CSV file that uses ; as the delimiter:

git-history file trees.db ../sf-tree-history/Street_Tree_List.csv \
  --repo ../sf-tree-history \
  --import csv \
  --import io \
  --convert '
    fp = io.StringIO(content.decode("utf-8"))
    return list(csv.DictReader(fp, delimiter=";"))
    ' \
  --id TreeID

If your Python code spans more than one line it needs to include a return statement.
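The difference between single-line and multi-line snippets can be pictured as follows. This is a guess at one possible mechanism, not the tool's actual source: the compile_convert helper is hypothetical, and it assumes a single-line snippet is treated as an expression whose value is returned, while a multi-line snippet becomes a function body. It uses exec, so it should only ever be given code you trust.

```python
import json
import textwrap


def compile_convert(code, imports=()):
    """Turn a --convert style snippet into a callable taking content (bytes)."""
    body = textwrap.dedent(code).strip()
    if "\n" not in body:
        # Single expression: supply the return statement automatically.
        body = "return " + body
    source = "def convert(content):\n" + textwrap.indent(body, "    ")
    namespace = {"json": json}  # json is available by default
    for name in imports:
        # __import__("a.b") loads the submodule and returns the top-level package.
        namespace[name.split(".")[0]] = __import__(name)
    exec(source, namespace)
    return namespace["convert"]


single = compile_convert('json.loads(content)["incidents"]')
multi = compile_convert(
    '''
    fp = io.StringIO(content.decode("utf-8"))
    return list(csv.DictReader(fp, delimiter=";"))
    ''',
    imports=["csv", "io"],
)
incidents = single(b'{"incidents": [{"id": "552"}]}')
trees = multi(b"TreeID;Species\n1;Oak\n")
```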

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd git-history
python -m venv venv
source venv/bin/activate

Or if you are using pipenv:

pipenv shell

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

Comments

Add live demos

I'm tempted to pull a bunch of different example repos on a schedule and bundle them into the same demo instance.

Could have a recipes.md documentation page that shares the same demos and shows how they were built, using cog somehow.
documentation research

opened by simonw 32
Default to only storing columns that have changed in item_version

(Original title: Option to store only columns that have changed in item_versions)

When browsing a list of item versions like this one it's difficult to tell at a glance which rows have changed since the previous version:

It would be neat if there was a mode that could only store values in the versions table for columns that have changed since the last version.
enhancement research

opened by simonw 24
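The "only store columns that have changed" idea from the issue above can be sketched as a dictionary comparison. The changed_columns function is hypothetical and only illustrates the concept; it does not attempt to record columns that disappear between versions.

```python
_MISSING = object()  # distinguishes "column absent" from a stored None


def changed_columns(previous, current):
    """Return only the columns whose values differ from the previous version."""
    if previous is None:
        return dict(current)  # first version: every column counts as new
    return {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }


first = {"id": "552", "name": "Hawthorne Fire", "engines": 3}
second = {"id": "552", "name": "Hawthorne Fire", "engines": 4}

initial = changed_columns(None, first)
delta = changed_columns(first, second)
```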
Use integer primary keys for smaller tables
Refs #12. Still needs a bit more work:

[x] See if I can come up with a better column name than _item_hash_id (I went with _item_id)

[x] Ship sqlite-utils 3.19 and update dependency

[x] Update schema description in README

[x] Update reserved columns
opened by simonw 7
Support history of more than one file in a single database

Provide a way to customize the name of the items_ and item_versions tables - but also need to work out how to reflect in the commits table that they may be related to other things (perhaps? Maybe it's OK to have commits just live there - but they should probably indicate their repository somehow - maybe via a foreign key to a repos table).
enhancement research

opened by simonw 7

Add a `--dialect` option for forcing a CSV dialect

Running against https://github.com/simonw/fara-history

(git-history) git-history % git-history file fara.db ../fara-history/FARA_All_Registrants.csv --repo ../fara-history --id Registration_Number --changed --branch master --csv
  [------------------------------------]  1/376    0%Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/bin/git-history", line 33, in <module>
    sys.exit(load_entry_point('git-history', 'console_scripts', 'git-history')())
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/git-history-nXMauUZE/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/git-history/git_history/cli.py", line 246, in file
    item = fix_reserved_columns(item)
  File "/Users/simon/Dropbox/Development/git-history/git_history/utils.py", line 8, in fix_reserved_columns
    if not any(reserved_with_suffix_re.match(key) for key in item):
  File "/Users/simon/Dropbox/Development/git-history/git_history/utils.py", line 8, in <genexpr>
    if not any(reserved_with_suffix_re.match(key) for key in item):
TypeError: expected string or bytes-like object

After much debugging, it turns out the problem is running the CSV parser against this specific revision of the file: https://github.com/simonw/fara-history/blob/ab27087f642680697db6c914d094bf3d06b363f3/FARA_All_Registrants.csv

Here's what's happening:

>>> import csv, httpx, io
>>> content = httpx.get("https://raw.githubusercontent.com/simonw/fara-history/ab27087f642680697db6c914d094bf3d06b363f3/FARA_All_Registrants.csv").content
>>> decoded = content.decode("utf-8")
>>> dialect = csv.Sniffer().sniff(decoded[:512])
>>> (dialect.delimiter, dialect.doublequote, dialect.escapechar, dialect.lineterminator, dialect.quotechar, dialect.quoting, dialect.skipinitialspace)
(',', False, None, '\r\n', '"', 0, False)
>>> reader = csv.DictReader(io.StringIO(decoded), dialect=dialect)
>>> items = list(reader)
>>> [it for it in items if it["Registration_Number"] == '4797']
[{'Registration_Number': '4797',
  'Registration_Date': '04/20/1993',
  'Termination_Date': '05/06/1993',
  'Name': 'National Petroleum Company, "Sudan""',
  'Business_Name': ' Ltd."',
  'Address_1': '',
  'Address_2': '525 South Lancaster Street',
  'City': '',
  'State': 'Arlington',
  'Zip': 'VA',
  None: ['22204']}]

What is going on with that last item of None: ['22204']?

bug

opened by simonw 6
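The None key in the session above is standard csv.DictReader behavior: when a row has more fields than the header, the extra values are collected in a list under the restkey, which defaults to None. A sniffed dialect with doublequote set to False produces such extra fields, because a doubled quote inside a quoted field is no longer read as one literal quote. The sketch below reproduces the effect with made-up data; the column names and values are assumptions, not rows from the FARA file.

```python
import csv
import io

text = 'id,name,city\n1,"Acme, ""East"", Ltd.",Arlington\n'

# doublequote=False mirrors the dialect guessed by the sniffer above:
# the first inner quote ends the quoted section, so the comma after
# East"" is treated as a delimiter and the row gains a fourth field.
misparsed = list(csv.DictReader(io.StringIO(text), doublequote=False))

# The default (doublequote=True) reads "" inside quotes as one literal quote.
correct = list(csv.DictReader(io.StringIO(text)))
```

A None key would also plausibly explain the TypeError in the traceback, since a regular expression cannot be matched against None.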

Item should link to commit

There's a bug in the non-id branch. Each item in items is redefined, with a new object, then _commit is set, but never used anywhere.

In my use case, I need to know the commit each item comes from, and this fix allows it:

opened by tomviner 5
_version resets to 1 when command is run incrementally

I restarted it to try out an optimization, but then realized there's a nasty bug: the _version resets to 1 if you restart it even if there are already commits in the database.

Originally posted by @simonw in https://github.com/simonw/git-history/issues/21#issuecomment-983116364
bug

opened by simonw 5
Change how reserved columns work to have an underscore prefix

Right now I've implemented it such that id and commit and version and item are reserved columns, and any user-provided data with those columns gets renamed to commit_ and id_ and so-on - see #8.

I've changed my mind. I think this tool's columns should all have a _ prefix instead (like semi-private class properties in Python).

I'm even going to do that for the _id column, mainly so I don't have to explain a special case for id.
enhancement

opened by simonw 5
Columns with list/dict JSON values always detected as changed
It looks like a column with a JSON value such as:

[ "Bridge work", "Long-term construction" ]

Gets recorded as a changed value for every version, even when it hasn't changed.
bug
opened by simonw 3
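A false positive like the one in the issue above can arise whenever a Python list is compared with its stored representation. The sketch below assumes, purely for illustration, that list and dict values come back from the database as JSON text; the normalize and has_changed helpers are hypothetical.

```python
import json


def normalize(value):
    """Serialize lists and dicts to JSON text so both sides are comparable."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, sort_keys=True)
    return value


def has_changed(stored_value, new_value):
    return normalize(stored_value) != normalize(new_value)


new_value = ["Bridge work", "Long-term construction"]
stored_value = normalize(new_value)  # what a TEXT column would hand back

# A string never equals a list, so the naive check always reports a change.
naive_changed = stored_value != new_value
normalized_changed = has_changed(stored_value, new_value)
```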
`--start-at` and `--start-after` options

Found a bad commit which broke the script: https://github.com/simonw/sf-tree-history/commit/3fb63a99dfab8a75c83d341c67afc9abf484e0c4 in https://github.com/simonw/git-history/issues/21#issuecomment-983130553

Solution: options to say "start at this commit" or "start at the commit AFTER this commit".
enhancement

opened by simonw 3
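The "start at" and "start after" semantics proposed above amount to slicing an ordered list of commit hashes. This sketch uses a hypothetical select_commits function with made-up hashes, and also shows how individual commits could be skipped.

```python
def select_commits(hashes, start_at=None, start_after=None, skip=()):
    """Filter an oldest-first list of commit hashes."""
    if start_at and start_after:
        raise ValueError("Use start_at or start_after, not both")
    marker = start_at or start_after
    if marker is not None:
        index = hashes.index(marker)  # raises ValueError for an unknown hash
        # start_at includes the marker commit, start_after begins just past it.
        first = index if start_at else index + 1
        hashes = hashes[first:]
    skipped = set(skip)
    return [commit_hash for commit_hash in hashes if commit_hash not in skipped]


hashes = ["aaa", "bbb", "ccc", "ddd"]
from_bbb = select_commits(hashes, start_at="bbb")
after_bbb = select_commits(hashes, start_after="bbb")
without_ccc = select_commits(hashes, skip=["ccc"])
```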
Support for `--import xml.etree.ElementTree`
https://github.com/simonw/neededge-history/blob/main/v1.xml

This should work, but doesn't without a small change:

git-history file neededge.db v1.xml --id url --convert '
    tree = xml.etree.ElementTree.fromstring(content)
    return [site.attrib for site in tree.iter("site")]
    ' --import xml.etree.ElementTree

Originally posted by @simonw in https://github.com/simonw/git-history/issues/30#issuecomment-987285857
enhancement
opened by simonw 2
Ensure that shipping the item _commit fix doesn't break existing databases

An interesting challenge with this change: since it modifies the schema, shipping a release with it could break existing databases the next time git-history file ... is run against them.

This would affect my workflow here for example: https://github.com/simonw/scrape-instances-social/blob/main/.github/workflows/scrape.yml

Originally posted by @simonw in https://github.com/simonw/git-history/issues/59#issuecomment-1321268605
bug

opened by simonw 1
Feature idea: --always columns

Sometimes you might find that you want to record a value every time for a column even while using the mechanism which uses null for values that have not changed - for this project for example: https://github.com/simonw/scrape-instances-social

Idea: a --always colname option which turns this on (and can be applied multiple times).
enhancement

opened by simonw 1

Running with --full-versions twice fails with an error

scrape-instances-social % git-history file counts.db instances.json \
  --convert "
    instances = json.loads(content)
    return [
    {
        'id': 'all',
        'users': sum(d['users'] or 0 for d in instances),
        'statuses': sum(int(d['statuses'] or 0) for d in instances),
        'instances': len(instances)
    }
  ]" --id id --full-versions
  [####################################]  17/17  100%
scrape-instances-social % git-history file counts.db instances.json \
  --convert "
    instances = json.loads(content)
    return [
    {
        'id': 'all',
        'users': sum(d['users'] or 0 for d in instances),
        'statuses': sum(int(d['statuses'] or 0) for d in instances),
        'instances': len(instances)
    }
  ]" --id id --full-versions
Traceback (most recent call last):
  File "/Users/simon/.local/bin/git-history", line 8, in <module>
    sys.exit(cli())
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/git_history/cli.py", line 187, in file
    item_id_to_version, item_id_to_last_full_hash = get_versions_and_hashes(
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/git_history/cli.py", line 555, in get_versions_and_hashes
    for row in db.query(sql):
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/sqlite_utils/db.py", line 410, in query
    cursor = self.execute(sql, params or tuple())
  File "/Users/simon/.local/pipx/venvs/git-history/lib/python3.10/site-packages/sqlite_utils/db.py", line 422, in execute
    return self.conn.execute(sql, parameters)
sqlite3.OperationalError: no such column: item_version._item_full_hash

While working on:

https://github.com/simonw/scrape-instances-social/issues/2

bug

opened by simonw 1

Defining composed ids considering new lines as different items
I'm a newbie to the datasette ecosystem and I'm particularly amazed by the git-scraping technique. Thanks Simon for sharing it!

I need help defining a composed id on the rows for this CSV, where I'm tracking power outage events in Buenos Aires's metropolitan area every 20 minutes.

https://github.com/OpenDataCordoba/cortes_enre/blob/main/cortes_enre.csv

My problem is that there is no clear ID for each event, and I would like to track changes to each event over time.

Consider this recent commit https://github.com/OpenDataCordoba/cortes_enre/commit/b3cde1c1d3b27dc0a76249d0025e3cbe68d914ed

Here it seems I could use all the columns but the last two as a composed id

latitud,longitud,nn,tipo,empresa,partido,localidad,subestacion,alimentador

Then the columns afectados (affected users) and normalizacion estimada (estimated time to normalization) could change during the next few updates, but eventually the line will be deleted.

The problem is that the composed id basically describes the "place" where the outage is happening, and maybe in the future it could be a totally different event in the same place unrelated to the current event.

So, how could I distinguish different events in the same place? I'm wondering if there is a way to consider it's a new item if the composed id appears again (ie the commit is not updating an existing line but adding it).
opened by mgaitan 1
Tables not being created, only `namespaces`

Hi!

I am trying to run git-history on a repository containing a JSON file that has multiple versions over time (hundreds of commits to the same file).

When I run git-history file some_data.db data/latest.json --branch master, it creates some_data.db, but no commits table is created.

Only a single namespaces table with a single item is created.

Not sure what I am missing here?

Kind regards, Lasse

opened by lassebenni 3
Floating point numbers seem to always be recorded as changed
In this example:

I don't think latitude and longitude should be populated as they have not changed between records (unlike units).

This is from a demo database built against https://github.com/simonw/scrape-san-mateo-fire-dispatch with:

git-history file history.db incidents.json --id id

Relevant code:

https://github.com/simonw/git-history/blob/ce9e2f161f8037aab8f15dcffb4c7ff8f94ab3b4/git_history/cli.py#L344-L354
bug
opened by simonw 3

Releases(0.6.1)

0.6.1(Dec 8, 2021)
Fixed bug where databases containing multiple file histories created using the --namespace option could have later commits applied to them in subsequent runs. #42, #43

0.6(Dec 7, 2021)
Fixed critical bug where columns were incorrectly recorded as consistently toggling between null and their current value. #33

Documentation now includes links to live examples of databases created using this tool. #30

--wal option for turning on SQLite WAL mode - useful if you want to safely run queries against the database file while it is still being built. #31

Fixed bug where list and dict values were not correctly compared for equality. #32

The item_version_detail SQL view now includes a _changed_column JSON array of column names that changed in each version. #37

Nested packages such as --import xml.etree.ElementTree can now be imported. #39

item_version._item is now an indexed column. #38

0.5(Dec 3, 2021)
The item_version table now only records values that have changed since the previous item version. A new item_changed many-to-many table records exactly which columns were changed in which item version, to compensate for ambiguous null values. #21

New --full-versions option for storing full copies of each version instead of storing just the columns that have changed.

Major backwards-incompatible schema changes - see README for details of the new schema.

New --dialect option for specifying a CSV dialect if you don't want to use auto-detection. #27

The history for multiple files can now be stored in the same database, using the new --namespace option. #13

--skip HASH, --start-at HASH and --start-after HASH options for skipping specific Git commits or starting processing at or after a specific hash. #26, #28

0.4(Nov 21, 2021)
Major changes to the database schema. Foreign keys now use integer primary key IDs rather than using lengthy item or commit hashes, which reduces the database size for large repositories by almost half. #12

Python generators can now be used in --convert functions. #16

Reserved columns are now marked by an underscore prefix, for example _id and _commit. #14

0.3.1(Nov 12, 2021)
Fixed bug where files with a "rowid" column would fail to import correctly. #10

0.3(Nov 12, 2021)
git-history file command now shows a progress bar, unless run with the --silent option. #9

0.2.1(Nov 12, 2021)
Improved usage documentation in the README

0.2(Nov 12, 2021)
--csv option can now be used to process CSV or TSV data - the dialect is detected automatically. #6

Implemented reserved column names: id, item, version and commit. If your data includes any of these column names they will be renamed to id_, item_, version_ and commit_ respectively. #8

0.1(Nov 12, 2021)
Initial release. git-history file db.db filepath.json command, see README for details.


Owner

Simon Willison
