Python bindings for the simdjson project.

Overview


pysimdjson

Python bindings for the simdjson project, a SIMD-accelerated JSON parser. If SIMD instructions are unavailable, a fallback parser is used, making pysimdjson safe to use anywhere.

Bindings are currently tested on OS X, Linux, and Windows for Python versions 3.5 to 3.9.

📝 Documentation

The latest documentation can be found at https://pysimdjson.tkte.ch.

If you've checked out the source code (for example to review a PR), you can build the latest documentation by running cd docs && make html.

🎉 Installation

If binary wheels are available for your platform, you can install from pip with no further requirements:

pip install pysimdjson

Binary wheels are available for the following:

Platform           py3.5  py3.6  py3.7  py3.8  py3.9  pypy3
OS X (x86_64)      y      y      y      y      y      y
Windows (x86_64)   x      x      y      y      y      x
Linux (x86_64)     y      y      y      y      y      x
Linux (ARM64)      y      y      y      y      y      x

If binary wheels are not available for your platform, you'll need a C++11-capable compiler to compile the sources:

pip install pysimdjson --no-binary :all:

Both simdjson and pysimdjson support FreeBSD and Linux on ARM when built from source.

Development and Testing

This project comes with a full test suite. To install development and testing dependencies, use:

pip install -e ".[test]"

To also install 3rd party JSON libraries used for running benchmarks, use:

pip install -e ".[benchmark]"

To run the tests, just type pytest. To also run the benchmarks, use pytest --runslow.

To properly test on Windows, you need both a recent version of Visual Studio (VS) and VS2015, patch 3. Older versions of CPython require C/C++ extensions to be built with the same version of VS as the interpreter itself. Use the Developer Command Prompt to easily switch between versions.

How It Works

This project uses pybind11 to generate the low-level bindings on top of the simdjson project. You can use it just like the built-in json module, or use the simdjson-specific API for much better performance.

import simdjson
doc = simdjson.loads('{"hello": "world"}')
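
The module also mirrors the standard library's load(), loads(), dump(), and dumps() helpers, so most existing code can switch by changing only the import. A minimal sketch (the io buffer stands in for a real file, and dumps is expected to delegate to the standard library encoder):

import io
import simdjson

buf = io.BytesIO(b'{"hello": "world"}')
doc = simdjson.load(buf)        # mirrors json.load()
assert doc == {"hello": "world"}
print(simdjson.dumps(doc))      # mirrors json.dumps()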

🚀 Making things faster

pysimdjson provides an API compatible with the built-in json module for convenience, and this API is pretty fast (beating or tying all other Python JSON libraries). However, it also provides a simdjson-specific API that can perform significantly better.

Don't load the entire document

About 95% of the time spent loading a JSON document into Python goes to creating Python objects, not to actually parsing the document. You can avoid this overhead by ignoring the parts of the document you don't want.

pysimdjson supports this in two ways: JSON pointers via at_pointer(), or proxies for objects and lists.

import simdjson
parser = simdjson.Parser()
doc = parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

For the sample above, we only want the second entry in res; we don't care about anything else. We can get it in two ways:

assert doc['res'][1]['name'] == 'second' # True
assert doc.at_pointer('/res/1/name') == 'second' # True

Both of these approaches will be much faster than using load()/loads(), since they avoid creating Python objects for the parts of the document we don't care about.

Both Object and Array have a mini property that returns their entire content as a minified Python str. For example, a message router could parse a document, retrieve a single property (the destination), and forward the minified payload without ever turning the rest into Python objects. Here's a (bad) example:

import simdjson

# `app`, `request`, and `redis` are assumed to come from the surrounding
# application (for example, a Flask app and a Redis client).
@app.route('/store', methods=['POST'])
def store():
    parser = simdjson.Parser()
    doc = parser.parse(request.data)
    redis.set(doc['key'], doc.mini)

With this, doc could contain thousands of objects, but the only one loaded into a Python object was key, and the content was minified along the way.

Re-use the parser

One of the easiest performance gains if you're working on many documents is to re-use the parser.

import simdjson
parser = simdjson.Parser()

for i in range(0, 100):
    doc = parser.parse(b'{"a": "b"}')

This will drastically reduce the number of allocations being made, as it will reuse the existing buffer when possible. If it's too small, it'll grow to fit.

📈 Benchmarks

pysimdjson compares well against most libraries for the default load()/loads(), which creates full Python objects immediately.

pysimdjson performs significantly better when only part of the document is of interest. For each test file we show the time taken to completely deserialize the document into Python objects, as well as the time to get the deepest key in each file. The second approach avoids all unnecessary object creation.
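
Roughly, the two measurements correspond to the sketch below. This is illustrative only: the file path refers to the repository's jsonexamples directory, and the pointer is a made-up example rather than the path the benchmark actually walks.

import simdjson

with open('jsonexamples/canada.json', 'rb') as src:
    raw = src.read()

# "deserialization": turn the entire document into Python objects.
full = simdjson.loads(raw)

# "deepest key": parse once, then touch a single nested value through the
# proxy API, so almost no Python objects are created.
parser = simdjson.Parser()
doc = parser.parse(raw)
value = doc.at_pointer('/type')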

jsonexamples/canada.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{canada} 10.67130 22.89260 0.00465 60.30257
yyjson-{canada} 11.29230 29.90640 0.00568 53.27890
orjson-{canada} 11.90260 34.88260 0.00507 54.49605
ujson-{canada} 18.17060 48.99410 0.00718 36.24892
simplejson-{canada} 39.24630 52.62860 0.00483 21.81617
rapidjson-{canada} 41.04930 53.10800 0.00445 21.19078
json-{canada} 44.68320 59.44410 0.00440 19.71509

jsonexamples/canada.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{canada} 3.21360 6.88010 0.00044 285.83978
yyjson-{canada} 10.62770 46.10050 0.01000 43.29310
orjson-{canada} 12.54010 39.16080 0.00779 44.28928
ujson-{canada} 17.93980 35.44960 0.00697 36.78481
simplejson-{canada} 38.58160 54.33290 0.00699 21.37382
rapidjson-{canada} 40.69030 58.23460 0.00700 20.30349
json-{canada} 43.88300 65.04480 0.00722 18.55929

jsonexamples/twitter.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
orjson-{twitter} 2.36070 14.03050 0.00123 346.94307
simdjson-{twitter} 2.41350 12.01550 0.00117 359.49272
yyjson-{twitter} 2.48130 12.03680 0.00112 353.03313
ujson-{twitter} 2.62890 11.39370 0.00090 346.87994
simplejson-{twitter} 3.34600 11.08840 0.00098 270.58797
json-{twitter} 3.35270 11.82610 0.00116 260.01943
rapidjson-{twitter} 4.29320 13.81980 0.00128 197.91107

jsonexamples/twitter.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{twitter} 0.33840 0.67200 0.00002 2800.32496
orjson-{twitter} 2.38460 13.53120 0.00131 352.70788
yyjson-{twitter} 2.48180 13.67470 0.00156 320.56731
ujson-{twitter} 2.65230 11.65150 0.00125 331.69430
json-{twitter} 3.34910 12.44890 0.00116 263.25854
simplejson-{twitter} 3.35760 15.61900 0.00137 262.36758
rapidjson-{twitter} 4.31870 12.77490 0.00119 201.86510

jsonexamples/github_events.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
orjson-{github_events} 0.18080 0.67020 0.00004 5041.29485
simdjson-{github_events} 0.19470 0.61450 0.00003 4725.63489
yyjson-{github_events} 0.19710 0.53970 0.00004 4584.50870
ujson-{github_events} 0.23760 1.33490 0.00004 3904.08715
json-{github_events} 0.29030 1.32040 0.00009 3034.22530
simplejson-{github_events} 0.30210 0.82260 0.00005 3067.99997
rapidjson-{github_events} 0.33010 0.92400 0.00005 2793.93274

jsonexamples/github_events.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{github_events} 0.03630 0.66110 0.00001 25259.19598
orjson-{github_events} 0.18210 0.71230 0.00003 5073.48086
yyjson-{github_events} 0.20030 0.61270 0.00003 4589.71299
ujson-{github_events} 0.24260 1.05100 0.00007 3644.08240
json-{github_events} 0.29310 2.38770 0.00011 2967.79019
simplejson-{github_events} 0.30580 1.39670 0.00007 2931.01646
rapidjson-{github_events} 0.33340 0.80440 0.00004 2795.27887

jsonexamples/citm_catalog.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
orjson-{citm_catalog} 5.40140 17.76900 0.00314 130.33847
yyjson-{citm_catalog} 5.77340 23.09490 0.00421 113.78942
simdjson-{citm_catalog} 6.00620 26.87570 0.00444 104.41073
ujson-{citm_catalog} 6.34300 25.06400 0.00473 96.01414
simplejson-{citm_catalog} 9.54910 23.96350 0.00392 78.99315
json-{citm_catalog} 10.21250 23.52610 0.00329 78.72180
rapidjson-{citm_catalog} 10.81700 21.85400 0.00343 73.94939

jsonexamples/citm_catalog.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{citm_catalog} 0.81040 2.11090 0.00015 1088.17698
orjson-{citm_catalog} 5.37260 18.37890 0.00451 120.86345
yyjson-{citm_catalog} 5.61430 23.18500 0.00548 110.29924
ujson-{citm_catalog} 6.25850 30.79090 0.00604 95.50805
simplejson-{citm_catalog} 9.36560 24.44860 0.00510 77.50571
json-{citm_catalog} 10.07650 25.29490 0.00450 76.18267
rapidjson-{citm_catalog} 10.69120 27.84880 0.00493 70.98005

jsonexamples/mesh.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
yyjson-{mesh} 2.33710 13.01130 0.00171 331.50569
simdjson-{mesh} 2.52960 13.19230 0.00159 311.37935
orjson-{mesh} 2.88770 12.13010 0.00152 287.31080
ujson-{mesh} 3.64020 18.23620 0.00227 193.35645
json-{mesh} 5.97130 13.58290 0.00136 150.01621
rapidjson-{mesh} 7.54270 16.14480 0.00155 119.37806
simplejson-{mesh} 8.64370 16.35320 0.00136 106.25888

jsonexamples/mesh.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{mesh} 1.02020 2.74930 0.00013 919.93044
yyjson-{mesh} 2.30970 13.06730 0.00182 347.76076
orjson-{mesh} 2.85260 12.41860 0.00156 290.19432
ujson-{mesh} 3.59400 16.68610 0.00227 201.03704
json-{mesh} 5.96300 19.18900 0.00185 146.04645
rapidjson-{mesh} 7.43860 16.32260 0.00164 121.84979
simplejson-{mesh} 8.62160 21.89280 0.00221 101.30905

jsonexamples/gsoc-2018.json deserialization

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{gsoc-2018} 5.52590 16.27430 0.00178 145.59797
yyjson-{gsoc-2018} 5.62040 16.46250 0.00168 155.97459
orjson-{gsoc-2018} 5.78420 13.87300 0.00140 148.84293
simplejson-{gsoc-2018} 7.76200 15.26480 0.00142 114.98827
ujson-{gsoc-2018} 7.96570 21.53840 0.00188 110.29162
json-{gsoc-2018} 8.63300 19.26320 0.00172 102.78744
rapidjson-{gsoc-2018} 10.55570 19.20210 0.00159 85.84087

jsonexamples/gsoc-2018.json deepest key

Name Min (ms) Max (ms) StdDev Ops/sec
simdjson-{gsoc-2018} 1.56020 4.20200 0.00024 570.15046
yyjson-{gsoc-2018} 5.49930 14.89760 0.00158 161.14242
orjson-{gsoc-2018} 5.72650 15.88270 0.00160 153.18169
simplejson-{gsoc-2018} 7.70780 18.78120 0.00169 116.90299
ujson-{gsoc-2018} 7.91720 21.35300 0.00227 103.06755
json-{gsoc-2018} 8.65190 19.99580 0.00188 103.86934
rapidjson-{gsoc-2018} 10.52410 20.98870 0.00158 87.78973
Comments
  • Rewrite for code quality and move to simdjson 0.4.*. (Issue #31)

    This will become the version 2.0.0 release.

    • [x] Update embedded simdjson to 0.3.0 (#31)
    • [x] Update embedded simdjson to 0.4.0 (#31)
    • [x] Move from cython to pybind11
    • [ ] Rewrite documentation
    • [ ] Better CI-generated benchmarks against json, ujson, rapidjson, and orjson.
    • [x] Try to match the json.load, json.loads, json.dump and json.dumps interfaces. Will impact performance over the native simdjson API but users want plug-and-play.
    • [x] Move from appveyor and circleci to github actions for CI tasks.
    • [x] simdjson no longer requires C++17. We can greatly expand the versions of Python on Windows we can provide binary wheels for, since older versions of CPython require C extensions to be built with the same compiler version used to build the interpreter itself.
    packaging 
    opened by TkTech 44
  • The Python overhead is about 95% of the processing time

    From simdjson/scripts/javascript, I generated a file called large.json. In C++, parsing this file takes about 0.25 s.

    $ parse large.json
    Min:  0.252188 bytes read: 203130424 Gigabytes/second: 0.805471
    

    I wrote the following Python script...

    from timeit import default_timer as timer

    import simdjson

    with open('large.json', 'rb') as fin:
        x = fin.read()

    for i in range(10):
        start = timer()
        doc = simdjson.loads(x)
        end = timer()
        print(end - start)
    

    I get...

    $ time python3 test.py
    3.471898762974888
    3.9210079659242183
    3.3614078611135483
    3.72252986789681
    3.7506914171390235
    3.756883286871016
    3.752689895918593
    3.751842977013439
    3.7484844669234008
    (...)
    

    If my analysis is correct (and it could be wrong), pysimdjson takes 3.7 s to parse the file, and of that, 0.25 s are due to simdjson, leaving about 95% of the processing time to overhead.

    I know that this is known, but I wanted to provide a data point.

    opened by lemire 24
  • This parser can't support a document that big

    [email protected]:~$ time python convert-to-pickle.py
    Traceback (most recent call last):
      File "convert-to-pickle.py", line 10, in <module>
        data = simdjson.loads(ch.read())
      File "/usr/local/lib/python3.8/dist-packages/simdjson/__init__.py", line 52, in loads
        return parser.parse(s, True)
      File "simdjson/csimdjson.pyx", line 468, in csimdjson.Parser.parse
    ValueError: This parser can't support a document that big

    invalid zero-effort 
    opened by ghost 17
  • File causes a crash in pysimdjson (reliably)

    I am copying over issue https://github.com/simdjson/simdjson/issues/921 from simdjson. We do not see a crash in simdjson itself, but there is a crash in pysimdjson:

    import simdjson
    a = open("test.txt").read()
    b = simdjson.loads(a.encode())
    

    Using the file https://github.com/simdjson/simdjson/files/4749603/test.txt

    opened by lemire 17
  • Unable to serialize simdjson Objects into Pickle

    Hello all!

    When I try to serialize simdjson Object into Pickle, I get the following error:

    TypeError: self.c_element,self.c_parser cannot be converted to a Python object for pickling

    Would it be possible to add support for serializing/pickling simdjson instances directly, without converting them to dict? If not pickling, then at least an ability to serialize into .json would be lovely as well.
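
    For illustration only (not part of the original report): one workaround is to materialize the proxy into plain Python objects before pickling. This sketch assumes the proxies' as_dict()/as_list() helpers:

    import pickle
    import simdjson

    parser = simdjson.Parser()
    doc = parser.parse(b'{"hello": "world"}')
    payload = pickle.dumps(doc.as_dict())  # plain dicts and lists pickle fine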

    opened by vovavili 9
  • Fairly high overhead on the boundary Python/C++

    We are parsing a very high number of ~2KB JSON files in our Python-based application.

    • The native (C++) SIMDJSON library delivers ~700k parser cycles per second.
    • pysimdjson delivers ~350k parser cycles per second.
    • The Cython-based PoC implementation (in-house, so far) delivers ~700k parser cycles per second (very close to C++ implementation).

    I also conducted a rather artificial test of how many parser cycles I can get with basically empty JSON ({}). The issue here is quite visible: the overhead of the Python<->pysimdjson boundary crossing is high relative to other possible implementations.

    A "parser cycle" is defined as a one call to parser.parse(json) on the existing parser instance.

    I'm not 100% sure if this is a priority of this library, so feel free to close this one as irrelevant.

    opened by ateska 9
  • Segfault when not assigning the parser to a variable

    Here is a Python session that segfaults:

    >>> import simdjson
    >>> pa=simdjson.Parser().parse('{"a": 9999}')
    >>> pa["a"]
    zsh: segmentation fault (core dumped)  python
    

    And here is one that works:

    >>> import simdjson
    >>> p = simdjson.Parser()
    >>> pa = p.parse('{"a": 9999}')
    >>> pa["a"]
    9999
    

    It's unclear to me why the first one segfaults, and it looks like a bug?

    I imagine the parser is garbage collected by Python in the first example, but it's still clearly in use by the "pa" variable?

    bug 
    opened by palkeo 9
  • Consider upgrading to simdjson 0.4

    Version 0.4 of simdjson is now available

    Highlights

    • Test coverage has been greatly improved and we have resolved many static-analysis warnings on different systems.

    New features:

    • We added a fast (8GB/s) minifier that works directly on JSON strings.
    • We added fast (10GB/s) UTF-8 validator that works directly on strings (any strings, including non-JSON).
    • The array and object elements have a constant-time size() method.

    Performance:

    • Performance improvements to the API (type(), get<>()).
    • The parse_many function (ndjson) has been entirely reworked. It now uses a single secondary thread instead of several new threads.
    • We have introduced a faster UTF-8 validation algorithm (lookup3) for all kernels (ARM, x64 SSE, x64 AVX).

    System support:

    • C++11 support for older compilers and systems.
    • FreeBSD support (and tests).
    • We support the clang front-end compiler (clangcl) under Visual Studio.
    • It is now possible to target ARM platforms under Visual Studio.
    • The simdjson library will never abort or print to standard output/error.

    Version 0.3 of simdjson is now available

    Highlights

    • Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually. API docs / Design Details
    • Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. Docs
    • Exact Float Parsing: Now simdjson parses floats flawlessly without any performance loss (https://github.com/simdjson/simdjson/pull/558). Blog Post
    • Even Faster: The fastest parser got faster! With a shiny new UTF-8 validator and meticulously refactored SIMD core, simdjson 0.3 is 15% faster than before, running at 2.5 GB/s (where 0.2 ran at 2.2 GB/s).

    Minor Highlights

    • Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines.
    • Automatic allocation: as part of API simplification, the parser no longer has to be preallocated; it will adjust automatically when it encounters larger files.
    • Runtime selection API: We've exposed simdjson's runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
    • Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to T, and exceptions will be thrown automatically if an error happens. Use the same API either way!
    • Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
    opened by lemire 8
  • Windows 3.6 Binary?

    Hi! Thanks again for this fantastic project. ^_^

    I ran into some CI errors where my Windows 64-bit builds were dying due to compile errors with CPython 3.6. I noticed there's no wheel on PyPI for it.

    Would it be possible to fix?

    Thanks!

    enhancement packaging 
    opened by william-silversmith 6
  • Pysimdjson fails to install on python 3.6

      Using cached https://files.pythonhosted.org/packages/9b/f6/c63260f8788574de8fdd0bbe70f803328cb058141c0903ba29637d89f863/pysimdjson-2.5.0.tar.gz
    Installing collected packages: pysimdjson
      Running setup.py install for pysimdjson ... error
        Complete output from command /home/ubuntu/ctix-2/venv/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-wzmco2i3/pysimdjson/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-jelehspx/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/ctix-2/venv/include/site/python3.6/pysimdjson:
        running install
        running build
        running build_py
        creating build
        creating build/lib.linux-x86_64-3.6
        creating build/lib.linux-x86_64-3.6/simdjson
        copying simdjson/__init__.py -> build/lib.linux-x86_64-3.6/simdjson
        running build_ext
        building 'csimdjson' extension
        creating build/temp.linux-x86_64-3.6
        creating build/temp.linux-x86_64-3.6/simdjson
        x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include -I/home/ubuntu/ctix-2/venv/include -I/usr/include/python3.6m -c simdjson/binding.cpp -o build/temp.linux-x86_64-3.6/simdjson/binding.o -std=c++11
        In file included from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/pytypes.h:12:0,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/cast.h:13,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/attr.h:13,
                         from /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/pybind11.h:44,
                         from simdjson/binding.cpp:5:
        /home/ubuntu/ctix-2/venv/lib/python3.6/site-packages/pybind11/include/pybind11/detail/common.h:112:20: fatal error: Python.h: No such file or directory
        compilation terminated.
        error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    
        ----------------------------------------
    Command "/home/ubuntu/ctix-2/venv/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-wzmco2i3/pysimdjson/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-jelehspx/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/ctix-2/venv/include/site/python3.6/pysimdjson" failed with error code 1 in /tmp/pip-install-wzmco2i3/pysimdjson/
    You are using pip version 19.0, however version 20.2.4 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.
    opened by anudeepsamaiya 6
  • Build binary packages using clang-cl on Windows

    Support for clang-cl is coming. As part of the PR that allows CPython to build against clang-cl, distutils is updated to build with clang-cl (https://github.com/python/cpython/pull/18371). Once this PR is merged and a new CPython release includes it we can start using it for our binary releases.

    Clang has reached a point where it's safe enough for us to use with CPython builds made with MSVC2015 or newer. https://clang.llvm.org/docs/MSVCCompatibility.html

    This would alleviate poor Windows performance caused by MSVC issues (https://github.com/simdjson/simdjson/issues/847, but not entirely, https://github.com/simdjson/simdjson/issues/848).

    We only need to do this if upstream simdjson doesn't figure out what's up with MSVC. @lemire

    enhancement packaging blocked 
    opened by TkTech 6
  • Float aware mini

    simdjson minify drops the trailing '.0' from floats, which is fine by JSON spec, but matters in practice. For example, Elasticsearch dynamic field type detection is affected. In general, Python distinguishes between int and float, so various type guarantees may fail. The dump/load cycle should not convert types for a few byte gain. Let users explicitly convert types, if they need to.

    This modifies minify, so it does not drop the '.0'.

    Note: simdjson started dropping '.0' with d0821adf0e7934f27a8eb5c2fe9b8254e4.
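
    For illustration only (not code from this PR), the type concern in plain standard-library terms: a minifier that rewrites 10.0 as 10 changes the round-tripped Python type from float to int.

    import json

    original = json.loads('{"price": 10.0}')
    minified = json.loads('{"price": 10}')   # simulates a ".0"-dropping minifier
    assert isinstance(original['price'], float)
    assert isinstance(minified['price'], int)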

    opened by edgarsi 8
  • Performance penalty when reading items

    I'm getting increased latency in my application from simdjson but I can't figure out why.

    This is a snippet from profiling the function that gets items from the simdjson object. The time is in seconds.

       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           26    0.017    0.001    0.017    0.001 {method 'get' of 'csimdjson.Object' objects}
    
    

    When I time getting individual items from the same object, I get timings of about 15 microseconds, which seems comparable to getting items from a normal Python dictionary. However, when I test the whole function the performance is much worse.

    opened by jonathan-kosgei 1
  • Improve user experience of memory safety.

    We've added a check in v4 (https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437) that prevents parsing new documents while references to the old one still exist. This is correct, in that it ensures no errors. I wasn't terribly happy with this, but it's better than segfaulting.

    It has downsides:

    • It sucks as a user (https://github.com/TkTech/pysimdjson/issues/53#issuecomment-850494991), where you might have to del the old objects, even if you didn't intend to use them again. Very un-pythonic.
    • Doesn't work on PyPy, where del is unreliable. The objects may not be garbage collected until much later.

    Brainstorming welcome. Alternatives:

    • Probably the easiest approach would be for a Parser to keep a list of Object and Array proxies that hold a reference to it, and set a dirty bit on them when parse() is called with a different document. The performance of this would probably be unacceptable - I might be wrong.
    • Use the new parse_into_document() and create a new document for every parse. This is potentially both slow and very wasteful with memory, but would let us keep a document around and valid for as long as Object or Array reference it.
    enhancement help wanted 
    opened by TkTech 3
  • Provide the ability to link to system simdjson

    Bundling a library is a serious sin in our book, so provide the ability to link to the system library. I've also done some refactoring to avoid exponential growth of Extension calls. The default behavior remains the same, so it shouldn't affect existing users.

    That said, the patch isn't perfect. It still uses the bundled headers instead of system headers but it should be good enough for us.

    opened by mgorny 2
  • Expose document_stream interface

    The pysimdjson library could support our document_stream interface (parse_many function). It is well tested as of release 0.7 (with fuzz testing) and works well today. It supports streams of indefinite size.

    See https://github.com/simdjson/simdjson/blob/master/doc/parse_many.md

    Related to https://github.com/TkTech/pysimdjson/issues/70

    enhancement 
    opened by lemire 4