Tools for parsing messy tabular data.

Comments
  • HTMLTableSet

    Hi, here's an HTML Table Set importer for messytables.

    It's not fantastic yet, but it's a pretty good start:

    • Supports rowspan/colspan - currently by inserting blank cells.
    • Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.
    • Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.
    • Not enormously well tested, but seems to work on the tables I've fed it so far.
    • Requires lxml.
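
The rowspan/colspan handling described in the list above can be sketched without lxml. This is a minimal illustration, not the importer's actual code: the function name `expand_spans` and the `(text, rowspan, colspan)` input shape are assumptions made for the example. Each spanned cell keeps its text in the top-left position and the rest of the span is filled with blank cells.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid.

    Hypothetical helper: the top-left position of each span keeps the
    text; every other position covered by the span becomes "".
    """
    grid = {}  # maps (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text if (dr, dc) == (0, 0) else ""
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell are also returned as blanks.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table = [
    [("A", 2, 1), ("B", 1, 2)],  # A spans two rows, B spans two columns
    [("C", 1, 1), ("D", 1, 1)],
]
print(expand_spans(table))
```

Nested tables are not considered here, which matches the caveat in the list.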

    It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.

    opened by scraperdragon 12
  • All releases BROKEN due to json-table-schema name change

    json-table-schema is a broken dependency as of yesterday. This affects the current and previous releases on PyPI.

    To fix this on our side we changed the dependency in https://github.com/okfn/messytables/pull/143, so messytables installs from source again, but it still needs a release to PyPI. I don't have permission to do that.

    (test) /tmp$ pip install messytables
    Downloading/unpacking messytables
      Downloading messytables-0.15.0.tar.gz
      Running setup.py egg_info for package messytables
    
    Downloading/unpacking xlrd>=0.8.0 (from messytables)
      Downloading xlrd-0.9.4.tar.gz (322Kb): 322Kb downloaded
      Running setup.py egg_info for package xlrd
    
    Downloading/unpacking python-magic>=0.4.6 (from messytables)
      Downloading python-magic-0.4.10.tar.gz
      Running setup.py egg_info for package python-magic
    
        no previously-included directories found matching 'test'
    Downloading/unpacking chardet>=2.3.0 (from messytables)
      Downloading chardet-2.3.0.tar.gz (164Kb): 164Kb downloaded
      Running setup.py egg_info for package chardet
    
        warning: no files found matching 'COPYING'
        warning: no files found matching '*.html' under directory 'docs'
        warning: no files found matching '*.css' under directory 'docs'
        warning: no files found matching '*.png' under directory 'docs'
        warning: no files found matching '*.gif' under directory 'docs'
    Downloading/unpacking python-dateutil>=2.4.2 (from messytables)
      Downloading python-dateutil-2.4.2.tar.gz (209Kb): 209Kb downloaded
      Running setup.py egg_info for package python-dateutil
    
    Downloading/unpacking lxml>=3.2 (from messytables)
      Downloading lxml-3.5.0b1.tar.gz (3.8Mb): 3.8Mb downloaded
      Running setup.py egg_info for package lxml
        Building lxml version 3.5.0b1.
        Building without Cython.
        Using build configuration of libxslt 1.1.26
        Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu
    
        warning: no previously-included files found matching '*.py'
    Downloading/unpacking requests (from messytables)
      Downloading requests-2.8.1.tar.gz (480Kb): 480Kb downloaded
      Running setup.py egg_info for package requests
    
    Downloading/unpacking html5lib (from messytables)
      Downloading html5lib-1.0b8.tar.gz (889Kb): 889Kb downloaded
      Running setup.py egg_info for package html5lib
    
    Downloading/unpacking json-table-schema>=0.2 (from messytables)
      Downloading json-table-schema-0.5.0.tar.gz
      Running setup.py egg_info for package json-table-schema
        json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
          File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
            with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
        IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
        Complete output from command python setup.py egg_info:
        json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
    
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
        with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
    IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
    
    ----------------------------------------
    Command python setup.py egg_info failed with error code 1 in /tmp/test/build/json-table-schema
    Storing complete log in /home/co/.pip/pip.log
    
    opened by davidread 11
  • Getting messytables to run on Python 3

    Does anyone know, informally or otherwise, what it will take to get messytables running on Python 3?

    I'm keen to use various functions and modules from messytables, but I'm trying to maintain 2.7/3.3/3.4 support in my own libraries.

    opened by pwalsh 11
  • Application for maintainership

    Hey all. This repository seems to be semi-inactive, and it is unclear to me what the path to merging a PR like #171 is (who would have to approve?). I use messytables in production code day to day, and this lack of clarity on process makes the library a liability. My understanding is that okfn's resources and interest are focused on goodtables and the frictionlessdata toolchain.

    I would therefore like to apply to become the maintainer for messytables, merge #171 & co., and generally make sure that changes in this thing are handled and bugs are actively tracked.

    Thoughts, @pwalsh, @davidread, @rufuspollock? Please let me know.

    opened by pudo 10
  • TypeError("object of type 'float' has no len()",) when calling type_guess

    I could trace this back to #141 where len() is being used in the test() method of DateUtilType.

    I think there should be a try/except block around that, that catches this TypeError. But I'm not too familiar with the code, so I'm basically asking if you agree, or if I'm missing something.
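
A minimal sketch of the proposed guard, for discussion. The real check inside DateUtilType.test() is not reproduced here, so the helper name `has_min_length` and the length threshold are assumptions; the point is only that a float reaching len() is treated as "not a candidate" instead of raising.

```python
def has_min_length(value, min_length=6):
    """Return True if value has a length of at least min_length.

    Hypothetical helper: values without a length (floats, ints, None)
    are rejected instead of raising TypeError.
    """
    try:
        return len(value) >= min_length
    except TypeError:
        # e.g. a float cell read from an Excel sheet
        return False


print(has_min_length("2015-11-02"))
print(has_min_length(3.14))
```

An alternative with the same effect would be an explicit isinstance check for strings before calling len().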

    I'm happy to provide the PR.

    BTW: I'm getting this error via datapusher on an Excel sheet that is being parsed with the default parameters. The sheet does indeed have a lot of float values in it.

    opened by metaodi 10
  • [discussion] messytables should *only* work with local files

    Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

    The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls), and of the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

    Downloading the content to temporary storage isn't an onerous task, and if the interface were modified to use filenames instead of file objects it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
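
To make the "temporary storage" idea concrete, here is a rough sketch using only the standard library. The function name `spool_to_tempfile` is made up for illustration and is not part of messytables; it copies any readable file-like object (a socket-backed response, for instance) to disk in chunks and returns a filename that can then be seeked freely.

```python
import shutil
import tempfile


def spool_to_tempfile(fileobj, chunk_size=64 * 1024):
    """Copy a readable binary file-like object to a temporary file.

    Hypothetical helper: returns the path of the temporary file.
    The caller is responsible for deleting it afterwards.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".download", delete=False)
    with tmp:
        # Chunked copy, so the whole payload is never held in memory.
        shutil.copyfileobj(fileobj, tmp, chunk_size)
    return tmp.name
```

The returned path could be handed to anything that needs seek(), or to a filename-based detector, without forcing a read(-1) on the original stream.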

    question 
    opened by rossjones 10
  • Support for PDF format

    We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here)

    This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.
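
The optional-requirement behaviour described above usually comes down to a guarded import. The following is a generic sketch of that pattern, not the code from this pull request; `optional_import` and `PDF_SUPPORT` are names invented for the example.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# pdftables is optional: when it is missing, PDF handling is switched off
# and PDF tests can be skipped based on this flag.
pdftables = optional_import("pdftables")
PDF_SUPPORT = pdftables is not None
```

Tests that need the optional library can then check `PDF_SUPPORT` and skip themselves when it is False.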

    We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

    opened by fawkesley 9
  • [WIP] Support for ODS files.

    A reworked reader for ODS files that doesn't use any broken third-party libraries. Reads the .xml directly from the zipfile and performs much better on larger spreadsheets.
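
For readers unfamiliar with the format: an ODS file is a zip archive whose cell data lives in content.xml, so it can be read with the standard library alone. The sketch below illustrates that idea and is not the reader from this pull request. The function name `read_ods_rows` is an assumption, it ignores repeated-cell attributes, and it parses the whole document at once, so it says nothing about the performance work mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def read_ods_rows(fileobj):
    """Yield (sheet_name, [cell_text, ...]) for each row in an ODS file."""
    with zipfile.ZipFile(fileobj) as archive:
        with archive.open("content.xml") as content:
            tree = ET.parse(content)
    name_attr = "{%s}name" % NS["table"]
    for sheet in tree.iter("{%s}table" % NS["table"]):
        for row in sheet.findall("table:table-row", NS):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                # A cell may hold several paragraphs; join them with newlines.
                paragraphs = cell.findall("text:p", NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            yield sheet.get(name_attr), cells
```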

    opened by rossjones 9
  • libmagic error following messytables overview

    I'm working from http://messytables.readthedocs.org/en/latest/ but have also looked at the GitHub readme, etc. I couldn't find any actual install instructions anywhere, but here's what I did.

    Environment: Mac OS X latest, up to date homebrew

    1. pip install messytables

    2. brew install libmagic

    3. The following Python:

      % python                
      Python 2.7.6 (default, Nov 14 2013, 09:55:56) 
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import messytables
      >>> messytables.any_tableset(open('README.txt', 'rb'))
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 138, in any_tableset
          magic_mime = get_mime(fileobj)
        File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 38, in get_mime
          mimetype = magic.from_buffer(header, mime=True)
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 103, in from_buffer
          def __init__(self, ms):
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 94, in _get_magic_type
          _list = _libraries['magic'].magic_list
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 83, in _get_magic_mime
          _load.restype = c_int
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 51, in __init__
          magic_set._fields_ = []
        File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 138, in errorcheck
          except:
      magic.MagicException: no magic files loaded
      
    4. README.txt:

      ============
      README
      ============
      
      A single-line README file.
      
    opened by dhalperi 8
  • Remove openpyxl, use XLSTableSet for XLSX files

    Phase 1 of 2 for completely removing openpyxl and using XLSTableSet instead. (Phase 2 will actually remove the dependency and excelx.py, then you won't be able to reference XLSXTableSet)

    If you always use any_tableset it'll just work correctly - you'll now get back an XLSTableSet instead of an XLSXTableSet.

    I've left the latter in with a DeprecationWarning (and a test) in order to remain compatible with code that references XLSXTableSet explicitly.
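
For illustration, keeping a deprecated class name alive typically looks like the sketch below. The class bodies are stand-ins written for this example and are not the code in this pull request; only the shape of the warning is the point.

```python
import warnings


class XLSTableSet(object):
    """Stand-in for the real reader; only the constructor matters here."""

    def __init__(self, fileobj=None):
        self.fileobj = fileobj


class XLSXTableSet(XLSTableSet):
    """Deprecated name kept so that existing imports keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "XLSXTableSet is deprecated; use any_tableset or XLSTableSet",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this class
        )
        super(XLSXTableSet, self).__init__(*args, **kwargs)
```

A test can assert the warning with `warnings.catch_warnings(record=True)`.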

    I feel we should encourage people towards only using any_tableset (perhaps with an argument to override the type detection). It's quite awkward that our users are currently coupling needlessly to our class naming convention. Unless I've missed a use case: are there any compelling reasons to allow that?

    Not ready to merge yet I suspect. Closes #83

    opened by fawkesley 8
  • 65 rework of detection in any.py

    We were having problems with any.py, so I rewrote it.

    Features:

    • new extension detection function (you can pass a whole filename/URL)
    • nice lists of mimetypes/extensions parsed
    • special pleading for XLS/XLSX files :(
    • tests for autodetection
    • various fixes
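
As a rough idea of what extension detection from a whole filename or URL can look like, here is a small standard-library sketch. It is not the function added in this pull request; the name `guess_extension` is an assumption, and it targets Python 3 only.

```python
import os
from urllib.parse import urlparse


def guess_extension(name_or_url):
    """Return the lower-cased extension of a filename or URL, or None.

    Hypothetical helper: query strings and fragments are ignored.
    """
    path = urlparse(name_or_url).path
    ext = os.path.splitext(path)[1]
    return ext[1:].lower() or None


print(guess_extension("http://example.com/data/report.XLSX?download=1"))
print(guess_extension("table.csv"))
```

The result could then be looked up in a list of supported extensions alongside the mimetype check.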
    opened by scraperdragon 8
  • Failure to load with Python 3.10

    Attempting to use messytables with Python 3.10 results in the following error:

      File "/layers/google.python.pip/pip/lib/python3.10/site-packages/messytables/core.py", line 2, in <module>
        from collections import Mapping
    ImportError: cannot import name 'Mapping' from 'collections' (/opt/python3.10/lib/python3.10/collections/__init__.py)
    

    This is because the deprecated alias collections.Mapping was removed in Python 3.10; Mapping has been available from collections.abc since Python 3.3.

    core.py should be updated to take account of this.
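
One possible shape for the fix, assuming Python 2.7 still needs to be supported (if it does not, the try/except can be reduced to the first import):

```python
try:
    from collections.abc import Mapping  # Python 3.3 and later
except ImportError:
    from collections import Mapping  # Python 2.7 fallback

# Plain dicts are registered as Mappings either way.
print(isinstance({}, Mapping))
```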

    opened by davidharcombe 0
  • Bump lxml from 4.3.4 to 4.9.1

    Bumps lxml from 4.3.4 to 4.9.1.

    Changelog

    Sourced from lxml's changelog.

    4.9.1 (2022-07-01)

    Bugs fixed

    • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

    4.9.0 (2022-06-01)

    Bugs fixed

    • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

    Other changes

    • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

    • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

    • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

    4.8.0 (2022-02-17)

    Features added

    • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

    • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

    Bugs fixed

    • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

    ... (truncated)

    Commits
    • d01872c Prevent parse failure in new test from leaking into later test runs.
    • d65e632 Prepare release of lxml 4.9.1.
    • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
    • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
    • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
    • b9f7074 Remove debug print from test.
    • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
    • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
    • 853c9e9 Prepare release of 4.9.0.
    • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • messytables guesses wrong type for decimal number

    Describe the bug: messytables should guess decimals correctly, respecting the locale configuration. For example, in Germany the comma is used as the decimal separator, but the value 1,200 is guessed as type "text".

    This issue was initially reported as ckan issue https://github.com/ckan/ckan/issues/5769, where I first noticed it.

    The type guessing seems to happen here: https://github.com/okfn/messytables/blob/51b736892a48e420ab313675f54901c77b446dec/messytables/types.py and seems to be locale-specific. (I think the magic happens in line 100: value = locale.atof(value).)

    Unfortunately, Python still seems to treat the dot as the decimal point even when a German locale is set, which I could reproduce in my local environment:

    >>> locale.getlocale()
    ('de_DE', 'cp1252')
    >>> locale.atof('1,200')
    
    Traceback (most recent call last):
      File "<pyshell#35>", line 1, in <module>
        locale.atof('1,200')
      File "C:\Program Files\Python27\lib\locale.py", line 318, in atof
        return func(string)
    ValueError: invalid literal for float(): 1,200
    >>> locale.localeconv()
    {'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
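
A side note on the session above: locale.getlocale() reports the LC_CTYPE category by default, and the localeconv() output shows decimal_point as '.', which suggests LC_NUMERIC was still the default "C" locale there. Independently of that, a parser that takes the separators explicitly avoids depending on process-wide locale state. The sketch below is only an illustration; `parse_decimal` and its parameters are hypothetical and not part of messytables.

```python
def parse_decimal(value, decimal_point=",", thousands_sep="."):
    """Parse a decimal string using explicit separators.

    Hypothetical helper: defaults follow the German convention
    (comma as decimal separator, dot as thousands separator).
    """
    cleaned = value.strip()
    if thousands_sep:
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_point != ".":
        cleaned = cleaned.replace(decimal_point, ".")
    return float(cleaned)


print(parse_decimal("1,200"))
print(parse_decimal("1.234,5"))
```

float() still raises ValueError for input that is not numeric, so a type guesser could catch that and fall back to "text".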
    
    opened by wrinklenose 1
  • test_attempt_read_encrypted_no_password_xls failure in Python 3.7+

    This line specifies an error message. In the test, the text of the exception caused by the code under test is expected to match exactly.

    errmsg = "Can't read Excel file: XLRDError('Workbook is encrypted',)"
    

    When running tests on Python 3.7 and 3.8 this fails, because their outputs do not contain the comma (probably due to this change in Python 3.7, I'm guessing).
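
One way to make such a message stable across versions is to build it from the exception's class name and str() instead of its repr(), since Python 3.7 dropped the trailing comma from the repr of single-argument exceptions. The sketch below uses a stand-in exception class so it runs without xlrd; `describe_read_error` is a hypothetical helper, and using it would change the message text the test expects.

```python
class XLRDError(Exception):
    """Stand-in for xlrd's exception so the sketch runs without xlrd."""


def describe_read_error(error):
    """Format an error message that does not depend on repr()."""
    return "Can't read Excel file: %s: %s" % (type(error).__name__, error)


print(describe_read_error(XLRDError("Workbook is encrypted")))
```

Alternatively, the test could check only that "Workbook is encrypted" appears in the message.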

    opened by StevenMaude 0
  • requirements-test.txt should have xlrd==1.2.0 (or >=) for Python 3.8+ tests

    opened by StevenMaude 0
Releases(0.15.1)
Owner
Open Knowledge Foundation
Also find us at: @frictionlessdata @opentrials @openspending @openknowledge-archive
Directions overlay for working with pandas in an analysis environment

dovpanda Directions OVer PANDAs Directions are hints and tips for using pandas in an analysis environment. dovpanda is an overlay companion for workin

dovpandev 431 Dec 20, 2022
functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

Coursera 188 Nov 24, 2022
dplyr for python

Dplython: Dplyr for Python Welcome to Dplython: Dplyr for Python. Dplyr is a library for the language R designed to make data analysis fast and easy.

Chris Riederer 754 Nov 21, 2022
A Python toolkit for processing tabular data

meza: A Python toolkit for processing tabular data Index Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installat

Reuben Cummings 401 Dec 19, 2022
Tools for parsing messy tabular data.

Parsing for messy tables A library for dealing with messy tabular data in several formats, guessing types and detecting headers. See the documentation

Open Knowledge Foundation 382 Nov 10, 2022
Clean APIs for data cleaning. Python implementation of R package Janitor

pyjanitor pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data. Why janitor? Originally a port of

Eric Ma 1.1k Jan 01, 2023
Easy pipelines for pandas DataFrames.

pdpipe ˨ Easy pipelines for pandas DataFrames (learn how!). Website: https://pdpipe.github.io/pdpipe/ Documentation: https://pdpipe.github.io/pdpipe/d

694 Jan 05, 2023
BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

BatchFlow BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflo

Data Analysis Center 185 Dec 20, 2022
Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

Prodmodel 53 Nov 29, 2022
Pandas integration with sklearn

Sklearn-pandas This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames. In particular, it provides

2.7k Dec 27, 2022
Microsoft Azure provides a wide number of services for managing and storing data

Microsoft Azure provides a wide range of services for managing and storing data. One product is Microsoft Azure SQL, which gives us the capability to create and manage instances of SQL Servers hoste

Riya Vijay Vishwakarma 1 Dec 12, 2021