tabulator-py

A library for reading and writing tabular data (csv/xls/json/etc).

[Important Notice] We have released Frictionless Framework. This framework is logical continuation of tabulator that was extended to be a complete data solution. The change in not breaking for the existing software so no actions are required. Please read the Migration Guide from tabulator to Frictionless Framework.

we continue to bug-fix [email protected] in this repository as well as it's available on PyPi as it was before

please note that [email protected] version's API, we're working on at the moment, is not stable

we will release [email protected] by the end of 2020 to be the first SemVer/stable version

Features

Supports most common tabular formats: CSV, XLS, ODS, JSON, Google Sheets, SQL, and others. See complete list below.
Loads local and remote data: Supports HTTP, FTP and S3.
Low memory usage: Only the current row is kept in memory, so you can large datasets.
Supports compressed files: Using ZIP or GZIP algorithms.
Extensible: You can add support for custom file formats and loaders (e.g. FTP).

Getting started
Documentation
API Reference
Contributing
Changelog

Getting started

Installation

$ pip install tabulator

Running on CLI

Tabulator ships with a simple CLI called tabulator to read tabular data. For example:

$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
id,name
1,english
2,中国人

You can see all supported options by running tabulator --help.

Running on Python

from tabulator import Stream

with Stream('data.csv', headers=1) as stream:
    stream.headers # [header1, header2, ..]
    for row in stream:
        print(row)  # [value1, value2, ..]

You can find other examples in the examples directory.

Documentation

In the following sections, we'll walk through some usage examples of this library. All examples were tested with Python 3.6, but should run fine with Python 3.3+.

Working with Stream

The Stream class represents a tabular stream. It takes the file path as the source argument. For example:


   
    ://path/to/file.

It uses this path to determine the file format (e.g. CSV or XLS) and scheme (e.g. HTTP or postgresql). It also supports format extraction from URLs like http://example.com?format=csv. If necessary, you also can define these explicitly.

Let's try it out. First, we create a Stream object passing the path to a CSV file.

import tabulator

stream = tabulator.Stream('data.csv')

At this point, the file haven't been read yet. Let's open the stream so we can read the contents.

try:
    stream.open()
except tabulator.TabulatorException as e:
    pass  # Handle exception

This will open the underlying data stream, read a small sample to detect the file encoding, and prepare the data to be read. We catch tabulator.TabulatorException here, in case something goes wrong.

We can now read the file contents. To iterate over each row, we do:

for row in stream.iter():
    print(row)  # [value1, value2, ...]

The stream.iter() method will return each row data as a list of values. If you prefer, you could call stream.iter(keyed=True) instead, which returns a dictionary with the column names as keys. Either way, this method keeps only a single row in memory at a time. This means it can handle handle large files without consuming too much memory.

If you want to read the entire file, use stream.read(). It accepts the same arguments as stream.iter(), but returns all rows at once.

stream.reset()
rows = stream.read()

Notice that we called stream.reset() before reading the rows. This is because internally, tabulator only keeps a pointer to its current location in the file. If we didn't reset this pointer, we would read starting from where we stopped. For example, if we ran stream.read() again, we would get an empty list, as the internal file pointer is at the end of the file (because we've already read it all). Depending on the file location, it might be necessary to download the file again to rewind (e.g. when the file was loaded from the web).

After we're done, close the stream with:

stream.close()

The entire example looks like:

import tabulator

stream = tabulator.Stream('data.csv')
try:
    stream.open()
except tabulator.TabulatorException as e:
    pass  # Handle exception

for row in stream.iter():
    print(row)  # [value1, value2, ...]

stream.reset()  # Rewind internal file pointer
rows = stream.read()

stream.close()

It could be rewritten to use Python's context manager interface as:

import tabulator

try:
    with tabulator.Stream('data.csv') as stream:
        for row in stream.iter():
            print(row)

        stream.reset()
        rows = stream.read()
except tabulator.TabulatorException as e:
    pass

This is the preferred way, as Python closes the stream automatically, even if some exception was thrown along the way.

The full API documentation is available as docstrings in the Stream source code.

Headers

By default, tabulator considers that all file rows are values (i.e. there is no header).

with Stream([['name', 'age'], ['Alex', 21]]) as stream:
  stream.headers # None
  stream.read() # [['name', 'age'], ['Alex', 21]]

If you have a header row, you can use the headers argument with the its row number (starting from 1).

# Integer
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
  stream.headers # ['name', 'age']
  stream.read() # [['Alex', 21]]

You can also pass a lists of strings to define the headers explicitly:

with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
  stream.headers # ['name', 'age']
  stream.read() # [['Alex', 21]]

Tabulator also supports multiline headers for the xls and xlsx formats.

with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
  stream.headers # ['header from row 1-3']
  stream.read() # [['value1', 'value2', 'value3']]

Encoding

You can specify the file encoding (e.g. utf-8 and latin1) via the encoding argument.

with Stream(source, encoding='latin1') as stream:
  stream.read()

If this argument isn't set, Tabulator will try to infer it from the data. If you get a UnicodeDecodeError while loading a file, try setting the encoding to utf-8.

Compression (Python3-only)

Tabulator supports both ZIP and GZIP compression methods. By default it'll infer from the file name:

with Stream('http://example.com/data.csv.zip') as stream:
  stream.read()

You can also set it explicitly:

with Stream('data.csv.ext', compression='gz') as stream:
  stream.read()

Options

filename: filename in zip file to process (default is first file)

Allow html

The Stream class raises tabulator.exceptions.FormatError if it detects HTML contents. This helps avoiding the relatively common mistake of trying to load a CSV file inside an HTML page, for example on GitHub.

You can disable this behaviour using the allow_html option:

with Stream(source_with_html, allow_html=True) as stream:
  stream.read() # no exception on open

Sample size

To detect the file's headers, and run other checks like validating that the file doesn't contain HTML, Tabulator reads a sample of rows on the stream.open() method. This data is available via the stream.sample property. The number of rows used can be defined via the sample_size parameters (defaults to 100).

with Stream(two_rows_source, sample_size=1) as stream:
  stream.sample # only first row
  stream.read() # first and second rows

You can disable this by setting sample_size to zero. This way, no data will be read on stream.open().

Bytes sample size

Tabulator needs to read a part of the file to infer its encoding. The bytes_sample_size arguments controls how many bytes will be read for this detection (defaults to 10000).

source = 'data/special/latin1.csv'
with Stream(source) as stream:
    stream.encoding # 'iso8859-2'

You can disable this by setting bytes_sample_size to zero, in which case it'll use the machine locale's default encoding.

Ignore blank headers

When True, tabulator will ignore columns that have blank headers (defaults to False).

# Default behaviour
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1) as stream:
    stream.headers # ['header1', '', 'header3']
    stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}

# Ignoring columns with blank headers
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
    stream.headers # ['header1', 'header3']
    stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}

Ignore listed/not-listed headers

The option is similar to the ignore_blank_headers. It removes arbitrary columns from the data based on the corresponding column names:

# Ignore listed headers (omit columns)
source = 'text://header1,header2,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_listed_headers=['header2']) as stream:
    assert stream.headers == ['header1', 'header3']
    assert stream.read(keyed=True) == [
        {'header1': 'value1', 'header3': 'value3'},
    ]

# Ignore NOT listed headers (pick colums)
source = 'text://header1,header2,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_not_listed_headers=['header2']) as stream:
    assert stream.headers == ['header2']
    assert stream.read(keyed=True) == [
        {'header2': 'value2'},
    ]

Force strings

When True, all rows' values will be converted to strings (defaults to False). None values will be converted to empty strings.

# Default behaviour
with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]) as stream:
  stream.read() # [['string', 1, datetime.dateime(2017, 12, 1, 17, 00)]]

# Forcing rows' values as strings
with Stream([['string', 1]], force_strings=True) as stream:
  stream.read() # [['string', '1', '2017-12-01 17:00:00']]

Force parse

When True, don't raise an exception when parsing a malformed row, but simply return an empty row. Otherwise, tabulator raises tabulator.exceptions.SourceError when a row can't be parsed. Defaults to False.

# Default behaviour
with Stream([[1], 'bad', [3]]) as stream:
  stream.read() # raises tabulator.exceptions.SourceError

# With force_parse
with Stream([[1], 'bad', [3]], force_parse=True) as stream:
  stream.read() # [[1], [], [3]]

Skip rows

List of row numbers and/or strings to skip. If it's a string, all rows that begin with it will be skipped (e.g. '#' and '//'). If it's the empty string, all rows that begin with an empty column will be skipped.

source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
  stream.read() # [['Mike', 4]]

If the headers parameter is also set to be an integer, it will use the first not skipped row as a headers.

source = [['#comment'], ['name', 'order'], ['John', 1], ['Alex', 2]]
with Stream(source, headers=1, skip_rows=['#']) as stream:
  stream.headers # [['name', 'order']]
  stream.read() # [['Jogn', 1], ['Alex', 2]]

Post parse

List of functions that can filter or transform rows after they are parsed. These functions receive the extended_rows containing the row's number, headers list, and the row values list. They then process the rows, and yield or discard them, modified or not.

def skip_odd_rows(extended_rows):
    for row_number, headers, row in extended_rows:
        if not row_number % 2:
            yield (row_number, headers, row)

def multiply_by_two(extended_rows):
    for row_number, headers, row in extended_rows:
        doubled_row = list(map(lambda value: value * 2, row))
        yield (row_number, headers, doubled_row)

rows = [
  [1],
  [2],
  [3],
  [4],
]
with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
  stream.read() # [[4], [8]]

These functions are applied in order, as a simple data pipeline. In the example above, multiply_by_two just sees the rows yielded by skip_odd_rows.

Keyed and extended rows

The methods stream.iter() and stream.read() accept the keyed and extended flag arguments to modify how the rows are returned.

By default, every row is returned as a list of its cells values:

with Stream([['name', 'age'], ['Alex', 21]]) as stream:
  stream.read() # [['Alex', 21]]

With keyed=True, the rows are returned as dictionaries, mapping the column names to their values in the row:

with Stream([['name', 'age'], ['Alex', 21]]) as stream:
  stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]

And with extended=True, the rows are returned as a tuple of (row_number, headers, row), there row_number is the current row number (starting from 1), headers is a list with the headers names, and row is a list with the rows values:

with Stream([['name', 'age'], ['Alex', 21]]) as stream:
  stream.read(extended=True) # (1, ['name', 'age'], ['Alex', 21])

Supported schemes

s3

It loads data from AWS S3. For private files you should provide credentials supported by the boto3 library, for example, corresponding environment variables. Read more about configuring boto3.

stream = Stream('s3://bucket/data.csv')

Options

s3_endpoint_url - the endpoint URL to use. By default it's https://s3.amazonaws.com. For complex use cases, for example, goodtables's runs on a data package this option can be provided as an environment variable S3_ENDPOINT_URL.

file

The default scheme, a file in the local filesystem.

stream = Stream('data.csv')

http/https/ftp/ftps

In Python 2, tabulator can't stream remote data sources because of a limitation in the underlying libraries. The whole data source will be loaded to the memory. In Python 3 there is no such problem and remote files are streamed.

stream = Stream('https://example.com/data.csv')

Options

http_session - a requests.Session object. Read more in the requests docs.
http_stream - Enables or disables HTTP streaming, when possible (enabled by default). Disable it if you'd like to preload the whole file into memory.
http_timeout - This timeout will be used for a requests session construction.

stream

The source is a file-like Python object.

with open('data.csv') as fp:
    stream = Stream(fp)

text

The source is a string containing the tabular data. Both scheme and format must be set explicitly, as it's not possible to infer them.

stream = Stream(
    'name,age\nJohn, 21\n',
    scheme='text',
    format='csv'
)

Supported file formats

In this section, we'll describe the supported file formats, and their respective configuration options and operations. Some formats only support read operations, while others support both reading and writing.

csv (read & write)

stream = Stream('data.csv', delimiter=',')

Options

It supports all options from the Python CSV library. Check their documentation for more information.

xls/xlsx (read & write)

Tabulator is unable to stream xls files, so the entire file is loaded in memory. Streaming is supported for xlsx files.

stream = Stream('data.xls', sheet=1)

Options

sheet: Sheet name or number (starting from 1).
workbook_cache: An empty dictionary to handle workbook caching. If provided, tabulator will fill the dictionary with source: tmpfile_path pairs for remote workbooks. Each workbook will be downloaded only once and all the temporary files will be deleted on the process exit. Defauts: None
fill_merged_cells: if True it will unmerge and fill all merged cells by a visible value. With this option enabled the parser can't stream data and load the whole document into memory.
preserve_formatting: if True it will try to preserve text formatting of numeric and temporal cells returning it as strings according to how it looks in a spreadsheet (EXPERIMETAL)
adjust_floating_point_error: if True it will correct the Excel behaviour regarding floating point numbers

ods (read only)

This format is not included to package by default. To use it please install tabulator with an ods extras: $ pip install tabulator[ods]

Source should be a valid Open Office document.

stream = Stream('data.ods', sheet=1)

Options

sheet: Sheet name or number (starting from 1)

gsheet (read only)

A publicly-accessible Google Spreadsheet.

stream = Stream('https://docs.google.com/spreadsheets/d/
   
    ?usp=sharing'
   )
stream = Stream('https://docs.google.com/spreadsheets/d/
   
    edit#gid=
    
     '
    
   )

sql (read & write)

Any database URL supported by sqlalchemy.

stream = Stream('postgresql://name:[email protected]:5432/database', table='data')

Options

table (required): Database table name
order_by: SQL expression for row ordering (e.g. name DESC)

Data Package (read only)

This format is not included to package by default. You can enable it by installing tabulator using pip install tabulator[datapackage].

A Tabular Data Package.

stream = Stream('datapackage.json', resource=1)

Options

resource: Resource name or index (starting from 0)

inline (read only)

Either a list of lists, or a list of dicts mapping the column names to their respective values.

stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])

json (read & write)

JSON document containing a list of lists, or a list of dicts mapping the column names to their respective values (see the inline format for an example).

stream = Stream('data.json', property='key1.key2')

Options

property: JSON Path to the property containing the tabular data. For example, considering the JSON {"response": {"data": [...]}}, the property should be set to response.data.
keyed (write): Save as array of arrays (default) or as array of dicts (keyed).

ndjson (read only)

stream = Stream('data.ndjson')

tsv (read only)

stream = Stream('data.tsv')

html (read only)

This format is not included to package by default. To use it please install tabulator with the html extra: $ pip install tabulator[html]

An HTML table element residing inside an HTML document.

Supports simple tables (no merged cells) with any legal combination of the td, th, tbody & thead elements.

Usually foramt='html' would need to be specified explicitly as web URLs don't always use the .html extension.

stream = Stream('http://example.com/some/page.aspx', format='html' selector='.content .data table#id1', raw_html=True)

Options

selector: CSS selector for specifying which table element to extract. By default it's table, which takes the first table element in the document. If empty, will assume the entire page is the table to be extracted (useful with some Excel formats).
raw_html: False (default) to extract the textual contents of each cell. True to return the inner html without modification.

Custom file sources and formats

Tabulator is written with extensibility in mind, allowing you to add support for new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There are three components that allow this:

Loaders
- Loads a stream from some location (e.g. ssh)
Parsers
- Parses a stream of tabular data in some format (e.g. xls)
Writers
- Writes tabular data to some destination (e.g. MongoDB)

In this section, we'll see how to write custom classes to extend any of these components.

Custom loaders

You can add support for a new scheme (e.g. ssh) by creating a custom loader. Custom loaders are implemented by inheriting from the Loader class, and implementing its methods. This loader can then be used by Stream to load data by passing it via the custom_loaders={'scheme': CustomLoader} argument.

The skeleton of a custom loader looks like:

from tabulator import Loader

class CustomLoader(Loader):
  options = []

  def __init__(self, bytes_sample_size, **options):
      pass

  def load(self, source, mode='t', encoding=None):
      # load logic

with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
  stream.read()

You can see examples of how the loaders are implemented by looking in the tabulator.loaders module.

Custom parsers

You can add support for a new file format by creating a custom parser. Similarly to custom loaders, custom parsers are implemented by inheriting from the Parser class, and implementing its methods. This parser can then be used by Stream to parse data by passing it via the custom_parsers={'format': CustomParser} argument.

The skeleton of a custom parser looks like:

from tabulator import Parser

class CustomParser(Parser):
    options = []

    def __init__(self, loader, force_parse, **options):
        self.__loader = loader

    def open(self, source, encoding=None):
        # open logic

    def close(self):
        # close logic

    def reset(self):
        # reset logic

    @property
    def closed(self):
        return False

    @property
    def extended_rows(self):
        # extended rows logic

with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
  stream.read()

You can see examples of how parsers are implemented by looking in the tabulator.parsers module.

Custom writers

You can add support to write files in a specific format by creating a custom writer. The custom writers are implemented by inheriting from the base Writer class, and implementing its methods. This writer can then be used by Stream to write data via the custom_writers={'format': CustomWriter} argument.

The skeleton of a custom writer looks like:

from tabulator import Writer

class CustomWriter(Writer):
  options = []

  def __init__(self, **options):
      pass

  def write(self, source, target, headers=None, encoding=None):
      # write logic

with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
  stream.save(target)

You can see examples of how parsers are implemented by looking in the tabulator.writers module.

API Reference

`cli`

cli(source, limit, **options)

Command-line interface

Usage: tabulator [OPTIONS] SOURCE

Options:
  --headers INTEGER
  --scheme TEXT
  --format TEXT
  --encoding TEXT
  --limit INTEGER
  --sheet TEXT/INTEGER (excel)
  --fill-merged-cells BOOLEAN (excel)
  --preserve-formatting BOOLEAN (excel)
  --adjust-floating-point-error BOOLEAN (excel)
  --table TEXT (sql)
  --order_by TEXT (sql)
  --resource TEXT/INTEGER (datapackage)
  --property TEXT (json)
  --keyed BOOLEAN (json)
  --version          Show the version and exit.
  --help             Show this message and exit.

`Stream`

Stream(self,
       source,
       headers=None,
       scheme=None,
       format=None,
       encoding=None,
       compression=None,
       allow_html=False,
       sample_size=100,
       bytes_sample_size=10000,
       ignore_blank_headers=False,
       ignore_listed_headers=None,
       ignore_not_listed_headers=None,
       multiline_headers_joiner=' ',
       multiline_headers_duplicates=False,
       force_strings=False,
       force_parse=False,
       pick_rows=None,
       skip_rows=None,
       pick_fields=None,
       skip_fields=None,
       pick_columns=None,
       skip_columns=None,
       post_parse=[],
       custom_loaders={},
       custom_parsers={},
       custom_writers={},
       **options)

Stream of tabular data.

This is the main tabulator class. It loads a data source, and allows you to stream its parsed contents.

Arguments

source (str):
    Path to file as ``
   
    ://path/to/file.
    
     ``.
    If not explicitly set, the scheme (file, http, ...) and
    format (csv, xls, ...) are inferred from the source string.

headers (Union[int, List[int], List[str]], optional):
    Either a row
    number or list of row numbers (in case of multi-line headers) to be
    considered as headers (rows start counting at 1), or the actual
    headers defined a list of strings. If not set, all rows will be
    treated as containing values.

scheme (str, optional):
    Scheme for loading the file (file, http, ...).
    If not set, it'll be inferred from `source`.

format (str, optional):
    File source's format (csv, xls, ...). If not
    set, it'll be inferred from `source`. inferred

encoding (str, optional):
    Source encoding. If not set, it'll be inferred.

compression (str, optional):
    Source file compression (zip, ...). If not set, it'll be inferred.

pick_rows (List[Union[int, str, dict]], optional):
    The same as `skip_rows` but it's for picking rows instead of skipping.

skip_rows (List[Union[int, str, dict]], optional):
    List of row numbers, strings and regex patterns as dicts to skip.
    If a string, it'll skip rows that their first cells begin with it e.g. '#' and '//'.
    To skip only completely blank rows use `{'type': 'preset', 'value': 'blank'}`
    To provide a regex pattern use  `{'type': 'regex', 'value': '^#'}`
    For example: `skip_rows=[1, '# comment', {'type': 'regex', 'value': '^# (regex|comment)'}]`

pick_fields (str[]):
    When passed, ignores all columns with headers
    that the given list DOES NOT include

skip_fields (str[]):
    When passed, ignores all columns with headers
    that the given list includes. If it contains an empty string it will skip
    empty headers

sample_size (int, optional):
    Controls the number of sample rows used to
    infer properties from the data (headers, encoding, etc.). Set to
    ``0`` to disable sampling, in which case nothing will be inferred
    from the data. Defaults to ``config.DEFAULT_SAMPLE_SIZE``.

bytes_sample_size (int, optional):
    Same as `sample_size`, but instead
    of number of rows, controls number of bytes. Defaults to
    ``config.DEFAULT_BYTES_SAMPLE_SIZE``.

allow_html (bool, optional):
    Allow the file source to be an HTML page.
    If False, raises ``exceptions.FormatError`` if the loaded file is
    an HTML page. Defaults to False.

multiline_headers_joiner (str, optional):
    When passed, it's used to join multiline headers
    as `
     
      .join(header1_1, header1_2)`
    Defaults to ' ' (space).

multiline_headers_duplicates (bool, optional):
    By default tabulator will exclude a cell of a miltilne header from joining
    if it's exactly the same as the previous seen value in this field.
    Enabling this option will force duplicates inclusion
    Defaults to False.

force_strings (bool, optional):
    When True, casts all data to strings.
    Defaults to False.

force_parse (bool, optional):
    When True, don't raise exceptions when
    parsing malformed rows, simply returning an empty value. Defaults
    to False.

post_parse (List[function], optional):
    List of generator functions that
    receives a list of rows and headers, processes them, and yields
    them (or not). Useful to pre-process the data. Defaults to None.

custom_loaders (dict, optional):
    Dictionary with keys as scheme names,
    and values as their respective ``Loader`` class implementations.
    Defaults to None.

custom_parsers (dict, optional):
    Dictionary with keys as format names,
    and values as their respective ``Parser`` class implementations.
    Defaults to None.

custom_loaders (dict, optional):
    Dictionary with keys as writer format
    names, and values as their respective ``Writer`` class
    implementations. Defaults to None.

**options (Any, optional): Extra options passed to the loaders and parsers.

`stream.closed`

Returns True if the underlying stream is closed, False otherwise.

Returns

bool: whether closed

`stream.compression`

Stream's compression ("no" if no compression)

Returns

str: compression

`stream.encoding`

Stream's encoding

Returns

str: encoding

`stream.format`

Path's format

Returns

str: format

`stream.fragment`

Path's fragment

Returns

str: fragment

`stream.hash`

Returns the SHA256 hash of the read chunks if available

Returns

str/None: SHA256 hash

`stream.headers`

Headers

Returns

str[]/None: headers if available

`stream.sample`

Returns the stream's rows used as sample.

These sample rows are used internally to infer characteristics of the source file (e.g. encoding, headers, ...).

Returns

list[]: sample

`stream.scheme`

Path's scheme

Returns

str: scheme

`stream.size`

Returns the BYTE count of the read chunks if available

Returns

int/None: BYTE count

`stream.source`

Source

Returns

any: stream source

`stream.open`

stream.open()

Opens the stream for reading.

Raises:

TabulatorException: if an error

`stream.close`

stream.close()

Closes the stream.

`stream.reset`

stream.reset()

Resets the stream pointer to the beginning of the file.

`stream.iter`

stream.iter(keyed=False, extended=False)

Iterate over the rows.

Each row is returned in a format that depends on the arguments keyed and extended. By default, each row is returned as list of their values.

Arguments

keyed (bool, optional): When True, each returned row will be a dict mapping the header name to its value in the current row. For example, [{'name': 'J Smith', 'value': '10'}]. Ignored if extended is True. Defaults to False.
extended (bool, optional): When True, returns each row as a tuple with row number (starts at 1), list of headers, and list of row values. For example, (1, ['name', 'value'], ['J Smith', '10']). Defaults to False.

Raises

exceptions.TabulatorException: If the stream is closed.

Returns

Iterator[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]: The row itself. The format depends on the values of keyed and extended arguments.

`stream.read`

stream.read(keyed=False, extended=False, limit=None)

Returns a list of rows.

Arguments

keyed (bool, optional): See :func:Stream.iter.
extended (bool, optional): See :func:Stream.iter.
limit (int, optional): Number of rows to return. If None, returns all rows. Defaults to None.

Returns

List[Union[List[Any], Dict[str, Any], Tuple[int, List[str], List[Any]]]]: The list of rows. The format depends on the values of keyed and extended arguments.

`stream.save`

stream.save(target, format=None, encoding=None, **options)

Save stream to the local filesystem.

Arguments

target (str): Path where to save the stream.
format (str, optional): The format the stream will be saved as. If None, detects from the target path. Defaults to None.
encoding (str, optional): Saved file encoding. Defaults to config.DEFAULT_ENCODING.
**options: Extra options passed to the writer.

Returns

count (int?): Written rows count if available Building index... Started generating documentation...

`Loader`

Loader(self, bytes_sample_size, **options)

Abstract class implemented by the data loaders

The loaders inherit and implement this class' methods to add support for a new scheme (e.g. ssh).

Arguments

bytes_sample_size (int): Sample size in bytes
**options (dict): Loader options

`loader.options`

`loader.load`

loader.load(source, mode='t', encoding=None)

Load source file.

Arguments

source (str): Path to tabular source file.
mode (str, optional): Text stream mode, t (text) or b (binary). Defaults to t.
encoding (str, optional): Source encoding. Auto-detect by default.

Returns

Union[TextIO, BinaryIO]: I/O stream opened either as text or binary.

`Parser`

Parser(self, loader, force_parse, **options)

Abstract class implemented by the data parsers.

The parsers inherit and implement this class' methods to add support for a new file type.

Arguments

loader (tabulator.Loader): Loader instance to read the file.
force_parse (bool): When True, the parser yields an empty extended row tuple (row_number, None, []) when there is an error parsing a row. Otherwise, it stops the iteration by raising the exception tabulator.exceptions.SourceError.
**options (dict): Loader options

`parser.closed`

Flag telling if the parser is closed.

Returns

bool: whether closed

`parser.encoding`

Encoding

Returns

str: encoding

`parser.extended_rows`

Returns extended rows iterator.

The extended rows are tuples containing (row_number, headers, row),

Raises

SourceError: If force_parse is False and a row can't be parsed, this exception will be raised. Otherwise, an empty extended row is returned (i.e. (row_number, None, [])).

Returns: Iterator[Tuple[int, List[str], List[Any]]]: Extended rows containing (row_number, headers, row), where headers is a list of the header names (can be None), and row is a list of row values.

`parser.options`

`parser.open`

parser.open(source, encoding=None)

Open underlying file stream in the beginning of the file.

The parser gets a byte or text stream from the tabulator.Loader instance and start emitting items.

Arguments

source (str): Path to source table.
encoding (str, optional): Source encoding. Auto-detect by default.

Returns

None

`parser.close`

parser.close()

Closes underlying file stream.

`parser.reset`

parser.reset()

Resets underlying stream and current items list.

After reset() is called, iterating over the items will start from the beginning.

`Writer`

Writer(self, **options)

Abstract class implemented by the data writers.

The writers inherit and implement this class' methods to add support for a new file destination.

Arguments

**options (dict): Writer options.

`writer.options`

`writer.write`

writer.write(source, target, headers, encoding=None)

Writes source data to target.

Arguments

source (str): Source data.
target (str): Write target.
headers (List[str]): List of header names.
encoding (str, optional): Source file encoding.

Returns

count (int?): Written rows count if available

`validate`

validate(source, scheme=None, format=None)

Check if tabulator is able to load the source.

Arguments

source (Union[str, IO]): The source path or IO object.
scheme (str, optional): The source scheme. Auto-detect by default.
format (str, optional): The source file format. Auto-detect by default.

Raises

SchemeError: The file scheme is not supported.
FormatError: The file format is not supported.

Returns

bool: Whether tabulator is able to load the source file.

`TabulatorException`

TabulatorException()

Base class for all tabulator exceptions.

`SourceError`

SourceError()

The source file could not be parsed correctly.

`SchemeError`

SchemeError()

The file scheme is not supported.

`FormatError`

FormatError()

The file format is unsupported or invalid.

`EncodingError`

EncodingError()

Encoding error

`CompressionError`

CompressionError()

Compression error

`IOError`

IOError()

Local loading error

`LoadingError`

LoadingError()

Local loading error

`HTTPError`

HTTPError()

Remote loading error

Contributing

The project follows the Open Knowledge International coding standards.

Recommended way to get started is to create and activate a project virtual environment. To install package and development dependencies into active environment:

$ make install

To run tests with linting and coverage:

$ make test

To run tests without Internet:

$ pytest -m 'not remote

Changelog

Here described only breaking and the most important changes. The full changelog and documentation for all released versions could be found in nicely formatted commit history.

v1.53

Add support for raw_html extraction in html parser (#341)

v1.52

Published stream.dialect (works only for csv, for now)

v1.51

Added experimental table discovery options

v1.50

Ensure that headers is always a list of strings

v1.49

Added workbook_cache argument for XLSX formats

v1.48

Published stream.hashing_algorithm property

v1.47

Added hashing_algorithm parameter

v1.46

Fixed multiline headers merging
Introduced a multiline_headers_duplicates flag

v1.45

HTML format: adds support for empty selector (#321)

v1.44

Exposed stream.compression

v1.43

Exposed stream.source

v1.42

Exposed format option to the CLI

v1.41

Implemented a pick_rows parameter (opposite to skip_rows)

v1.40

Implemented stream.save() returning count of written rows

v1.39

Implemented JSON writer (#311)

v1.38

Use chardet for encoding detection by default. For cchardet: pip install tabulator[cchardet]. Due to a great deal of problems caused by ccharted for non-Linux/Conda installations we're returning back to using chardet by default.

v1.37

Raise IOError/HTTPError even a not-existent file has a bad format (#304)

v1.36

Implemented blank preset for skip_rows (#302)

v1.35

Added skip/pick_columns aliases for (#293)

v1.34

Added multiline_headers_joiner argument (#291)

v1.33

Added support for regex patterns in skip_rows (#290)

v1.32

Added ability to skip columns (#293)

v1.31

Added xlsx writer
Added html reader

v1.30

Added adjust_floating_point_error parameter to the xlsx parser

v1.29

Implemented the stream.size and stream.hash properties

v1.28

Added SQL writer

v1.27

Added http_timeout argument for the http/https format

v1.26

Added stream.fragment field showing e.g. Excel sheet's or DP resource's name

v1.25

Added support for the s3 file scheme (data loading from AWS S3)

v1.24

Added support for compressed file-like objects

v1.23

Added a setter for the stream.headers property

v1.22

The headers parameter will now use the first not skipped row if the skip_rows parameter is provided and there are comments on the top of a data source (see #264)

v1.21

Implemented experimental preserve_formatting for xlsx

v1.20

Added support for specifying filename in zip source

v1.19

Updated behaviour:

For ods format the boolean, integer and datetime native types are detected now

v1.18

Updated behaviour:

For xls format the boolean, integer and datetime native types are detected now

v1.17

Updated behaviour:

Added support for Python 3.7

v1.16

New API added:

skip_rows support for an empty string to skip rows with an empty first column

v1.15

New API added:

Format will be extracted from URLs like http://example.com?format=csv

v1.14

Updated behaviour:

Now xls booleans will be parsed as booleans not integers

v1.13

New API added:

The skip_rows argument now supports negative numbers to skip rows starting from the end

v1.12

Updated behaviour:

Instead of raising an exception, a UserWarning warning will be emitted if an option isn't recognized.

v1.11

New API added:

Added http_session argument for the http/https format (it uses requests now)
Added support for multiline headers: headers argument accept ranges like [1,3]

v1.10

New API added:

Added support for compressed files i.e. zip and gz on Python3
The Stream constructor now accepts a compression argument
The http/https scheme now accepts a http_stream flag

v1.9

Improved behaviour:

The headers argument allows to set the order for keyed sources and cherry-pick values

v1.8

New API added:

Formats XLS/XLSX/ODS supports sheet names passed via the sheet argument
The Stream constructor accepts an ignore_blank_headers option

v1.7

Improved behaviour:

Rebased datapackage format on [email protected] library

v1.6

New API added:

Argument source for the Stream constructor can be a pathlib.Path

v1.5

New API added:

Argument bytes_sample_size for the Stream constructor

v1.4

Improved behaviour:

Updated encoding name to a canonical form

v1.3

New API added:

stream.scheme
stream.format
stream.encoding

Promoted provisional API to stable API:

Loader (custom loaders)
Parser (custom parsers)
Writer (custom writers)
validate

v1.2

Improved behaviour:

Autodetect common CSV delimiters

v1.1

New API added:

Added fill_merged_cells option to xls/xlsx formats

v1.0

New API added:

published Loader/Parser/Writer API
Added Stream argument force_strings
Added Stream argument force_parse
Added Stream argument custom_writers

Deprecated API removal:

removed topen and Table - use Stream instead
removed Stream arguments loader/parser_options - use **options instead

Provisional API changed:

Updated the Loader/Parser/Writer API - please use an updated version

v0.15

Provisional API added:

Unofficial support for Stream arguments custom_loaders/parsers

Standardize management tools:
make for tasks (no deps, simple, works)

pylint

let users manage their own venv

info.json available inside package so can be used, for example, in __version__ as well as setup

follow common practice of not having requirements.txt for libs

changes to various CLI config files

prep travis for automating package deployment

@roll this is not for merging right at this moment: it is for your feedback. Sorry I'm using your codebase to test this out, but, you know why we are doing this :).
opened by pwalsh 23
Feature/management update
Work based on https://github.com/datapackages/tabulator-py/pull/40 PR and discussion.

All things in PR above

Included - pylint, travis, no activate, no editorconfig, no requirements.txt etc

Make for tasks

As discussed in the PR topic tox is a great test-runner but make is better suitable for run tasks. Also one of the main thing - there is no need to install make to do initial dependency install.

All metadata in setup.py

As suggested by @vitorbaptista the most pythonic way is to have all package metadata (except readme and version) in setup.py - that's true almost all devs look into setup.py to read it or figure out where to find it. So no package/info.json, requirements*.txt. Version has to be parsed from package (TODO, very simple).

All tools available

After make install in venv contributor will be having all tools prepared to work. Environment variable has been removed from tox so all the tools py.test, pylint, tox etc could be used as usual without make call. Of course on high-level we provide make review and make test.

DRY

Unique files per project:

.coveragerc (may be we'll find a workaround how to remove hardcoded package name from here)

Makefile (may be we'll find a workaround how to remove hardcoded package name from here)

README.md

setup.py

Initially standard files per project:

.gitignore

.pylintrc

.travis.yml

CONTRIBUTING.md

LICENSE.md

MANIFEST.in

setup.cfg

tox.ini

Basic contribution guide (only tech aspects) for this proposal - https://github.com/datapackages/tabulator-py/blob/feature/management-update/CONTRIBUTING.md - deadly simple with no deps to start.
opened by roll 16
Make use of content-type to detect format
Currently, when using tabulator with URL sources, format is infered from:

suffix (e.g. https://example.com/files/foobar.csv -> csv)

format URL parameter (e.g. https://example.com/foobar?format=xls&param=baz -> xls)

One step beyond: when an URL has no format suffix, neither format url parameter, tabulator can request HEAD information and infer format from Content-type information. E.g content type 'text/csv' means csv format.

Content-type to format is defined as a dict (CONTENT_TYPE_FORMAT) that can be extended as needed to support more mime-types.
wontfix review
opened by pierredittgen 14
Implement sql writer
Overview

For now tabulator support only stream.save(format='csv') to csv format. It's pretty easy to implement sql writer just porting writers.csv.CSVWriter to writers.sql.SQLWrter.

What we're aiming for:

from tabulator import Stream with Stream('data.xls', headers=1) as stream: stream.save('postgresql://user:[email protected]:5432/database', table='excel_export')

And of course it will be a pretty cool and useful feature :+1:

Plan

[ ] port writers.csv.CSVWriter to writers.sql.SQLWrter

[ ] register new writer in config.py

[ ] add writing tests to tests.formats.sql

[ ] mention writing ability in readme sql format section

feature contribute
opened by roll 13
Add force strings option to stringify output

Add a switch in the Stream constructor to force all fields as strings. Currently it seems that this feature would only be relevant for Excel sources. See discussion here.
feature

opened by cyberbikepunk 10
Feature/updated api
Overview

Things needed by goodtables.next and https://github.com/frictionlessdata/jsontableschema-py/pull/90. Also some additional improvements.

Issues

fixes #36

fixes #71

fixes #72

fixes #75

fixes #76

fixes #77

fixes #78
opened by roll 10
Write mode for topen
Overview

It's a long shot but eventually I suppose it could be implemented.

Having a task to write some tabular data to the filesystem is casual. With all the boilerplate code to support py2/3, csv verbose interface etc - it's a little bit annoying.

Analysis

Interface could be:

with topen('table.csv', mode='w') as table: table.write(data)

Implementation could be:

topen returns ReadTable or WriteTable regarding to the mode='r/w'

for writing there will be new modules like Formatter (anti-parser) and Writer (anti-loader) with the same modular arhictecture for different targets and formats.

So we will be able to have memory lean things like:

with (topen('http://site.com/source.xls') as source, topen('target.csv', mode='w') as target): target.write(source)

Or even with something like tcopy helper:

tcopy(data, 'target.csv') tcopy('http://site.com/source.xls', 'target.csv')

Tasks

[ ] TBD

[ ] We should support recoding - https://github.com/frictionlessdata/tabulator-py/issues/50

feature
opened by roll 10
Added support for regex patterns in skip_rows
fixes https://github.com/BCODMO/frictionless-usecases/issues/34

@akariv @cschloer
Here is regex support for skip_rows:

skip_rows=[1, '# comment', '^# (regex|comment)']

What do you think :

it's OK to have one argument for it OR

it's better to use skip_rows and skip_rows_regex
opened by roll 9
Ability to skip last rows.
fixes https://github.com/frictionlessdata/tabulator-py/issues/224

fixes #226

Added tests, code and updated readme.

Details

I added one more built-in processor for Stream (tabulator/stream.py line 448) , which has a buffer, so it could delete rows counting them from the end.
opened by AcckiyGerman 9
Remove jsontableschema dependency
It fixes #51

Instead of having processors.Schema(schema) now we have processors.Convert(converter) to convert string to python objects. It could be used also with JTS:

schema = jsontableschema.Schema(descriptor) topen('path.csv', processors=[processors.Convert(schema.convert)])
opened by roll 9
Problem with xlsx with merged cells and multiline header

Overview

When trying to read the file below, tabulator reads the headers on rows 10-12 incorrectly: http://mapipcissprd.us-east-1.elasticbeanstalk.com/api/public/population-tracking-tool/data/2017,2020/?page=1&limit=1&condition=A&export=true&country=AF

I used options headers=[10,12], fill_merged_cells=True: stream = Stream('http://mapipcissprd.us-east-1.elasticbeanstalk.com/api/public/population-tracking-tool/data/2017,2020/?page=1&limit=1&condition=A&export=true&country=AF', headers=[10,12], fill_merged_cells=True, format='xlsx') I also tried fill_merged_cells=False.

An example of what's missing can be seen in column K of the spreadsheet. The header should be for that one "Current Phase 1 #" but instead is just "Phase 1 #".

Pandas is able to read it using: df = pandas.read_excel(url, header=[9, 10, 11])

Please preserve this line to notify @roll (lead of this repository)
bug

opened by mcarans 8

Python library for reading and writing tabular data via streams.

Related tags

Overview

tabulator-py

Features

Contents

Getting started

Installation

Running on CLI

Running on Python

Documentation

Working with Stream

Headers

Encoding

Compression (Python3-only)

Allow html

Sample size

Bytes sample size

Ignore blank headers

Ignore listed/not-listed headers

Force strings

Force parse

Skip rows

Post parse

Keyed and extended rows

Supported schemes

s3

file

http/https/ftp/ftps

stream

text

Supported file formats

csv (read & write)

xls/xlsx (read & write)

ods (read only)

gsheet (read only)

sql (read & write)

Data Package (read only)

inline (read only)

json (read & write)

ndjson (read only)

tsv (read only)

html (read only)

Custom file sources and formats

Custom loaders

Custom parsers

Custom writers

API Reference

cli

Stream

stream.closed

stream.compression

stream.encoding

stream.format

stream.fragment

stream.hash

stream.headers

stream.sample

stream.scheme

stream.size

stream.source

stream.open

stream.close

stream.reset

stream.iter

stream.read

stream.save

Loader

loader.options

loader.load

Parser

parser.closed

parser.encoding

parser.extended_rows

parser.options

parser.open

parser.close

parser.reset

Writer

writer.options

`cli`

`Stream`

`stream.closed`

`stream.compression`

`stream.encoding`

`stream.format`

`stream.fragment`

`stream.hash`

`stream.headers`

`stream.sample`

`stream.scheme`

`stream.size`

`stream.source`

`stream.open`

`stream.close`

`stream.reset`

`stream.iter`

`stream.read`

`stream.save`

`Loader`

`loader.options`

`loader.load`

`Parser`

`parser.closed`

`parser.encoding`

`parser.extended_rows`

`parser.options`

`parser.open`

`parser.close`

`parser.reset`

`Writer`

`writer.options`

`writer.write`

`validate`

`TabulatorException`

`SourceError`

`SchemeError`

`FormatError`

`EncodingError`

`CompressionError`

`IOError`

`LoadingError`

`HTTPError`