Utils for streaming large files (S3, HDFS, gzip, bz2...)

Overview

smart_open — utils for streaming large files in Python


What?

smart_open is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

Python 2.7 is no longer supported. If you need Python 2.7, please use smart_open 1.10.1, the last version to support Python 2.
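For example, to pin that final Python 2-compatible release:

pip install smart_open==1.10.1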

Why?

Working with large remote files, for example using Amazon's boto3 Python library, is a pain. boto3's Object.upload_fileobj() and Object.download_fileobj() methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. smart_open shields you from that. It builds on boto3 and other remote storage libraries, but offers a clean unified Pythonic API. The result is less code for you to write and fewer bugs to make.
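To make the contrast concrete, here is a rough sketch (the bucket and key names are made up) of streaming an S3 object line by line with raw boto3 versus smart_open:

import io
import boto3

# raw boto3: download the whole object into a buffer, then wrap it for text iteration
buf = io.BytesIO()
boto3.client('s3').download_fileobj('my-bucket', 'my-key.txt', buf)
buf.seek(0)
for line in io.TextIOWrapper(buf, encoding='utf-8'):
    print(line)

# smart_open: one call, streamed line by line
from smart_open import open
for line in open('s3://my-bucket/my-key.txt'):
    print(line)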

How?

smart_open is well-tested, well-documented, and has a simple Pythonic API:

>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...    print(repr(line))
...    break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...    print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...    with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...        for line in fin:
...           fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'\n'

Other examples of URLs that smart_open accepts:

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key
s3://my_key:my_secret@my_server:my_port@my_bucket/my_key
gs://my_bucket/my_blob
azure://my_bucket/my_blob
hdfs:///path/file
hdfs://path/file
webhdfs://host:port/path/file
./local/path/file
~/local/path/file
local/path/file
./local/path/file.gz
file:///home/user/file
file:///home/user/file.bz2
[ssh|scp|sftp]://username@host//path/file
[ssh|scp|sftp]://username@host/path/file
[ssh|scp|sftp]://username:password@host/path/file

Documentation

Installation

smart_open supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. Each individual solution has its own dependencies. By default, smart_open does not install any dependencies, in order to keep the installation size small. You can install these dependencies explicitly using:

pip install smart_open[azure] # Install Azure deps
pip install smart_open[gcs] # Install GCS deps
pip install smart_open[s3] # Install S3 deps

Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:

pip install smart_open[all]

Be warned that this option increases the installation size significantly, e.g. over 100MB.

If you're upgrading from smart_open versions 2.x and below, please check out the Migration Guide.

Built-in help

For detailed API info, see the online help:

help('smart_open')

or view the help text online in your browser.

More examples

For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:

pip install smart_open[all]

>>> import os, boto3
>>>
>>> # stream content *into* S3 (write mode) using a custom session
>>> session = boto3.Session(
...     aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
...     aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
...     bytes_written = fout.write(b'hello world!')
...     print(bytes_written)
12

# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
    print(line)

# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
    print(line)

# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
    print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'here we stand')

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')

# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
    print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
    fout.write(b'hello world')

Compression Handling

The top-level compression parameter controls compression/decompression behavior when reading and writing. The supported values for this parameter are:

  • infer_from_extension (default behavior)
  • disable
  • .gz
  • .bz2

By default, smart_open determines the compression algorithm to use based on the file extension.

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

You can override this behavior to either disable compression, or explicitly specify the algorithm to use. To disable compression:

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin:
...     print(fin.read(32))
b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@'

To specify the algorithm explicitly (e.g. for non-standard file extensions):

>>> from smart_open import open, register_compressor
>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

You can also easily add support for other file extensions and compression formats. For example, to open xz-compressed files:

>>> import lzma, os
>>> from smart_open import open, register_compressor

>>> def _handle_xz(file_obj, mode):
...      return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

>>> register_compressor('.xz', _handle_xz)

>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin:
...     print(fin.read(32))
It was a bright cold day in Apri

lzma is in the standard library in Python 3.3 and greater. For 2.7, use backports.lzma.

Transport-specific Options

smart_open supports a wide range of transport options out of the box, including:

  • S3
  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS
  • GCS
  • Azure Blob Storage

Each option involves setting up its own set of parameters. For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. smart_open's open function accepts a keyword argument transport_params which accepts additional parameters for the transport layer. Here are some examples of using this parameter:

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:

help('smart_open.open')

S3 Credentials

smart_open uses the boto3 library to talk to S3. boto3 has several mechanisms for determining the credentials to use. By default, smart_open will defer to boto3 and let the latter take care of the credentials. There are several ways to override this behavior.

The first is to pass a boto3.Client object as a transport parameter to the open function. You can customize the credentials when constructing the session for the client. smart_open will then use the session when talking to S3.

session = boto3.Session(
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SESSION_TOKEN,
)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params=dict(client=client))

Your second option is to specify the credentials within the S3 URL itself:

fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

Important: The two methods above are mutually exclusive. If you pass an AWS client and the URL contains credentials, smart_open will ignore the latter.

Important: smart_open ignores configuration files from the older boto library. Port your old boto settings to boto3 in order to use them with smart_open.

Iterating Over an S3 Bucket's Contents

Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function smart_open.s3.iter_bucket() that does this efficiently, processing the bucket keys in parallel (using multiprocessing):

>>> from smart_open import s3
>>> # get data corresponding to 2010 and later under "silo-open-data/annual/monthly_rain"
>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores
>>> bucket = 'silo-open-data'
>>> prefix = 'annual/monthly_rain/'
>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3):
...     print(key, round(len(content) / 2**20))
annual/monthly_rain/2010.monthly_rain.nc 13
annual/monthly_rain/2011.monthly_rain.nc 13
annual/monthly_rain/2012.monthly_rain.nc 13

GCS Credentials

smart_open uses the google-cloud-storage library to talk to GCS. google-cloud-storage uses the google-cloud package under the hood to handle authentication. There are several options to provide credentials. By default, smart_open will defer to google-cloud-storage and let it take care of the credentials.

To override this behavior, pass a google.cloud.storage.Client object as a transport parameter to the open function. You can customize the credentials when constructing the client. smart_open will then use the client when talking to GCS. To follow along with the example below, refer to Google's guide to setting up GCS authentication with a service account.

import os
from google.cloud.storage import Client
service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
client = Client.from_service_account_json(service_account_path)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

If you need more credential options, you can create an explicit google.auth.credentials.Credentials object and pass it to the Client. To create an API token for use in the example below, refer to the GCS authentication guide.

import os
from google.auth.credentials import Credentials
from google.cloud.storage import Client
token = os.environ['GOOGLE_API_TOKEN']
credentials = Credentials(token=token)
client = Client(credentials=credentials)
fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client))

Azure Credentials

smart_open uses the azure-storage-blob library to talk to Azure Blob Storage. By default, smart_open will defer to azure-storage-blob and let it take care of the credentials.

Azure Blob Storage does not have a way of inferring credentials; therefore, passing an azure.storage.blob.BlobServiceClient object as a transport parameter to the open function is required. You can customize the credentials when constructing the client. smart_open will then use the client when talking to Azure Blob Storage. To follow along with the example below, refer to Azure's guide to setting up authentication.

import os
from azure.storage.blob import BlobServiceClient
azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))

If you need more credential options, refer to the Azure Storage authentication guide.
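For example, here is a sketch using azure-identity's DefaultAzureCredential instead of a connection string (this assumes the azure-identity package is installed; the account URL is a placeholder):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
from smart_open import open

client = BlobServiceClient(
    account_url='https://myaccount.blob.core.windows.net',  # placeholder account
    credential=DefaultAzureCredential(),
)
fin = open('azure://my_container/my_blob.txt', transport_params=dict(client=client))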

File-like Binary Streams

The open function also accepts file-like objects. This is useful when you already have a binary file open, and would like to wrap it with transparent decompression:

>>> import io, gzip
>>>
>>> # Prepare some gzipped binary data in memory, as an example.
>>> # Any binary file will do; we're using BytesIO here for simplicity.
>>> buf = io.BytesIO()
>>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
...     _ = fout.write(b'this is a bytestring')
>>> _ = buf.seek(0)
>>>
>>> # Use case starts here.
>>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
>>> import smart_open
>>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
b'this is a bytestring'

In this case, smart_open relied on the .name attribute of our binary I/O stream buf object to determine which decompressor to use. If your file object doesn't have one, set the .name attribute to an appropriate value. Furthermore, that value has to end with a known file extension (see the register_compressor function). Otherwise, the transparent decompression will not occur.
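As a minimal sketch of that fallback, relying on the passthrough behavior just described and reusing the gzipped payload from above:

>>> buf2 = io.BytesIO(buf.getvalue())  # same gzipped bytes, but no .name attribute
>>> smart_open.open(buf2, 'rb').read(2)  # no decompression: raw gzip magic bytes
b'\x1f\x8b'
>>> buf2.name = 'file.gz'  # a known extension enables the gzip decompressor
>>> _ = buf2.seek(0)
>>> smart_open.open(buf2, 'rb').read()
b'this is a bytestring'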

Drop-in replacement of pathlib.Path.open

smart_open.open can also be used with Path objects. The built-in Path.open() cannot read text from compressed files, so use patch_pathlib to replace it with smart_open.open() instead.

>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
...     print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время

How do I ...?

See this document.

Extending smart_open

See this document.

Testing smart_open

smart_open comes with a comprehensive suite of unit tests. Before you can run the test suite, install the test dependencies:

pip install -e .[test]

Now, you can run the unit tests:

pytest smart_open

The tests are also run automatically with Travis CI on every commit push & pull request.

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.

Comments
  • Data loss while writing avro file to s3 compatible storage

    Data loss while writing avro file to s3 compatible storage

    Hi,

I am converting a csv file into avro and writing it to s3-compliant storage. I see that the schema file (.avsc) is written properly. However, there is data loss while writing the .avro file. Below is a snippet of my code

    ## Code
    import smart_open
    from boto.compat import urlsplit, six
    import boto
    import boto.s3.connection
    
    import avro.schema
    from avro.datafile import  DataFileWriter 
    from avro.io import  DatumWriter
    
    import pandas as pn
    import os,sys
    
    FilePath = 's3a://mybucket/vinuthnav/csv/file1.csv' #path on s3
    
    splitInputDir = urlsplit(FilePath, allow_fragments=False)
    
    inConn = boto.connect_s3(
    	aws_access_key_id = access_key_id,
    	aws_secret_access_key = secret_access_key,
    	port=int(port),
    	host = hostname,
    	is_secure=False,
    	calling_format = boto.s3.connection.OrdinaryCallingFormat(),
    	)
    #get bucket
    inbucket = inConn.get_bucket(splitInputDir.netloc)
    #read in the csv file
    kr = inbucket.get_key(splitInputDir.path)
    with smart_open.smart_open(kr, 'r') as fin:
    	xa = pn.read_csv(fin, header=1, error_bad_lines = False).fillna('NA')
    		
    rowCount, columnCount = xa.shape #check if data frame is empty, if it is, don't write outp
    if rowCount == 0:
    	##do nothing
    	print '>> [NOTE] empty file'
    	
    
    else:
    	#generate avro schema and data
    	
    	dataFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0]+".avro")
    	schemaFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0]+".avsc")
    	
    	kwd = inbucket.get_key(urlsplit(dataFile, allow_fragments=False).path, validate=False)
    	schema = gen_schema(xa.columns)
    	
    	with smart_open.smart_open(kwd, 'wb') as foutd: 
    		
    		dictRes = xa.to_dict(orient='records')
    		writer = DataFileWriter(foutd, DatumWriter(), schema)
    		for ll, row in enumerate(dictRes):
    			writer.append(row)
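
    A note on the snippet above (an observation, not a confirmed diagnosis): avro's DataFileWriter buffers records and only flushes them when closed, so a write loop would normally end with an explicit close before the underlying stream closes, e.g.:

    with smart_open.smart_open(kwd, 'wb') as foutd:
        writer = DataFileWriter(foutd, DatumWriter(), schema)
        for row in dictRes:
            writer.append(row)
        writer.close()  # flush buffered records while the stream is still open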
    
    bug 
    opened by vinuthna91 78
  • Don't package all cloud dependencies at once

    Don't package all cloud dependencies at once

    Problem description

Smart open is a really useful library, but I find it a bit annoying to have all dependencies packaged with it. Most of the time one doesn't need to manipulate both S3 and GCS (and maybe Azure storage, if it were to be integrated in smart-open).

    What I wish I could do:

    • pip install smart-open: same behaviour as now
    • pip install smart-open[s3]: would only install the boto3 dependencies
    • pip install smart-open[gcs]: same for GCS
    • ...

    Note:

    If you find it interesting I can assign this to myself and work on the project

BTW: I think it's the same behaviour for gensim; it packages boto3, which is not needed for common NLP tasks.

    opened by Tyrannas 31
  • Use GCS blob interface

    Use GCS blob interface

    Fixes #599 - Swap to using GCS native blob open under the hood.

This should reduce the amount of custom code to maintain. I have tried to keep the interfaces identical so there are no API-breaking changes, though this does mean there is still a lot of code that can be trimmed down.

    I think it might be worth re-thinking the test coverage and if the test suites like FakeAuthorizedSessionTest are still valid/useful.

    What do you think? @petedannemann

    awaiting-response 
    opened by cadnce 29
  • Reading S3 files becomes slow after 1.5.4

    Reading S3 files becomes slow after 1.5.4

    As mentioned earlier in #74, it appears that the reading speed is very slow after 1.5.4.

    $ pyvenv-3.4 env
    $ source env/bin/activate
    $ pip install smart_open==1.5.3 tqdm ipython
    $ ipython
    
    from tqdm import tqdm
    from smart_open import smart_open
    for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
        pass
    

    2868923it [00:53, 53888.94it/s]

    $ pyvenv-3.4 env
    $ source env/bin/activate
    $ pip install smart_open==1.5.4 tqdm ipython
    $ ipython
    
    from tqdm import tqdm
    from smart_open import smart_open
    for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
        pass
    

    8401it [00:18, 442.64it/s] (too slow so I could not wait for it to finish.)

    opened by appierys 26
  • Can no longer write gzipped files.

    Can no longer write gzipped files.

    This new check has removed the ability to write gzipped files to S3.

It looks like native gzipping is being added to smart_open, and that's why this check was put in place. However, until the new write functionality is added, this check should be removed in order to allow users to write their own compressed streams.
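
    In the meantime, a sketch of one workaround, assuming a smart_open version that has the compression parameter (the bucket name here is made up): disable the automatic handling and wrap the stream yourself.

    import gzip
    from smart_open import open

    with open('s3://my-bucket/data.txt.gz', 'wb', compression='disable') as fout:
        with gzip.GzipFile(fileobj=fout, mode='wb') as gz:
            gz.write(b'hello world')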

    opened by balihoo-dengstrom 24
  • Support Azure Storage Blob

    Support Azure Storage Blob

    Support Azure Storage Blob

    Motivation

    Support of reading and writing blobs with Azure Storage Blob.

    • Fix #228

    If you're adding a new feature, then consider opening a ticket and discussing it with the maintainers before you actually do the hard work.

    Tests

    If you're fixing a bug, consider test-driven development:

    1. Create a unit test that demonstrates the bug. The test should fail.
    2. Implement your bug fix.
    3. The test you created should now pass.

    If you're implementing a new feature, include unit tests for it.

    Make sure all existing unit tests pass. You can run them locally using:

    pytest smart_open
    

    If there are any failures, please fix them before creating the PR (or mark it as WIP, see below).

    Work in progress

    If you're still working on your PR, include "WIP" in the title. We'll skip reviewing it for the time being. Once you're ready to review, remove the "WIP" from the title, and ping one of the maintainers (e.g. mpenkov).

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [x] Included tests for any new functionality
    • [x] Checked that all unit tests pass

    Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress. Rebasing can make your commits look a bit cleaner, but it also makes life more difficult for the reviewer, because they are no longer able to distinguish between code that has already been reviewed and unreviewed code.

    new-feature 
    opened by nclsmitchell 22
  • Investigate building wheels for smart_open

    Investigate building wheels for smart_open

    This has certain benefits:

    1. The pip client no longer has to build wheels itself when installing
    2. The install process is marginally faster
    3. Any others?

    Also, are there any compelling reasons to avoid building wheels?

    @menshikh-iv @piskvorky @gojomo

    housekeeping 
    opened by mpenkov 18
  • setup.py: Removed httpretty dependency

    setup.py: Removed httpretty dependency

Is httpretty really required? When looking at the code, I could not find any imports of httpretty. The file test_smart_open.py uses mock, another library for mocking. This also explains why some versions broke tests. So I suppose that httpretty is here only for legacy reasons and can therefore be removed.

    @tmylk can you double check this? I think we can remove httpretty from dependency list and thereby also resolve some issues.

    opened by nikicc 18
  • cannot import name 'open' from 'smart_open'

    cannot import name 'open' from 'smart_open'

    I am receiving the error:

    File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\utils.py", line 45
        from smart_open import open
    ImportError: cannot import name open

    I am using Python 2.7.16, gensim 3.8.2 and smart-open 1.10.1. Any ideas of what is going on?

    need-info 
    opened by littleyee 16
  • Google Cloud Storage (GCS)

    Google Cloud Storage (GCS)

    Google Cloud Storage (GCS) Support

    Motivation

    • Adds GCS Support #198

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [x] Included tests for any new functionality
    • [x] Checked that all unit tests pass

    We will need to figure out how we plan to deal with integration testing on GCP. Would RaRe be willing to host the bucket? We will need to update Travis to include those tests if so.

    EDIT: Removed comment about the testing timeout issue. Since fixing the memory issue with reads, it has gone away.

    opened by petedannemann 16
  • Cannot install if `LC_ALL=C`

    Cannot install if `LC_ALL=C`

When the system environment variable LC_ALL=C is set, I cannot install smart_open. The problem is in the dependency httpretty, since setup.py requires the version httpretty==0.8.6, which is known not to work with LC_ALL=C. The error I get is this:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 133: ordinal not in range(128)
    

httpretty fixed this error in version 0.8.8, so I am wondering if it would be possible to relax the requirement to httpretty>=0.8.6?

    I actually discovered this when trying to install gensim, which also did not work since it requires smart_open.

    opened by nikicc 16
  • Make s3u protocol have the intended non-SSL behaviour

    Make s3u protocol have the intended non-SSL behaviour

Currently, https is hardcoded in whenever an s3* protocol is used. However, the documentation states that s3u is the non-SSL version; this appears to be unimplemented.

    When s3u is used, use http rather than https.

    Tests

    Could a maintainer please advise how (if?) a test should be written for this change. I don't believe vanilla AWS S3 supports unsecured http.

    I am not able to run pytest on this PR as I don't have access to an AWS S3 bucket (I am making this change so I can use smart_open with my minio installation).

    Checklist

    Before you create the PR, please make sure you have:

    • [x] Picked a concise, informative and complete title
    • [x] Clearly explained the motivation behind the PR
    • [x] Linked to any existing issues that your PR will be solving
    • [ ] Included tests for any new functionality
    • [ ] Checked that all unit tests pass
    opened by fosslinux 0
  • WIP: Fix #684 Abort S3 MultipartUpload if exception is raised

    WIP: Fix #684 Abort S3 MultipartUpload if exception is raised

    Motivation

• Fixes #684. AWS supports multipart upload of a file: a big file is split into chunks, uploaded one by one, and concatenated again on S3. In our case, if an exception is raised while processing one of the parts, we have to abort the upload to avoid creating a corrupted file.

    Tests

    test_write_gz_with_error

    opened by lociko 0
  • Check files consistency between cloud providers storages

    Check files consistency between cloud providers storages

    Hi,

I've been experimenting with smart_open and can't figure out how to ensure that files are consistent when copying data between GCS and S3 (and vice versa).

    with open(uri=f"...", mode='rb', transport_params=dict(client=gcs_client)) as fin:
        with open(uri=f"...", mode='wb', transport_params=s3_tp) as fout:
            for line in fin:
                fout.write(line)
    

ETags are not matching (which is expected, I guess), but the files are different in size when copied from GCS to S3. gsutil shows a size of 1340495 bytes, and after copying to S3 it's 1291979 bytes (though the file itself seems OK). I've tried turning off s3 multipart_upload, but that doesn't change the behaviour.

If I use the ordinary way below to read/write files, the file size taken from GCS and written to S3 matches, and I can create a validation process.

    for blob in blobs:
        buffer = io.BytesIO()
        blob.download_to_file(buffer)
        buffer.seek(0)
    s3_client.put_object(Body=buffer, Bucket='...', Key=blob.name)
    

    Which mechanism can be used to validate files consistency after copy?

    PyDev console: 
    macOS-13.1-arm64-arm-64bit
    Python 3.10.5 (v3.10.5:f377153967, Jun  6 2022, 12:36:10) [Clang 13.0.0 (clang-1300.0.29.30)]
    smart_open 6.3.0
    
    opened by nenkie76 0
  • Fix s3.iter_bucket failure when botocore_session passed in

    Fix s3.iter_bucket failure when botocore_session passed in

    Motivation

    Fixes #670

As shown by @SootyOwl, when a user declares their own botocore_session object and passes it into s3.iter_bucket, one of two errors occurs:

    1. With smart_open.concurrency._MULTIPROCESSING = True: AttributeError: Can't pickle local object 'lazy_call.<locals>._handler
    2. With smart_open.concurrency._MULTIPROCESSING = False: RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.

    As explained here, the reason the first error occurs is that the multiprocessing module performs pickling on objects and requires those objects to be global, not local.

As explained in the original raised issue, the reason the second error occurs is that _list_bucket and _download_key both create boto3.session.Session objects out of the passed-in botocore_session, which is not allowed by the boto3 library.

    The proposed changes address both issues by creating a global session object within iter_bucket that _list_bucket and _download_key can access.

    Tests

    All existing tests related to iter_bucket within s3.py pass. I also added two new tests: test_iter_bucket_passed_in_session_multiprocessing_false and test_iter_bucket_passed_in_session_multiprocessing_true. These test the two previously failing situations.

    opened by RachitSharma2001 6
  • fix: S3 ignore seek requests to the current position

    fix: S3 ignore seek requests to the current position

    Motivation

When callers perform a seek() on an S3-backed file handle, that seek can be ignored if it is to the current position. Python's ZipFile module often seeks to the current position, causing performance to be quite slow when reading zip files from S3.

    This change compares the current position vs the destination position and preserves the buffer if possible while still populating the EOF flag.

    This addresses: #742

    opened by rustyconover 4
  • S3 ContentEncoding is disregarded

    S3 ContentEncoding is disregarded

    Problem description

This, I believe, is the same issue as #422, but for S3.

Certain libraries, like django-s3-storage, use ContentEncoding (https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L330) to express on-the-fly compression/decompression.

smart_open does not support this, and I have to manually check for the presence of ContentEncoding when reading such files. The S3 documentation specifies:

    ContentEncoding (string) -- Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

    Is this something that can/will be implemented at some point?
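
    For reference, a sketch of the manual check mentioned above (bucket and key names are placeholders): inspect ContentEncoding with a HEAD request and decompress by hand.

    import gzip
    import boto3
    from smart_open import open

    client = boto3.client('s3')
    head = client.head_object(Bucket='my-bucket', Key='data.txt')
    with open('s3://my-bucket/data.txt', 'rb', transport_params={'client': client}) as fin:
        data = fin.read()
    if head.get('ContentEncoding') == 'gzip':
        data = gzip.decompress(data)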

    Steps/code to reproduce the problem

It's hard to give precise steps, but simply put: uploading a .txt file (with a .txt extension) whose content has been gzipped and whose ContentEncoding value is "gzip" should result in automatic decompression on read, but it does not.

    Versions

    Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
    Python 3.7.10 (default, Jun  3 2021, 00:02:01)
    [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]
    smart_open 6.2.0
    
    opened by goranvinterhalter 2
Releases (v6.3.0)
  • v6.3.0(Dec 12, 2022)

    What's Changed

    • upgrade pyopenssl versions as part of github actions workflows by @mpenkov in https://github.com/RaRe-Technologies/smart_open/pull/722
    • Fixes #537 - Added documentation to support GCS anonymously by @cadnce in https://github.com/RaRe-Technologies/smart_open/pull/728
    • setup.py: Remove pathlib2 by @jayvdb in https://github.com/RaRe-Technologies/smart_open/pull/733
    • Add flake8 config globally by @cadnce in https://github.com/RaRe-Technologies/smart_open/pull/732
    • added buffer_size parameter to http module by @mullenkamp in https://github.com/RaRe-Technologies/smart_open/pull/730
    • Support for reading and writing files directly to/from ftp by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/723
    • Improve instructions for testing & contributing by @Kache in https://github.com/RaRe-Technologies/smart_open/pull/718
    • Add FTPS support (#33) by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/739
    • Bring back compression_wrapper(filename) + use case-insensitive extension matching by @piskvorky in https://github.com/RaRe-Technologies/smart_open/pull/737
    • Reconnect inactive sftp clients automatically by @Kache in https://github.com/RaRe-Technologies/smart_open/pull/719
    • Fix avoidable S3 race condition (#693) by @RachitSharma2001 in https://github.com/RaRe-Technologies/smart_open/pull/735
    • Refactor Google Cloud Storage to use blob.open by @ddelange in https://github.com/RaRe-Technologies/smart_open/pull/744
    • update CHANGELOG.md for release 6.3.0 by @mpenkov in https://github.com/RaRe-Technologies/smart_open/pull/746

    New Contributors

    • @cadnce made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/728
    • @mullenkamp made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/730
    • @RachitSharma2001 made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/723
    • @Kache made their first contribution in https://github.com/RaRe-Technologies/smart_open/pull/718

    Full Changelog: https://github.com/RaRe-Technologies/smart_open/compare/v6.2.0...v6.3.0

  • v6.2.0(Sep 14, 2022)

    6.2.0, 14 September 2022

    6.1.0, 21 August 2022

    • Add cert parameter to http transport params (PR #703, @stev-0)
    • Allow passing additional kwargs for Azure writes (PR #702, @ddelange)

  • v6.1.0(Aug 21, 2022)

  • v6.0.0(Apr 24, 2022)

    6.0.0, 24 April 2022

    This release deprecates the old ignore_ext parameter. Use the compression parameter instead.

    fin = smart_open.open("/path/file.gz", ignore_ext=True)  # 🚫 No
    fin = smart_open.open("/path/file.gz", compression="disable")  # Yes
    
    fin = smart_open.open("/path/file.gz", ignore_ext=False)  # 🚫 No
    fin = smart_open.open("/path/file.gz")  # Yes
    fin = smart_open.open("/path/file.gz", compression="infer_from_extension")  # Yes, if you want to be explicit
    
    fin = smart_open.open("/path/file", compression=".gz")  # Yes
    
    • Make Python 3.7 the required minimum (PR #688, @mpenkov)
    • Drop deprecated ignore_ext parameter (PR #661, @mpenkov)
    • Drop support for passing buffers to smart_open.open (PR #660, @mpenkov)
    • Support working directly with file descriptors (PR #659, @mpenkov)
    • Added support for viewfs:// URLs (PR #665, @ChandanChainani)
    • Fix AttributeError when reading passthrough zstandard (PR #658, @mpenkov)
    • Make UploadFailedError picklable (PR #689, @birgerbr)
    • Support container client and blob client for azure blob storage (PR #652, @cbare)
    • Pin google-cloud-storage to >=1.31.1 in extras (PR #687, @PLPeeters)
    • Expose certain transport-specific methods e.g. to_boto3 in top layer (PR #664, @mpenkov)
    • Use pytest instead of parameterizedtestcase (PR #657, @mpenkov)

    5.2.1, 28 August 2021

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use the ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

• smart_open.s3_iter_bucket → smart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

• Simpler interface for the user, fewer parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
• Close the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

• Fix problems from 1.5.4 release. Fix #153, #154, partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

• Add MANIFEST.in required for conda-forge recipe (PR #90, @tmylk)
• Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
  • v5.2.1(Aug 28, 2021)

    5.2.1, 28 August 2021

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactor smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path; see the sketch after this list. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, clean up internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when calling f.readline() (PR #182, @inksink)
    • Close the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)
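
    The pathlib.Path sketch promised above (the path is a placeholder):

    from pathlib import Path
    from smart_open import smart_open  # the 1.6.0-era entry point

    # Path objects are accepted anywhere a string path is.
    for line in smart_open(Path('data/file.txt.gz')):
        print(line)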

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from the 1.5.4 release. Fix #153, #154, partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from URL. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over HTTP. Avoid filehandle leak. Fix #109 and #110 (PR #112, @robottwo)
    • Make it possible to change the number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    • Add MANIFEST.in required for conda-forge recipe (PR #90, @tmylk)
    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable, allowing smart_open to run on Google Compute Engine (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • Re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • Return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • Pass optional keywords on S3 write (PR #30, @val314159)
    • Make smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • Various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • Support for multistream bzip files (PR #9, @pombredanne)
    • Introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.2.0(Aug 18, 2021)

    5.2.0, 18 August 2021

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.1.0(May 25, 2021)

    5.1.0, 25 May 2021

    This release introduces a new top-level parameter: compression. It controls compression behavior and partially overlaps with the old ignore_ext parameter. For details, see the README.rst file. You may continue to use ignore_ext parameter for now, but it will be deprecated in the next major release.

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v5.0.0(Mar 30, 2021)

    5.0.0, 30 Mar 2021

    This release modifies the handling of transport parameters for the S3 back-end in a backwards-incompatible way. See the migration docs for details.

    • Refactor S3, replace high-level resource/session API with low-level client API (PR #583, @mpenkov)
    • Fix potential infinite loop when reading from webhdfs (PR #597, @traboukos)
    • Add timeout parameter for http/https (PR #594, @dustymugs)
    • Remove tests directory from package (PR #589, @e-nalepa)

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v4.2.0(Feb 15, 2021)

    Unreleased

    4.2.0, 15 Feb 2021

    • Support tell() for text mode write on s3/gcs/azure (PR #582, @markopy)
    • Implement option to use a custom buffer during S3 writes (PR #547, @mpenkov)

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable. Allows to run on Google Compute Engine. (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • re-upload last S3 chunk in failed upload (PR #20, @andreycizov)
    • return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • pass optional keywords on S3 write (PR #30, @val314159)
    • smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • support for multistream bzip files (PR #9, @pombredanne)
    • introduce this CHANGELOG
    Source code(tar.gz)
    Source code(zip)
  • v4.1.2(Jan 18, 2021)

    Unreleased

    4.1.2, 18 Jan 2021

    • Correctly pass boto3 resource to writers (PR #576, @jackluo923)
    • Improve robustness of S3 reading (PR #552, @mpenkov)
    • Replace codecs with TextIOWrapper to fix newline issues when reading text files (PR #578, @markopy)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov)
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left hand side will be removed in future releases. Use the functions on the right hand side instead.

    • smart_open.s3_iter_bucketsmart_open.s3.iter_bucket

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Use only if you need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run:

    pip install smart_open[gcs] to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
    • Fix #289: the smart_open package now correctly exposes a __version__ attribute
    • Fix #285: handle edge case with question marks in an S3 URL

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort. Please see the README.rst file for details.

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • new transport_params dict parameter to contain keyword parameters for the transport layer (S3, HTTPS, HDFS, etc).

    Main advantages of the new function:

    • Simpler interface for the user, less parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated, but continues to work as previously.

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

    1.7.0, 18th September 2018

    1.6.0, 29th June 2018

    • Migrate to boto3. Fix #43 (PR #164, @mpenkov)
    • Refactoring smart_open to share compression and encoding functionality (PR #185, @mpenkov)
    • Drop python2.6 compatibility. Fix #156 (PR #192, @mpenkov)
    • Accept a custom boto3.Session instance (support STS AssumeRole). Fix #130, #149, #199 (PR #201, @eschwartz)
    • Accept multipart_upload parameters (supports ServerSideEncryption) for S3. Fix (PR #202, @eschwartz)
    • Add support for pathlib.Path. Fix #170 (PR #175, @clintval)
    • Fix performance regression using local file-system. Fix #184 (PR #190, @mpenkov)
    • Replace ParsedUri class with functions, cleanup internal argument parsing (PR #191, @mpenkov)
    • Handle edge case (read 0 bytes) in read function. Fix #171 (PR #193, @mpenkov)
    • Fix bug with changing f._current_pos when call f.readline() (PR #182, @inksink)
    • Сlose the old body explicitly after seek for S3. Fix #187 (PR #188, @inksink)

    1.5.7, 18th March 2018

    • Fix author/maintainer fields in setup.py, avoid bug from setuptools==39.0.0 and add workaround for botocore and python==3.3. Fix #176 (PR #178 & #177, @menshikh-iv & @baldwindc)

    1.5.6, 28th December 2017

    1.5.5, 6th December 2017

    • Fix problems from 1.5.4 release. Fix #153, #154 , partial fix #152 (PR #155, @mpenkov)

    1.5.4, 30th November 2017

    1.5.3, 18th May 2017

    • Remove GET parameters from url. Fix #120 (PR #121, @mcrowson)

    1.5.2, 12th Apr 2017

    • Enable compressed formats over http. Avoid filehandle leak. Fix #109 and #110. (PR #112, @robottwo )
    • Make possible to change number of retries (PR #102, @shaform)

    1.5.1, 16th Mar 2017

    • Bugfix for compressed formats (PR #110, @tmylk)

    1.5.0, 14th Mar 2017

    • HTTP/HTTPS read support w/ Kerberos (PR #107, @robottwo)

    1.4.0, 13th Feb 2017

    • HdfsOpenWrite implementation similar to read (PR #106, @skibaa)
    • Support custom S3 server host, port, ssl. (PR #101, @robottwo)
    • Add retry around s3_iter_bucket_process_key to address S3 Read Timeout errors. (PR #96, @bbbco)
    • Include tests data in sdist + install them. (PR #105, @cournape)

    1.3.5, 5th October 2016

    - Add MANIFEST.in required for conda-forge recip (PR #90, @tmylk)

    • Fix #92. Allow hash in filename (PR #93, @tmylk)

    1.3.4, 26th August 2016

    • Relative path support (PR #73, @yupbank)
    • Move gzipstream module to smart_open package (PR #81, @mpenkov)
    • Ensure reader objects never return None (PR #81, @mpenkov)
    • Ensure read functions never return more bytes than asked for (PR #84, @mpenkov)
    • Add support for reading gzipped objects until EOF, e.g. read() (PR #81, @mpenkov)
    • Add missing parameter to read_from_buffer call (PR #84, @mpenkov)
    • Add unit tests for gzipstream (PR #84, @mpenkov)
    • Bundle gzipstream to enable streaming of gzipped content from S3 (PR #73, @mpenkov)
    • Update gzipstream to avoid deep recursion (PR #73, @mpenkov)
    • Implemented readline for S3 (PR #73, @mpenkov)
    • Added pip requirements.txt (PR #73, @mpenkov)
    • Invert NO_MULTIPROCESSING flag (PR #79, @Janrain-Colin)
    • Add ability to add query to webhdfs uri. (PR #78, @ellimilial)

    1.3.3, 16th May 2016

    • Accept an instance of boto.s3.key.Key to smart_open (PR #38, @asieira)
    • Allow passing encrypt_key and other parameters to initiate_multipart_upload (PR #63, @asieira)
    • Allow passing boto host and profile_name to smart_open (PR #71 #68, @robcowie)
    • Write an empty key to S3 even if nothing is written to S3OpenWrite (PR #61, @petedmarsh)
    • Support LC_ALL=C environment variable setup (PR #40, @nikicc)
    • Python 3.5 support

    1.3.2, 3rd January 2016

    • Bug fix release to enable 'wb+' file mode (PR #50)

    1.3.1, 18th December 2015

    • Disable multiprocessing if unavailable, allowing smart_open to run on Google Compute Engine (PR #41, @nikicc)
    • Httpretty updated to allow LC_ALL=C locale config. (PR #39, @jsphpl)
    • Accept an instance of boto.s3.key.Key (PR #38, @asieira)

    1.3.0, 19th September 2015

    • WebHDFS read/write (PR #29, @ziky90)
    • Re-upload the last S3 chunk in a failed upload (PR #20, @andreycizov)
    • Return the entire key in s3_iter_bucket instead of only the key name (PR #22, @salilb)
    • Pass optional keywords on S3 write (PR #30, @val314159)
    • Make smart_open a no-op if passed a file-like object with a read attribute (PR #32, @gojomo)
    • Various improvements to testing (PR #30, @val314159)

    1.1.0, 1st February 2015

    • Support for multistream bzip2 files (PR #9, @pombredanne)
    • introduce this CHANGELOG
  • v4.1.0 (Dec 30, 2020)

    4.1.0, 30 Dec 2020

    • Refactor s3 submodule to minimize resource usage (PR #569, @mpenkov)
    • Change download_as_string to download_as_bytes in gcs submodule (PR #571, @alexandreyc)

    4.0.1, 27 Nov 2020

    • Exclude requests from the install_requires dependency list. If you need it, use pip install smart_open[http] or pip install smart_open[webhdfs].

    4.0.0, 24 Nov 2020

    • Fix reading empty file or seeking past end of file for s3 backend (PR #549, @jcushman)
    • Fix handling of rt/wt mode when working with gzip compression (PR #559, @mpenkov); see the example after this list
    • Bump minimum Python version to 3.6 (PR #562, @mpenkov)
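
    As a usage note, the rt/wt modes layer text encoding on top of the transparent compression inferred from the file extension; a minimal sketch (the .gz path is illustrative):

        from smart_open import open

        # 'wt' writes text through on-the-fly gzip compression;
        # 'rt' reads it back with transparent decompression.
        with open('example.txt.gz', 'wt', encoding='utf-8') as fout:
            fout.write('hello world\n')

        with open('example.txt.gz', 'rt', encoding='utf-8') as fin:
            print(fin.read())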

    3.0.0, 8 Oct 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.
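
    Since the extras are independent, pip's standard comma syntax installs several at once, e.g. both the S3 and GCS dependencies:

    pip install smart_open[s3,gcs]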

    2.2.1, 1 Oct 2020

    • Include S3 dependencies by default, because removing them in the 2.2.0 minor release was a mistake: it broke existing code on a minor version bump.
    • Instead, S3 dependencies will stop being installed by default in the next major release, 3.0.0. If you want to keep your smart_open installation lean, install 3.0.0 and pick only the extras you need.

    2.2.0, 25 Sep 2020

    This release modifies the behavior of setup.py with respect to dependencies. Previously, boto3 and other AWS-related packages were installed by default. Now, in order to install them, you need to run either:

    pip install smart_open[s3]
    

    to install the AWS dependencies only, or

    pip install smart_open[all]
    

    to install all dependencies, including AWS, GCS, etc.

    Summary of changes:

    • Correctly pass newline parameter to built-in open function (PR #478, @burkovae)
    • Remove boto as a dependency (PR #523, @isobit)
    • Performance improvement: avoid redundant GetObject API queries in s3.Reader (PR #495, @jcushman)
    • Support installing smart_open without AWS dependencies (PR #534, @justindujardin)
    • Take object version into account in to_boto3 method (PR #539, @interpolatio)

    Deprecations

    Functionality on the left-hand side will be removed in future releases. Use the functions on the right-hand side instead; a short migration sketch follows the list.

    • smart_open.s3_iter_bucket → smart_open.s3.iter_bucket
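
    A minimal migration sketch (bucket name and prefix are illustrative); both spellings iterate over (key, content) pairs:

        from smart_open import s3

        # Replaces the deprecated top-level smart_open.s3_iter_bucket.
        for key, content in s3.iter_bucket('my_bucket', prefix='logs/', workers=4):
            print(key, len(content))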

    2.1.1, 27 Aug 2020

    • Bypass unnecessary GCS storage.buckets.get permission (PR #516, @gelioz)
    • Allow SFTP connection with SSH key (PR #522, @rostskadat)

    2.1.0, 1 July 2020

    2.0.0, 27 April 2020, "Python 3"

    • This version supports Python 3 only (3.5+).
      • If you still need Python 2, install the smart_open==1.10.1 legacy release instead.
    • Prevent smart_open from writing to logs on import (PR #476, @mpenkov)
    • Modify setup.py to explicitly support only Py3.5 and above (PR #471, @Amertz08)
    • Include all the test_data in setup.py (PR #473, @sikuan)

    1.10.1, 26 April 2020

    • This is the last version to support Python 2.7. Versions 1.11 and above will support Python 3 only.
    • Temporarily disable the Google Cloud Storage transport mechanism for this release. If you want to use GCS, please use version 1.11 or above.
    • Use this version only if you still need Python 2.

    1.11.1, 8 Apr 2020

    • Add missing boto dependency (Issue #468)

    1.11.0, 8 Apr 2020

    Starting with this release, you will have to run

    pip install smart_open[gcs]

    to use the GCS transport.
    

    In the future, all extra dependencies will be optional. If you want to continue installing all of them, use:

    pip install smart_open[all]
    

    See the README.rst for details.

    1.10.0, 16 Mar 2020

    1.9.0, 3 Nov 2019

    1.8.4, 2 Jun 2019

    1.8.3, 26 April 2019

    1.8.2, 17 April 2019

    • Removed dependency on lzma (PR #262, @tdhopper)
    • Backward compatibility fixes (PR #294, @mpenkov)
    • Minor fixes (PR #291, @mpenkov)
      • Fix #289: the smart_open package now correctly exposes a __version__ attribute
      • Fix #285: handle an edge case in S3 URLs containing a question mark (?)
      • Fix #288: switch from logging to warnings at import time
      • Fix #47: add unit tests covering the absence of multiprocessing

    This release rolls back support for transparently decompressing .xz files, previously introduced in 1.8.1. This is a useful feature, but it requires a tricky dependency. It's still possible to handle .xz files with relatively little effort; see README.rst for details and the sketch below.
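
    For example, later releases added a register_compressor hook (not part of 1.8.2 itself), with which .xz support takes a few lines if you bring your own lzma; a sketch following the README recipe:

        import lzma
        from smart_open import open, register_compressor

        def _handle_xz(file_obj, mode):
            # Wrap the underlying stream in an LZMAFile for (de)compression.
            return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

        register_compressor('.xz', _handle_xz)

        # After registration, .xz behaves like the built-in .gz/.bz2 handling.
        with open('example.txt.xz', 'wt', encoding='utf-8') as fout:
            fout.write('hello\n')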

    1.8.1, 6 April 2019

    smart_open.open

    This new function replaces smart_open.smart_open, which is now deprecated. Main differences:

    • ignore_extension → ignore_ext
    • New transport_params dict parameter to hold keyword parameters for the transport layer (S3, HTTPS, HDFS, etc.)

    Main advantages of the new function:

    • Simpler interface for the user, fewer parameters
    • Greater API flexibility: adding additional keyword arguments will no longer require updating the top-level interface
    • Better documentation for keyword parameters (previously, they were documented via examples only)

    The old smart_open.smart_open function is deprecated but continues to work as before; an example of the new call style follows.
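
    For illustration, a call in the new style. This matches the 1.8.x-4.x API (later major releases replaced ignore_ext with a compression parameter), and the 'session' transport key is an era-specific assumption:

        import boto3
        from smart_open import open

        # Backend-specific options now travel in one dict instead of
        # widening the top-level signature; bucket and key are illustrative.
        tp = {'session': boto3.Session()}
        with open('s3://my_bucket/my_key', 'rb', ignore_ext=True, transport_params=tp) as fin:
            print(fin.read(16))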

    1.8.0, 17th January 2019

    1.7.1, 18th September 2018

    • Unpin boto/botocore for regular installation. Fix #227 (PR #232, @menshikh-iv)

