extract-paragraphs-with-aws-textract

Since AWS Textract (the AWS OCR service) does not have a native function to extract paragraphs, this repository provides a set of Python 3.X functions built on top of the AWS Python SDK (boto3) to extract paragraphs from AWS Textract responses.
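For orientation, here is a minimal sketch of how a Textract response for a PDF stored in S3 might be retrieved with boto3 before any paragraph logic runs. It is not the code in paragraph_extraction.py: the function name `fetch_line_blocks` and its parameters are hypothetical, and the Textract client is passed in so the snippet does not depend on your AWS configuration. It uses the boto3 Textract calls `start_document_text_detection` and `get_document_text_detection`.

```python
import time


def fetch_line_blocks(textract_client, bucket, key, poll_seconds=5.0):
    """Run an asynchronous text-detection job on a PDF in S3 and return its LINE blocks.

    `textract_client` is expected to behave like boto3.client("textract").
    """
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        response = textract_client.get_document_text_detection(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if response["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(
            f"Textract job {job_id} ended with status {response['JobStatus']}"
        )

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(response.get("Blocks", []))
    while "NextToken" in response:
        response = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))

    # Keep only the LINE blocks, which carry one line of text each.
    return [block for block in blocks if block["BlockType"] == "LINE"]
```

A call could look like `fetch_line_blocks(boto3.client("textract"), "my-bucket", "docs/report.pdf")`, where the bucket and key names are placeholders. Polling in a loop keeps the sketch short; for production workloads an SNS notification channel is the usual alternative.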

PLEASE NOTE THAT:

It is assumed that your client has the necessary IAM permissions to access the required AWS resources.
Since AWS Textract analyzes PDF files through asynchronous operations, the current version assumes that you have already created an S3 bucket and that the PDF files are already stored there. If not, see the boto3 docs to learn how to create a bucket and upload files.
The paragraph_constructor function is ad hoc for my use case. You may have to adapt it to the spacing between lines in your data.
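To illustrate why line spacing matters, here is a minimal sketch of gap-based paragraph grouping. It is not the repository's paragraph_constructor: the function name `group_lines_into_paragraphs` and the `gap_factor` parameter are hypothetical, and the sketch assumes a single-column layout. It relies only on fields that Textract LINE blocks carry (`Text`, `Page`, and `Geometry.BoundingBox` with `Top` and `Height` expressed as fractions of the page).

```python
def group_lines_into_paragraphs(line_blocks, gap_factor=0.5):
    """Group Textract LINE blocks into paragraph strings using vertical gaps.

    A new paragraph starts when the page changes or when the space between
    the bottom of the previous line and the top of the current line exceeds
    gap_factor times the previous line's height.
    """
    # Sort by page, then top to bottom (assumes a single-column layout).
    ordered = sorted(
        line_blocks,
        key=lambda b: (b.get("Page", 1), b["Geometry"]["BoundingBox"]["Top"]),
    )

    paragraphs = []
    current = []
    prev = None
    for block in ordered:
        box = block["Geometry"]["BoundingBox"]
        if prev is not None:
            prev_box = prev["Geometry"]["BoundingBox"]
            gap = box["Top"] - (prev_box["Top"] + prev_box["Height"])
            new_page = block.get("Page", 1) != prev.get("Page", 1)
            if new_page or gap > gap_factor * prev_box["Height"]:
                # Close the running paragraph and start a new one.
                paragraphs.append(" ".join(current))
                current = []
        current.append(block["Text"])
        prev = block

    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Tuning `gap_factor` is the adaptation step: documents with generous line spacing need a larger value, tightly set documents a smaller one.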

UPCOMING FEATURES:

Address abstract cases with the paragraph_constructor function.
Export data in different formats.
AWS CloudFormation template for a serverless architecture that executes the functions when a new object is uploaded to your S3 bucket.

Please feel free to suggest new features or improvements to the current code. <3

jsanzolac/extract-paragraphs-with-aws-textract
