A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

A Python Library to interface with Tumblr v2 REST API & OAuth

Tumblpy Tumblpy is a Python library to help interface with Tumblr v2 REST API & OAuth Features Retrieve user information and blog information Common T

Mike Helmick 125 Jun 20, 2022
API para realizar parser de frases

NLP API Simple api to parse and apply some preprocessing steps in portuguses phrases (pt_BR) This api uses the great FastAPI and spaCy packages! Usage

⟠ Rodolfo De Nadai 1 Dec 28, 2021
Fetch tracking numbers of Amazon orders, for the ease of the logistics.

Amazon-Tracking-Number Fetch tracking numbers of Amazon orders, for the ease of the logistics. Read Me First (How to use this code): Get Amazon "Items

Tony Yao 1 Nov 02, 2021
An information scroller Twitter trends, news, weather for raspberry pi and Pimoroni Unicorn Hat Mini and Scroll Phat HD.

uticker An information scroller Twitter trends, news, weather for raspberry pi and Pimoroni Unicorn Hat Mini and Scroll Phat HD. Features include: Twi

Tansu Şenyurt 1 Nov 28, 2021
Autodrive is designed to make it as easy as possible to interact with the Google Drive and Sheets APIs via Python

Autodrive Autodrive is designed to make it as easy as possible to interact with the Google Drive and Sheets APIs via Python. It is especially designed

Chris Larabee 1 Oct 02, 2021
An Open Source ALL-In-One Telegram RoBot, that can do lot of things.

An Open Source ALL-In-One Telegram RoBot, that can do lot of things.

JOBIN 0 Dec 01, 2021
Running Performance Calculator

Running Performance Calculator 👉 Have you ever wondered if you ran 10km at 2000

Davide Liu 6 Oct 26, 2022
A discord bot that will help you browse/download nhentai sources.

Risa Introduction Risa is an nHentai discord bot that will help you browse and download your favorite doujin inside your own discord server. Hosting M

markee7 14 Oct 25, 2021
This script books automatically a slot on Doctolib in one of the public vaccination centers in Berlin.

BOOKING IN BERLINS VACCINATION CENTERS This python script books automatically a slot on Doctolib in one of the public vaccination centers in Berlin. T

17 Jan 13, 2022
Autofilter with imdb bot || broakcast , imdb poster and imdb rating

LuciferMoringstar_Robot How To Deploy Video Subscribe YouTube Channel Added Features Imdb posters for autofilter. Imdb rating for autofilter. Custom c

Muhammed 127 Dec 29, 2022
A Telegram Bot to return Youtube Video Tags Using YoutubeTags API

YouTube-TagFind-Bot A Telegram Bot to return Youtube Video Tags Using YoutubeTags API YoutubeTags API Wrapper YoutubeTags is a python third-party api

Nuhman Pk 9 Aug 25, 2022
Braintree Python library

Braintree Python library The Braintree Python library provides integration access to the Braintree Gateway. TLS 1.2 required The Payment Card Industry

Braintree 230 Dec 18, 2022
Unlimited Filter Telegram Bot 2

Mother NAther Bot Features Auto Filter Manuel Filter IMDB Admin Commands Broadcast Index IMDB search Inline Search Random pics ids and User info Stats

LɪᴏɴKᴇᴛᴛʏUᴅ 1 Oct 30, 2021
🦊 Powerfull Discord Nitro Generator

🦊 Follow me here 🦊 Discord | YouTube | Github ☕ Usage 💻 Downloading git clone https://github.com/KanekiWeb/Nitro-Generator/new/main pip insta

Kaneki 104 Jan 02, 2023
Infrastructure template and Jupyter notebooks for running RoseTTAFold on AWS Batch.

AWS RoseTTAFold Infrastructure template and Jupyter notebooks for running RoseTTAFold on AWS Batch. Overview Proteins are large biomolecules that play

AWS Samples 20 May 10, 2022
Most Simple & Powefull web3 Trade Bot (WINDOWS LINUX) Suport BSC ETH

Most Simple & Powefull Trade Bot (WINDOWS LINUX) What Are Some Pros And Cons Of Owning A Sniper Bot? While having a sniper bot is typically an advanta

GUI BOT 4 Jan 25, 2022
Discord bot for playing blindfold chess.

Albin Discord bot for playing blindfold chess written in Python. Albin takes the moves from chat and pushes them on the board without showing it. TODO

8 Oct 14, 2022
✖️ Unofficial API of 1337x.to

✖️ Unofficial Python API Wrapper of 1337x This is the unofficial API of 1337x. It supports all proxies of 1337x and almost all functions of 1337x. You

Hemanta Pokharel 71 Dec 26, 2022
3X Fast Telethon Based Bot

📺 YouTube Song Downloader Bot For Telegram 🔮 3X Fast Telethon Based Bot ⚜ Easy To Deploy 🤗

@Dk_king_offcial 1 Dec 09, 2021
TORNADO CASH Proxy Pancakeswap Sniper BOT 2022-V1 (MAC WINDOWS ANDROID LINUX)

TORNADO CASH Pancakeswap Sniper BOT 2022-V1 (MAC WINDOWS ANDROID LINUX) ⭐️ A ful

Crypto Trader 1 Jan 06, 2022