A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

Feedback-TelegramBot is a resemblance bot which can be deployed on server

Feedback-TelegramBot Feedback-TelegramBot is a resemblance bot which can be deployed on server This work is based on Telegram library, thanks to their

2 Jan 03, 2022
Checks instagram names to see if they're available

How to install You must have python 3.7.6 installed and make sure you click the 'ADD TO PATH' option when installing Open cmd and type pip install aio

2 Oct 20, 2021
asyncio client for Deta Cloud

aiodeta Unofficial client for Deta Clound Install pip install aiodeta Supported functionality Deta Base Deta Drive Decorator for cron tasks Examples i

Andrii Leitsius 19 Feb 14, 2022
Shuffle and add items from jellyfin to mpd (use in tandem with jellyfin-mopidy and mpd-mopidy). Similar to ncmpcpp's "Add random" feature..

jellyshuf Essentially implements ncmpcpp's add random feature (default hotkey: `) through a script which grabs info from jellyfin api itself. jellyfin

Ethan Djeric 2 Dec 14, 2021
A decentralized messaging daemon built on top of the Kademlia routing protocol.

parakeet-message A decentralized messaging daemon built on top of the Kademlia routing protocol. Now that you are done laughing... pictures what is it

Jonathan Abbott 3 Apr 23, 2022
Auto-Approved-Bot - Auto Approved Invaite Link Request Telegram Bot

πŸ€– π—”π˜‚π˜π—Ό-π—”π—½π—½π—Ώπ—Όπ˜ƒπ—²-π—•π—Όπ˜ πŸ€– ℹ️ π—¨π˜€π—²π—΄π—² ℹ️ When a join request invita

Muhammed 32 Dec 18, 2022
Asca - Antiscam Discord Bot With Python

asca Antiscam Discord Bot Asca moderates scammers and deletes scam messages Opti

11 Nov 01, 2022
Get informed when your DeFI Earn CRO Validator is jailed or changes the commission rate.

CRO-DeFi-Warner Intro CRO-DeFi-Warner can be used to notify you when a validator changes the commission rate or gets jailed. It can also notify you wh

5 May 16, 2022
:lock: Python 2.7/3.X client for HashiCorp Vault

hvac HashiCorp Vault API client for Python 3.x Tested against the latest release, HEAD ref, and 3 previous minor versions (counting back from the late

hvac 1k Dec 29, 2022
Fastest Tiktok Username checker on site.

Tiktok Username Checker Fastest Tiktok Username checker on site

sql 3 Jun 19, 2021
One of the best Telegram renamer bot with many new features

Renamer-Bot I think this repo gonna become one of the best renamer open source πŸ₯° . Please Give a ⭐ if you like this repo and also try following me fo

Ns Bots 97 Jan 06, 2023
Parse 11.000 free proxies!

Proxy Machine Description I did this project in order to boost views with the teleboost ✈️ in my Telegram channel. You can use it not only for boostin

VLDSLV 77 Jan 08, 2023
Command-line program to download image galleries and collections from several image hosting sites

gallery-dl gallery-dl is a command-line program to download image galleries and collections from several image hosting sites (see Supported Sites). It

Mike FΓ€hrmann 6.4k Jan 06, 2023
Image captioning service for healthcare domains in Vietnamese using VLP

Image captioning service for healthcare domains in Vietnamese using VLP This service is a web service that provides image captioning services for heal

CS-UIT AI Club 2 Nov 04, 2021
Tesseract Open Source OCR Engine (main repository)

Tesseract OCR About This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM

48.3k Jan 05, 2023
The Dolby.io Developer Days Getting Started with Media APIs Workshop repo.

Dolby.io Developer Days Media APIs Getting Started Application About this Workshop and Application This example is designed to get participants workin

Dolby.io Samples 2 Nov 03, 2022
Simple Self-Bot for Discord

KeunoBot 🐼 -Simple Self-Bot for Discord KEUNOBOT 🐼 - Run KeunoBot : /* - Install KeunoBot - Extract it - Run setup.bat - Set token and prefi

Bidouffe 2 Mar 10, 2022
Python Capfire API wrapper

General CampfireAPI based on Campfire web. Install pip install Campfire-API Quickstart Use it without login: from campfire_api import CampfireAPI cf

Ghost 0 Jan 03, 2022
A Python library for the Buildkite API

PyBuildkite A Python library and client for the Buildkite API. Usage To get the package, execute: pip install pybuildkite Then set up an instance of

Peter Yasi 29 Nov 30, 2022
API Wrapper for seedr.cc

Seedr Python Client Seedr API built with πŸ’› by Souvik Pratiher Hit that Star button if you like this kind of SDKs and wants more of similar SDKs for o

Souvik Pratiher 2 Oct 24, 2021