A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

A python package that allows you to place automated trades using the TD Ameritrade API.

Template Repo Table of Contents Overview Setup Usage Support These Projects Overview Setup Setup - Requirements Install:* For this particular project,

Alex Reed 4 Jan 25, 2022
BioThings API framework - Making high-performance API for biological annotation data

BioThings SDK Quick Summary BioThings SDK provides a Python-based toolkit to build high-performance data APIs (or web services) from a single data sou

BioThings 39 Jan 04, 2023
MVP monorepo to rapidly develop scalable, reliable, high-quality components for Amazon Linux instance configuration management

Ansible Amazon Base Repository Ansible Amazon Base Repository About Setting Up Ansible Environment Configuring Python VENV and Ansible Editor Configur

Artem Veremey 1 Aug 06, 2022
Provide fine-grained push access to GitHub from a JupyterHub

github-app-user-auth Provide fine-grained push access to GitHub from a JupyterHub. Goals Allow users on a JupyterHub to grant push access to only spec

Yuvi Panda 20 Sep 13, 2022
Discord-selfbot - Very basic discord self bot

discord-selfbot Very basic discord self bot still being actively developed requi

nana 4 Apr 07, 2022
This Telegram bot allows you to create direct links with pre-filled text to WhatsApp Chats

WhatsApp API Bot Telegram bot to create direct links with pre-filled text for WhatsApp Chats You can check our bot here. The bot is based on the API p

RobotTrick • רובוטריק 17 Aug 20, 2022
Pythonic wrapper for the Aladhan prayer times API.

aladhan.py is a pythonic wrapper for the Aladhan prayer times API. Installation Python 3.6 or higher is required. To Install aladhan.py with pip: pip

HETHAT 8 Aug 17, 2022
Paginator for Dis-Snek Python Discord API wrapper

snek-paginator Paginator for Dis-Snek Python Discord API wrapper Installation: pip install -U snek-paginator Basic Example: from dis_snek.client impo

1 Nov 04, 2021
PyDiscord, a maintained fork of discord.py, is a python wrapper for the Discord API.

discord.py A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. The Future of discord.py Please read the gi

Omkaar 1 Jan 16, 2022
A Telegram Bot to display Codeforces Contest Ranklist

CFRankListBot A bot that displays the top ranks for a Codeforces contest. Participants' Details All the details of a participant is in the utils/__ini

Code IIEST 5 Dec 25, 2021
A multi exploit instagram exploitation framework

Instagram Exploitation Framework About IEF Is an open source Instagram Exploitation Framework with various Exploits that could be used to mod your pro

Instagram Exploitation Framework - BirdSecurity 1 May 23, 2022
Linkvertise-Bypass - Bypass Linkvertise advertisement

Linkvertise-Bypass Bypass Linkvertise advertisement 📕 instructions Copy And Pas

Flex Tools 4 Jun 10, 2022
This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

Cross-Domain Adaptive Teacher for Object Detection This is the PyTorch implementation of our paper: Cross-Domain Adaptive Teacher for Object Detection

Meta Research 91 Dec 12, 2022
A free tempmail api for your needs!

Tempmail A free tempmail api for your needs! Website · Report Bug · Request Feature Features Add your own private domains Easy to use documentation No

dropout 10 Oct 26, 2021
Sielzz Music adalah proyek bot musik telegram, memungkinkan Anda memutar musik di telegram grup obrolan suara.

Hi, I am: Requirements 📝 FFmpeg NodeJS nodesource.com Python 3.8 or higher PyTgCalls MongoDB Get STRING_SESSION from below: 🎖 History Features 🔮 Th

1 Nov 29, 2021
Quickly visualize docker networks with graphviz.

Docker Network Graph Visualize the relationship between Docker networks and containers as a neat graphviz graph. Example Usage usage: docker-net-graph

Leo Verto 43 Dec 12, 2022
A Telegram mirror bot which can be deployed using Heroku.

Slam Mirror Bot This is a telegram bot writen in python for mirroring files on the internet to our beloved Google Drive. Getting Google OAuth API cred

Hafitz Setya 1.2k Jan 01, 2023
Python wrapper to simplify calls to AncestryDNA API.

AncestryDNA API wrapper Ancestry exposes an undocumented REST API for its DNA features. This Python wrapper inventories the available calls, and expos

Matt 2 Jun 10, 2022
Lamblayer: a minimal deployment tool for AWS Lambda layers

lamblayer lamblayer is a minimal deployment tool for AWS Lambda layers. lamblayer does, Create a Layers of built pip-installable python packages. Crea

Yusuke Takahashi 2 Aug 19, 2022
An example of using discordpy 2.0.0a to create a bot that supports slash commands

DpySlashBotExample An example of using discordpy 2.0.0a to create a bot that supports slash commands. This is not a fully complete bot, just an exampl

7 Oct 17, 2022