Open source research tool to search, browse, analyze and explore large document collections with a semantic search engine and an open source text mining and text analytics platform. It integrates ETL for document processing, OCR for images and PDF, named entity recognition for persons, organizations and locations, metadata management by thesauri and ontologies, and a search user interface and search apps for fulltext search, faceted search and knowledge graph.

Overview

Open Semantic Search

https://opensemanticsearch.org

Integrated search server, ETL framework for document processing (crawling, text extraction, text analysis, named entity recognition and OCR for images and embedded images in PDF), search user interfaces, text mining, text analytics and search apps for fulltext search, faceted search, exploratory search and knowledge graph search

Build

How to build the deb package for installation on a Debian or Ubuntu server, or the Docker images for running in Docker containers:

Build deb package

To build a deb package for Debian or Ubuntu, run the build script "build-deb" as user root (switch user with su or sudo su):

./build-deb

Build docker images

Clone the repository including its dependencies:

git clone --recurse-submodules --remote-submodules https://github.com/opensemanticsearch/open-semantic-search.git

Inside the open-semantic-search directory created by the clone, build the Docker images using the docker-compose config docker-compose.yml:

cd open-semantic-search
docker-compose build

After these builds, all the Docker images, dependencies and services can be started together by docker-compose with the config file docker-compose.yml.

You can run the instance by typing:

docker-compose up

You can browse Open Semantic Search in your favourite browser at this URL:

http://localhost:8080/search/
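
The services can take a while to become available after docker-compose up. As a minimal sketch, the following helper polls the search URL with curl until it answers. The function name wait_for_url, its arguments and its defaults are hypothetical and not part of the project.

```shell
# Sketch: poll a URL until it answers with a successful HTTP status.
# wait_for_url and its parameters are illustrative, not part of the repository.
wait_for_url() {
    url=$1
    tries=${2:-30}    # maximum number of attempts
    delay=${3:-5}     # seconds to wait between attempts
    i=1
    while :; do
        # -f: fail on HTTP errors, -s: silent, -L: follow redirects
        curl -fsL --max-time 10 -o /dev/null "$url" && return 0
        [ "$i" -ge "$tries" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Usage, with "docker-compose up" running in another terminal:
#   wait_for_url http://localhost:8080/search/ && echo "search UI is up"
```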

Automated tests

For CI/CD there are several kinds of automated tests:

Integration tests

The submodule Open Semantic ETL needs several services, such as Solr, the spaCy services or Tika-Server, which it uses over HTTP and REST APIs. Its automated tests therefore run as integration tests within the docker-compose environment configured in docker-compose.etl-test.yml, so that these services are available while the unit tests run.

End to end tests

Some automated integration tests and end-to-end (E2E) tests run within a web browser controlled by the browser automation framework Playwright and the Node.js / JavaScript based test framework Jest.

You can extend the automated tests in test/test.js.

They run in the Docker image built from Dockerfile-test and need the services of the docker-compose environment docker-compose.test.yml.
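
How the project's CI invokes these compose files is not stated here. As a sketch under that caveat, a test environment could be started with docker-compose up and stopped as soon as one container exits. COMPOSE and run_compose_tests are illustrative names, not part of the repository.

```shell
# Sketch: start a compose test environment and stop it when a container exits.
# COMPOSE and run_compose_tests are illustrative names, not part of the repository.
COMPOSE=${COMPOSE:-docker-compose}

run_compose_tests() {
    compose_file=$1
    # --build: rebuild the images before starting
    # --abort-on-container-exit: stop all services as soon as one container exits
    $COMPOSE -f "$compose_file" up --build --abort-on-container-exit
}

# Usage:
#   run_compose_tests docker-compose.etl-test.yml   # integration tests
#   run_compose_tests docker-compose.test.yml       # end-to-end tests
```

To use the exit status of the test container as the result of the command, docker-compose up also accepts --exit-code-from followed by a service name. The service name has to be taken from the compose file and is not assumed here.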

Dependencies

Dependencies are resolved automatically when you build or install the Debian or Ubuntu packages or build the Docker images.

The following documentation of these dependencies may help when debugging dependency problems or installing in other environments:

Build dependencies on Source code (GIT)

Dependencies on other Git repositories / submodules of components like Open Semantic ETL are defined in the Git config file .gitmodules.

The submodules will be checked out automatically to the subdirectory "src", if you check out this repository by git in recursive mode.
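
As a small sketch, the submodule paths declared in .gitmodules can be listed with git config, and a checkout made without recursive mode can be completed afterwards. The helper name list_submodule_paths is hypothetical.

```shell
# Sketch: print the checkout path of every submodule declared in a
# .gitmodules file. list_submodule_paths is an illustrative helper name.
list_submodule_paths() {
    git config --file "${1:-.gitmodules}" --get-regexp '^submodule\..*\.path$' |
        while read -r _key path; do
            printf '%s\n' "$path"
        done
}

# If the repository was cloned without --recurse-submodules, the
# submodules can still be fetched afterwards:
#   git submodule update --init --recursive
```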

Packaging dependencies of Java archives (JAR)

The submodules tika.deb and solr.deb need the JAR of Apache Tika-Server and Apache Solr.

If the JARs are not present, the "build" script of the submodule or its "Dockerfile" downloads them from the Apache Software Foundation with wget.
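
The download-if-missing step can be sketched as follows. This is an illustration of the pattern, not the actual build script of the submodules: fetch_if_missing is a hypothetical helper and the URL in the usage comment is a placeholder.

```shell
# Sketch: download a file with wget only if it is not already present.
# fetch_if_missing is an illustrative helper, not the project's build script.
fetch_if_missing() {
    target=$1
    url=$2
    if [ -f "$target" ]; then
        echo "using existing $target"
        return 0
    fi
    # Download to a temporary name so that an aborted transfer never
    # leaves a truncated file that a later run would accept.
    if wget -O "$target.part" "$url"; then
        mv "$target.part" "$target"
    else
        rm -f "$target.part"
        return 1
    fi
}

# Usage (placeholder URL):
#   fetch_if_missing tika-server.jar https://example.org/tika-server.jar
```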

Installation dependencies on Debian/Ubuntu packages (DEB)

Dependencies on tools and libraries that are available in the Debian or Ubuntu package repositories are defined in the "Depends" section of the deb package config file DEBIAN/control:

https://github.com/opensemanticsearch/open-semantic-search/blob/master/DEBIAN/control
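
When installing in another environment, it can help to see the "Depends" entries one per line. The following sketch extracts them from a control file, including continuation lines that start with whitespace. The helper name list_deb_depends is hypothetical.

```shell
# Sketch: print each entry of the "Depends" field of a Debian control
# file on its own line. list_deb_depends is an illustrative helper name.
list_deb_depends() {
    control=${1:-DEBIAN/control}
    # Collect the Depends line plus any indented continuation lines,
    # then split on commas and trim surrounding whitespace.
    awk '/^Depends:/ { on = 1; sub(/^Depends:/, ""); print; next }
         on && /^[ \t]/ { print; next }
         { on = 0 }' "$control" |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage, from the root of the repository:
#   list_deb_depends DEBIAN/control
```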

Installation dependencies on Python packages (PIP)

Dependencies on Python libraries that are not available as packages of the Linux distribution but are in the Python Package Index (PyPI) are defined in

https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/requirements.txt

These dependencies are installed automatically by DEBIAN/postinst when the Debian/Ubuntu packages are installed, or by docker build as configured in the Dockerfile, using

pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt
