Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

Overview

Open Semantic Search

https://opensemanticsearch.org

Integrated search server, ETL framework for document processing (crawling, text extraction, text analysis, named entity recognition and OCR for images and embedded images in PDF), search user interfaces, text mining, text analytics and search apps for fulltext search, faceted search, exploratory search and knowledge graph search

Build

How to build the deb package for installation on Debian or Ubuntu server or the docker images for running in Docker containers:

Build deb package

To build a deb package for Debian or Ubuntu, call the build script "build-deb" as user root (change user by su or sudo su):

./build-deb

Build docker images

Clone the repository including the dependencies :

git clone --recurse-submodules --remote-submodules https://github.com/opensemanticsearch/open-semantic-search.git

Inside the opensemanticsearch directory, build the Docker images use the docker-compose config docker-compose.yml :

cd opensemanticsearch
docker-compose build

After these builds all the Docker images/dependencies/services can by started together by docker-compose with the config file docker-compose.yml.

You can run the instance by typing :

docker-compose up

You can browse OpenSemanticSearch in your favourite browser at this url :

http://localhost:8080/search/

Automated tests

For CI/CD there are some different automated tests:

Integration tests

Since the submodule Open Semantic ETL uses and needs different powerful services like Solr, spacY-services or Tika-Server by HTTP and REST-API, the automated tests run as integration tests within the docker-compose environment configured in docker-compose.etl-test.yml so these services are available while running the unittests.

End to end tests

Some automated integration tests and end-to-end (E2E) tests within a web browser controlled by the browser automation framework playwright and the node.js / javascript based test framework JEST.

You can extend the automated tests in test/test.js

They run by the docker image Dockerfile-test and need the services of the docker-compose environment docker-compose.test.yml

Dependencies

Dependencies are resolved automatically by building or by installation of the Debian or Ubuntu packages or by building the Docker images.

Documentation on this dependecies which may help debugging dependency hell issues or installations in other environments:

Build dependencies on Source code (GIT)

Dependencies on other Git repositories / submodules of components like Open Semantic ETL are defined in the Git config file .gitmodules

The submodules will be checked out automatically to the subdirectory "src", if you check out this repository by git in recursive mode.

Packaging dependencies of Java archives (JAR)

The submodules tika.deb and solr.deb need the JAR of Apache Tika-Server and Apache Solr.

If not there, they will be downloaded from Apache Software Foundation by wget in the submodule "build" script or its "Dockerfile".

Installation dependencies on Debian/Ubuntu packages (DEB)

Dependecies of tools and libraries, which are available in the Debian or Ubuntu package repositories, are defined in the section "Depends" of the deb package config file DEBIAN/control

https://github.com/opensemanticsearch/open-semantic-search/blob/master/DEBIAN/control

Installation dependencies on Python packages (PIP)

Dependecies of Python libraries which are not available as packages of the Linux distribution but in Python Package Index (PyPI), are defined in

https://github.com/opensemanticsearch/open-semantic-etl/blob/master/src/opensemanticetl/requirements.txt

This dependencies will be installed automatically on installation of the Debian/Ubuntu packages by DEBIAN/postinst of the Debian/Ubuntu packages or by docker build configured by Dockerfile by

pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt

Owner
Open Semantic Search
Search, analyze and explore large document collections by Open Source Search Engine, Text Mining, Document analysis and Text Analytics Explorer
Open Semantic Search
FOTS Pytorch Implementation

News!!! Recognition branch now is added into model. The whole project has beed optimized and refactored. ICDAR Dataset SynthText 800K Dataset detectio

Ning Lu 599 Dec 19, 2022
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

Quick Info this library tries to solve language detection of very short words and phrases, even shorter than tweets makes use of both statistical and

Peter M. Stahl 532 Dec 28, 2022
Layout Analysis Evaluator for the ICDAR 2017 competition on Layout Analysis for Challenging Medieval Manuscripts

LayoutAnalysisEvaluator Layout Analysis Evaluator for: ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records ICD

17 Dec 08, 2022
CNN+LSTM+CTC based OCR implemented using tensorflow.

CNN_LSTM_CTC_Tensorflow CNN+LSTM+CTC based OCR(Optical Character Recognition) implemented using tensorflow. Note: there is No restriction on the numbe

Watson Yang 356 Dec 08, 2022
Binarize document images

Binarization Binarization for document images Examples Introduction This tool performs document image binarization (i.e. transform colour/grayscale to

QURATOR-SPK 48 Jan 02, 2023
Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

Org. Account 165 Dec 31, 2022
A simple python program to record security cam footage by detecting a face and body of a person in the frame.

SecurityCam A simple python program to record security cam footage by detecting a face and body of a person in the frame. This code was created by me,

1 Nov 08, 2021
Links to awesome OCR projects

Awesome OCR This list contains links to great software tools and libraries and literature related to Optical Character Recognition (OCR). Contribution

Konstantin Baierer 2.2k Jan 02, 2023
This is the code for our paper DAAIN: Detection of Anomalous and AdversarialInput using Normalizing Flows

Merantix-Labs: DAAIN This is the code for our paper DAAIN: Detection of Anomalous and Adversarial Input using Normalizing Flows which can be found at

Merantix 14 Oct 12, 2022
Amazing 3D explosion animation using Pygame module.

3D Explosion Animation 💣 💥 🔥 Amazing explosion animation with Pygame. 💣 Explosion physics An Explosion instance is made of a set of Particle objec

Dylan Tintenfich 12 Mar 11, 2022
Resizing Canny Countour In Python

Resizing_Canny_Countour Install Visual Studio Code , https://code.visualstudio.com/download Select Python and install with terminal( pip install openc

Walter Ng 1 Nov 07, 2021
Assignment work with webcam

work with webcam : Press key 1 to use emojy on your face Press key 2 to use lip and eye on your face Press key 3 to checkered your face Press key 4 to

Hanane Kheirandish 2 May 31, 2022
Implement 'Single Shot Text Detector with Regional Attention, ICCV 2017 Spotlight'

SSTDNet Implement 'Single Shot Text Detector with Regional Attention, ICCV 2017 Spotlight' using pytorch. This code is work for general object detecti

HotaekHan 84 Jan 05, 2022
https://arxiv.org/abs/1904.01941

Character-Region-Awareness-for-Text-Detection- https://arxiv.org/abs/1904.01941 Train You can train SynthText data use python source/train_SynthText.p

DayDayUp 120 Dec 28, 2022
learn how to use Gesture Control to change the volume of a computer

Volume-Control-using-gesture In this project we are going to learn how to use Gesture Control to change the volume of a computer. We first look into h

Diwas Pandey 49 Sep 22, 2022
Image Smoothing and Blurring Using OpenCV

Image-Smoothing-and-Blurring-Using-OpenCV This repository contains codes for performing image smoothing and blurring using OpenCV. There are different

Happy N. Monday 3 Feb 15, 2022
This is used to convert a string to an Image with Handwritten Characters.

Text-to-Handwriting-using-python This is used to convert a string to an Image with Handwritten Characters. text_to_handwriting(string: str, save_to: s

Akashdeep Mahata 3 Aug 15, 2022
Perspective recovery of text using transformed ellipses

unproject_text Perspective recovery of text using transformed ellipses. See full writeup at https://mzucker.github.io/2016/10/11/unprojecting-text-wit

Matt Zucker 111 Nov 13, 2022
QuanTaichi: A Compiler for Quantized Simulations (SIGGRAPH 2021)

QuanTaichi: A Compiler for Quantized Simulations (SIGGRAPH 2021) Yuanming Hu, Jiafeng Liu, Xuanda Yang, Mingkuan Xu, Ye Kuang, Weiwei Xu, Qiang Dai, W

Taichi Developers 119 Dec 02, 2022