Inverted index creation and query search mechanism on Wikipedia pages.

Overview

WikiPedia Search Engine

Step 1 : Installing Requirements

Install "stemming" module for python using pip.

Step 2 : Parsing the Data

To parse the data, run the file "wikiIndexer.py"

To run the file, the syntax is: "python WikipediaIndexer.py "

It will parse the whole dump and file the index files in the 'indexFiles' directory. It also creates the document to title mapping file in the current directory named 'docTitleMap.txt' which will be used by the search module later. (NOTE : As of current code, it pushes the index to disk for every 5000 documents encountered. This can be changed by changing the value of ' WRITE_PAGES_TO_FILE' macro in config.py)

Step 3 : Merging the Indexes and Creating Secondary Indexes

This task is done by WikipediaIndexer.py itself. There isn't any need for separate command. It takes the index files from 'pathOfFolder' directory and populates the 'finalIndex' directory with indexes of given chunk size and creates a secondary index named 'secondaryIndex.txt' in the same folder. (NOTE : As of the current code, the chunk size is kept 20000)

Step 4 : Running the Search Engine

To search for queries, run the file 'search.py'. It loads the index from 'finalIndex' (both primary and secondary). It also uses the file 'titleoffset.txt' (which must be in the working directory) to display titles corresponding to the docIDs. After it loads up, it gives the user a prompt to enter the query. After that the result of query is displayed. (top K results).

The format of result is:

	
   
     : 
    
      (Here the DocID is from the XML Dump)
	...
	
     
    
   

To specify normal queries, type them normally. For Field Queries, follow the format:

	f1:
   
     f2:
    
      ...
	where f1,f2 are fields : t - title, b - body, r - references, i - infobox, c - categories, e - external links

    
   
Owner
Piyush Atri
Piyush Atri
High level Python client for Elasticsearch

Elasticsearch DSL Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch. It is built o

elastic 3.6k Dec 30, 2022
A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

A library for fast import of Windows NT Registry(REGF) into Elasticsearch.

S.Nakano 3 Apr 01, 2022
User-friendly, tiny source code searcher written by pure Python.

User-friendly, tiny source code searcher written in pure Python. Example Usages Cat is equivalent in the regular expression as '^Cat$' bor class Cat

Furkan Onder 106 Nov 02, 2022
A real-time tech course finder, created using Elasticsearch, Python, React+Redux, Docker, and Kubernetes.

A real-time tech course finder, created using Elasticsearch, Python, React+Redux, Docker, and Kubernetes.

Dinesh Sonachalam 130 Dec 20, 2022
This is a Telegram Bot written in Python for searching data on Google Drive.

This is a Telegram Bot written in Python for searching data on Google Drive. Supports multiple Shared Drives (TDs). Manual Guide for deploying the bot

Levi 158 Dec 27, 2022
An image inline search telegram bot.

Image-Search-Bot An image inline search telegram bot. Note: Use Telegram picture bot. That is better. Not recommending to deploy this bot. Made with P

Fayas Noushad 24 Oct 21, 2022
MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch.

MeiliSearch FastAPI MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch. Installation Using a virtual environmnet is recommen

Paul Sanders 29 Nov 18, 2022
A sphinx extension for designing beautiful, screen-size responsive web components.

sphinx-design A sphinx extension for designing beautiful, view size responsive web components. Created with inspiration from Bootstrap (v5), Material

Executable Books 109 Jan 01, 2023
A Python web searcher library with different search engines

Robert A simple Python web searcher library with different search engines. Install pip install roberthelper Usage from robert import GoogleSearcher

1 Dec 23, 2021
A play store search application programming interface ( API )

Play-Store-API A play store search application programming interface ( API ) Made with Python3

Fayas Noushad 8 Oct 21, 2022
rclip - AI-Powered Command-Line Photo Search Tool

rclip is a command-line photo search tool based on the awesome OpenAI's CLIP neural network.

Yurij Mikhalevich 394 Dec 12, 2022
Free and Open, Distributed, RESTful Search Engine

Elasticsearch Elasticsearch is the distributed, RESTful search and analytics engine at the heart of the Elastic Stack. You can use Elasticsearch to st

elastic 62.4k Jan 08, 2023
ElasticSearch ODM (Object Document Mapper) for Python - pip install esengine

esengine - The Elasticsearch Object Document Mapper esengine is an ODM (Object Document Mapper) it maps Python classes in to Elasticsearch index/doc_t

SEEK International AI 109 Nov 22, 2022
cve-search - a tool to perform local searches for known vulnerabilities

cve-search cve-search is a tool to import CVE (Common Vulnerabilities and Exposures) and CPE (Common Platform Enumeration) into a MongoDB to facilitat

cve-search 2k Jan 01, 2023
Jina allows you to build deep learning-powered search-as-a-service in just minutes

Cloud-native neural search framework for any kind of data

Jina AI 17k Dec 31, 2022
Python Elasticsearch handler for the standard python logging framework

Python Elasticsearch Log handler This library provides an Elasticsearch logging appender compatible with the python standard logging library. This lib

Mohammed Mousa 0 Dec 08, 2021
txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

NeuML 3.1k Dec 31, 2022
A fast, efficiency python package for searching and getting search results with many different search engines

search A fast, efficiency python package for searching and getting search results with many different search engines. Installation To install the pack

Neurs 0 Oct 06, 2022
PwnWiki 数据库搜索命令行工具;该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目

PWSearch PwnWiki 数据库搜索命令行工具。该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目。

K4YT3X 72 Dec 20, 2022
Pythonic Lucene - A simplified python impelementaiton of Apache Lucene

A simplified python impelementaiton of Apache Lucene, mabye helps to understand how an enterprise search engine really works.

Mahdi Sadeghzadeh Ghamsary 2 Sep 12, 2022