Kroomsa: A search engine for the curious

Overview

Kroomsa

Kroomsa

A search engine for the curious. It is a search algorithm designed to engage users by exposing them to relevant yet interesting content during their session.

Description

The search algorithm implemented in your website greatly influences visitor engagement. A decent implementation can significantly reduce dependency on standard search engines like Google for every query thus, increasing engagement. Traditional methods look at terms or phrases in your query to find relevant content based on syntactic matching. Kroomsa uses semantic matching to find content relevant to your query. There is a blog post expanding upon Kroomsa's motivation and its technical aspects.

Getting Started

Prerequisites

  • Python 3.6.5
  • Run the project directory setup: python3 ./setup.py in the root directory.
  • Tensorflow's Universal Sentence Encoder 4
    • The model is available at this link. Download the model and extract the zip file in the /vectorizer directory.
  • MongoDB is used as the database to collate Reddit's submissions. MongoDB can be installed following this link.
  • To fetch comments of the reddit submissions, PRAW is used. To scrape credentials are needed that authorize the script for the same. This is done by creating an app associated with a reddit account by following this link. For reference you can follow this tuorial written by Shantnu Tiwari.
    • Register multiple instances and retrieve their credentials, then add them to the /config under bot_codes parameter in the following format: "client_id client_secret user_agent" as list elements separated by ,.
  • Docker-compose (For dockerized deployment only): Install the latest version following this link.

Installing

  • Create a python environment and install the required packages for preprocessing using: python3 -m pip install -r ./preprocess_requirements.txt
  • Collating a dataset of Reddit submissions
    • Scraping posts
      • Pushshift's API is being used to fetch Reddit submissions. In the root directory, run the following command: python3 ./pre_processing/scraping/questions/scrape_questions.py. It launches a script that scrapes the subreddits sequentially till their inception and stores the submissions as JSON objects in /pre_processing/scraping/questions/scraped_questions. It then partitions the scraped submissions into as many equal parts as there are registered instances of bots.
    • Scraping comments
      • After populating the configuration with bot_codes, we can begin scraping the comments using the partitioned submission files created while scraping submissions. Using the following command: python3 ./pre_processing/scraping/comments/scrape_comments.py multiple processes are spawned that fetch comment streams simultaneously.
    • Insertion
      • To insert the submissions and associated comments, use the following commands: python3 ./pre_processing/db_insertion/insertion.py. It inserts the posts and associated comments in mongo.
      • To clean the comments and tag the posts that aren't public due to any reason, Run python3 ./post_processing/post_processing.py. Apart from cleaning, it also adds emojis to each submission object (This behavior is configurable).
  • Creating a FAISS Index
    • To create a FAISS index, run the following command: python3 ./index/build_index.py. By default, it creates an exhaustive IDMap, Flat index but is configurable through the /config.
  • Database dump (For dockerized deployment)
    • For dockerized deployment, a database dump is required in /mongo_dump. Use the following command at the root dir to create a database dump. mongodump --db database_name(default: red) --collection collection_name(default: questions) -o ./mongo_dump.

Execution

  • Local deployment (Using Gunicorn)
    • Create a python environment and install the required packages using the following command: python3 -m pip install -r ./inference_requirements.txt
    • A local instance of Kroomsa can be deployed using the following command: gunicorn -c ./gunicorn_config.py server:app
  • Dockerized demo
    • Set the demo_mode to True in /config.
    • Build images: docker-compose build
    • Deploy: docker-compose up

Authors

License

This project is licensed under the Apache License Version 2.0

A general-purpose programming language, focused on simplicity, safety and stability.

The Rivet programming language A general-purpose programming language, focused on simplicity, safety and stability. Rivet's goal is to be a very power

The Rivet programming language 17 Dec 29, 2022
GANSketchingJittor - Implementation of Sketch Your Own GAN in Jittor

GANSketching in Jittor Implementation of (Sketch Your Own GAN) in Jittor(计图). Or

Bernard Tan 10 Jul 02, 2022
Mall-Customers-Segmentation - Customer Segmentation Using K-Means Clustering

Overview Customer Segmentation is one the most important applications of unsupervised learning. Using clustering techniques, companies can identify th

NelakurthiSudheer 2 Jan 03, 2022
Video Matting Refinement For Python

Video-matting refinement Library (use pip to install) scikit-image numpy av matplotlib Run Static background python path_to_video.mp4 Moving backgroun

3 Jan 11, 2022
Personals scripts using ageitgey/face_recognition

HOW TO USE pip3 install requirements.txt Add some pictures of known people in the folder 'people' : a) Create a folder called by the name of the perso

Antoine Bollengier 1 Jan 06, 2022
The code for our paper submitted to RAL/IROS 2022: OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition.

OverlapTransformer The code for our paper submitted to RAL/IROS 2022: OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for

HAOMO.AI 136 Jan 03, 2023
GRF: Learning a General Radiance Field for 3D Representation and Rendering

GRF: Learning a General Radiance Field for 3D Representation and Rendering [Paper] [Video] GRF: Learning a General Radiance Field for 3D Representatio

Alex Trevithick 243 Dec 29, 2022
Federated learning on graph, especially on graph neural networks (GNNs), knowledge graph, and private GNN.

Federated learning on graph, especially on graph neural networks (GNNs), knowledge graph, and private GNN.

keven 198 Dec 20, 2022
A library of scripts that interact with the PythonTurtle module to create games, drawings, and more

TurtleLib TurtleLib is a library of scripts that interact with the PythonTurtle module to create games, drawings, and more! Using the Scripts Copy or

1 Jan 15, 2022
Tensorflow2 Keras-based Semantic Segmentation Models Implementation

Tensorflow2 Keras-based Semantic Segmentation Models Implementation

Hah Min Lew 1 Feb 08, 2022
Streamlit Tutorial (ex: stock price dashboard, cartoon-stylegan, vqgan-clip, stylemixing, styleclip, sefa)

Streamlit Tutorials Install pip install streamlit Run cd [directory] streamlit run app.py --server.address 0.0.0.0 --server.port [your port] # http:/

Jihye Back 30 Jan 06, 2023
《Dual-Resolution Correspondence Network》(NeurIPS 2020)

Dual-Resolution Correspondence Network Dual-Resolution Correspondence Network, NeurIPS 2020 Dependency All dependencies are included in asset/dualrcne

Active Vision Laboratory 45 Nov 21, 2022
OSLO: Open Source framework for Large-scale transformer Optimization

O S L O Open Source framework for Large-scale transformer Optimization What's New: December 21, 2021 Released OSLO 1.0. What is OSLO about? OSLO is a

TUNiB 280 Nov 24, 2022
i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery

i-SpaSP: Structured Neural Pruning via Sparse Signal Recovery This is a public code repository for the publication: i-SpaSP: Structured Neural Pruning

Cameron Ronald Wolfe 5 Nov 04, 2022
A Pytorch implementation of SMU: SMOOTH ACTIVATION FUNCTION FOR DEEP NETWORKS USING SMOOTHING MAXIMUM TECHNIQUE

SMU_pytorch A Pytorch Implementation of SMU: SMOOTH ACTIVATION FUNCTION FOR DEEP NETWORKS USING SMOOTHING MAXIMUM TECHNIQUE arXiv https://arxiv.org/ab

Fuhang 36 Dec 24, 2022
This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. (CVPR 2022)"

Gait3D-Benchmark This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild

82 Jan 04, 2023
3rd place solution for the Weather4cast 2021 Stage 1 Challenge

weather4cast2021_Stage1 3rd place solution for the Weather4cast 2021 Stage 1 Challenge Dependencies The code can be executed from a fresh environment

5 Aug 14, 2022
A full-fledged version of Pix2Seq

Stable-Pix2Seq A full-fledged version of Pix2Seq What it is. This is a full-fledged version of Pix2Seq. Compared with unofficial-pix2seq, stable-pix2s

peng gao 205 Dec 27, 2022
curl-impersonate: A special compilation of curl that makes it impersonate Chrome & Firefox

curl-impersonate A special compilation of curl that makes it impersonate real browsers. It can impersonate the four major browsers: Chrome, Edge, Safa

lwthiker 1.9k Jan 03, 2023
[AAAI 2021] MVFNet: Multi-View Fusion Network for Efficient Video Recognition

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

Wenhao Wu 114 Nov 27, 2022