An open source, non-profit search engine implemented in python

Overview

Mwmbl: No ads, no tracking, no cruft, no profit

Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and speed. At the moment it is little more than an idea together with a proof of concept implementation of the web front-end and search technology on a very small index. A crawler is still to be implemented.

Our vision is a community working to provide top quality search particularly for hackers, funded purely by donations.

Why a non-profit search engine?

The motives of ad-funded search engine are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads which are often not clearly distinguished from search results. Also, eitland on Hacker News comments:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

  • YaCy - an open source distributed search engine
  • search.marginalia.nu - a search engine favouring text-heavy websites
  • Gigablast - a privacy-focused search engine whose owner makes money by selling the technology to third parties
  • Brave
  • DuckDuckGo

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision makes search very slow.

Marginalia Search is fantastic, but it is more of a personal project than an open source community.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To retrieve pages, we simply compute the hash of the terms in the user query and load the corresponding pages, filter the items to those containing the term and rank the items. Since each page is small, this can be done very quickly.

Because we compress the list of items, we can rank for more than a single term and maintain an index smaller than the inverted index design. Well, that's the theory. This idea has yet to be tested out on a large scale.

Crawling

Our current index is a small sample of the excellent Common Crawl, restricted to English content and domains which score highly on average in Hacker News submissions. It is likely for a variety of reasons that we will want to go beyond Common Crawl data at some point, so building a crawler becomes inevitable. We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.

How to contribute

There are lots of ways to help:

  • Volunteer to test out the distributed crawler when it's ready
  • Help out with development of the engine itself
  • Donate some money towards hosting costs and/or founding an official non-profit organisation

If you would like to help in any of these or other ways, thank you! Please email the main author (email address is in the git commit history).

Eland is a Python Elasticsearch client for exploring and analyzing data in Elasticsearch with a familiar Pandas-compatible API.

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

elastic 463 Dec 30, 2022
Es-schema - Common Data Schemas for Elasticsearch

Common Data Schemas for Elasticsearch The Common Data Schema for Elasticsearch i

Tim Schnell 2 Jan 25, 2022
An open source, non-profit search engine implemented in python

Mwmbl: No ads, no tracking, no cruft, no profit Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and

639 Jan 04, 2023
User-friendly, tiny source code searcher written by pure Python.

User-friendly, tiny source code searcher written in pure Python. Example Usages Cat is equivalent in the regular expression as '^Cat$' bor class Cat

Furkan Onder 106 Nov 02, 2022
Pysolr — Python Solr client

pysolr pysolr is a lightweight Python client for Apache Solr. It provides an interface that queries the server and returns results based on the query.

Haystack Search 626 Dec 01, 2022
A play store search application programming interface ( API )

Play-Store-API A play store search application programming interface ( API ) Made with Python3

Fayas Noushad 8 Oct 21, 2022
MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch.

MeiliSearch FastAPI MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch. Installation Using a virtual environmnet is recommen

Paul Sanders 29 Nov 18, 2022
Python Elasticsearch handler for the standard python logging framework

Python Elasticsearch Log handler This library provides an Elasticsearch logging appender compatible with the python standard logging library. This lib

Mohammed Mousa 0 Dec 08, 2021
rclip - AI-Powered Command-Line Photo Search Tool

rclip is a command-line photo search tool based on the awesome OpenAI's CLIP neural network.

Yurij Mikhalevich 394 Dec 12, 2022
document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

document organizer with tags and full-text-search, in a simple and clean sqlite3 schema

Manos Pitsidianakis 152 Oct 29, 2022
A sentence search engine that fetches examples from trusted news/media organisations. Great for writing better English.

A sentence search engine that fetches examples from trusted news/media websites. Great for improving writing & speaking better English.

Stephen Appiah 1 Apr 04, 2022
PwnWiki 数据库搜索命令行工具;该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目

PWSearch PwnWiki 数据库搜索命令行工具。该工具有点像 searchsploit 命令,只是搜索的不是 Exploit Database 而是 PwnWiki 条目。

K4YT3X 72 Dec 20, 2022
This project is a sample demo of Arxiv search related to AI/ML Papers built using Streamlit, sentence-transformers and Faiss.

This project is a sample demo of Arxiv search related to AI/ML Papers built using Streamlit, sentence-transformers and Faiss.

Karn Deb 49 Oct 30, 2022
Senginta is All in one Search Engine Scrapper for used by API or Python Module. It's Free!

Senginta is All in one Search Engine Scrapper. With traditional scrapping, Senginta can be powerful to get result from any Search Engine, and convert to Json. Now support only for Google Product Sear

33 Nov 21, 2022
A sphinx extension for designing beautiful, screen-size responsive web components.

sphinx-design A sphinx extension for designing beautiful, view size responsive web components. Created with inspiration from Bootstrap (v5), Material

Executable Books 109 Jan 01, 2023
ElasticSearch ODM (Object Document Mapper) for Python - pip install esengine

esengine - The Elasticsearch Object Document Mapper esengine is an ODM (Object Document Mapper) it maps Python classes in to Elasticsearch index/doc_t

SEEK International AI 109 Nov 22, 2022
Google Search Engine Results Pages (SERP) in locally, no API key, no signup required

Local SERP Google Search Engine Results Pages (SERP) in locally, no API key, no signup required Make sure the chromedriver and required package are in

theblackcat102 4 Jun 29, 2021
An image inline search telegram bot.

Image-Search-Bot An image inline search telegram bot. Note: Use Telegram picture bot. That is better. Not recommending to deploy this bot. Made with P

Fayas Noushad 24 Oct 21, 2022
ForFinder is a search tool for folder and files

ForFinder is a search tool for folder and files. You can use that when you Source Code Analysis at your project's local files or other projects that you are download. Enter a root path and keyword to

Çağrı Aliş 7 Oct 25, 2022
Free and Open, Distributed, RESTful Search Engine

Elasticsearch Elasticsearch is the distributed, RESTful search and analytics engine at the heart of the Elastic Stack. You can use Elasticsearch to st

elastic 62.4k Jan 08, 2023