An open source, non-profit search engine implemented in Python

Last update: Jan 04, 2023

Mwmbl: No ads, no tracking, no cruft, no profit

Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on usability and speed. At the moment it is little more than an idea, together with a proof-of-concept implementation of the web front-end and search technology on a very small index. A crawler is still to be implemented.

Our vision is a community working to provide top quality search particularly for hackers, funded purely by donations.

Why a non-profit search engine?

The motives of ad-funded search engines are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads, which are often not clearly distinguished from search results. Also, eitland on Hacker News comments:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

YaCy - an open source distributed search engine
search.marginalia.nu - a search engine favouring text-heavy websites
Gigablast - a privacy-focused search engine whose owner makes money by selling the technology to third parties
Brave
DuckDuckGo

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision makes search very slow.

Marginalia Search is fantastic, but it is more of a personal project than an open source community.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To answer a query, we compute the hash of each term in the user query, load the corresponding pages, filter the items to those containing the term, and rank what remains. Since each page is small, this can be done very quickly.
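The store and lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the hash function (BLAKE2b from hashlib), the value of N, the Item fields, the in-memory list of pages, and the ranking rule (count of matching query terms in the title) are all hypothetical choices made for the example.

```python
import hashlib
from collections import namedtuple

# Hypothetical item shape; the source does not say which fields an item has.
Item = namedtuple("Item", ["title", "url"])

NUM_PAGES = 64  # N: assumed value for the example only


def page_index(term, num_pages=NUM_PAGES):
    """Map a term to a page number between 0 and num_pages - 1."""
    # A stable hash is needed; Python's built-in hash() is salted per process.
    digest = hashlib.blake2b(term.lower().encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_pages


def add_item(pages, term, item):
    """Store an item in the page that the term hashes to."""
    page = pages[page_index(term, len(pages))]
    if item not in page:
        page.append(item)


def search(pages, query):
    """Load the page for each query term, filter, then rank by matches."""
    scores = {}
    for term in query.lower().split():
        for item in pages[page_index(term, len(pages))]:
            # Unrelated terms can share a page, so filter on the term itself.
            if term in item.title.lower().split():
                scores[item] = scores.get(item, 0) + 1
    return sorted(scores, key=lambda item: (-scores[item], item.url))


pages = [[] for _ in range(NUM_PAGES)]
doc = Item("Python search engine", "https://example.com/a")
other = Item("Rust search library", "https://example.com/b")
add_item(pages, "python", doc)
add_item(pages, "search", doc)
add_item(pages, "search", other)

results = search(pages, "python search")
```

Here results lists doc before other, because doc matches both query terms. The filter step matters because several terms can hash to the same page, so a loaded page may contain items that have nothing to do with the query.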

Because we compress the list of items, each item can rank for more than a single term while the index stays smaller than an equivalent inverted index. Well, that's the theory. This idea has yet to be tested out on a large scale.
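To make the fixed-size compressed page concrete, here is a minimal sketch of packing a list of items into 4096 bytes. The serialisation (JSON), the compressor (zlib), the zero padding and the policy of dropping items from the end until the page fits are all assumptions for illustration; the text above does not say which of these the real index uses.

```python
import json
import zlib

PAGE_SIZE = 4096  # bytes per page, as stated in the text


def pack_page(items, page_size=PAGE_SIZE):
    """Compress the longest prefix of items that fits in one page.

    Returns the padded page and the number of items that were kept.
    """
    kept = list(items)
    while True:
        data = zlib.compress(json.dumps(kept).encode("utf-8"))
        if len(data) <= page_size:
            # Pad with zero bytes so every page has exactly the same size.
            return data.ljust(page_size, b"\0"), len(kept)
        kept.pop()  # assumed policy: drop the last (lowest-priority) item


def unpack_page(page):
    """Recover the item list from a padded page."""
    decompressor = zlib.decompressobj()
    # Bytes after the end of the zlib stream (the padding) are left in
    # decompressor.unused_data and do not affect the output.
    raw = decompressor.decompress(page)
    return json.loads(raw.decode("utf-8"))


items = [["Title %d" % i, "https://example.com/%d" % i] for i in range(1000)]
page, kept = pack_page(items)
restored = unpack_page(page)
```

After this runs, page is exactly PAGE_SIZE bytes long and restored equals the first kept entries of items. How many items fit depends entirely on how well the data compresses.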

Crawling

Our current index is a small sample of the excellent Common Crawl, restricted to English content and to domains which score highly on average in Hacker News submissions. For a variety of reasons we will probably want to go beyond Common Crawl data at some point, so we will need to build a crawler. We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.
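As a rough illustration of the domain restriction, the sketch below averages submission scores per domain and keeps the domains above a cut-off. The input format (pairs of URL and score), the function name top_domains and the threshold of 50 are hypothetical; the source does not describe how the selection was actually done.

```python
from collections import defaultdict
from urllib.parse import urlparse


def top_domains(submissions, min_average=50.0):
    """Return the domains whose average submission score is high enough."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [score sum, count]
    for url, score in submissions:
        domain = urlparse(url).netloc.lower()
        totals[domain][0] += score
        totals[domain][1] += 1
    return {
        domain
        for domain, (total, count) in totals.items()
        if total / count >= min_average
    }


# Made-up sample data, for the example only.
sample = [
    ("https://example.org/a", 120),
    ("https://example.org/b", 80),
    ("https://example.com/x", 3),
]
allowed = top_domains(sample)
```

With this sample, only example.org passes the threshold, since its two submissions average 100 while example.com averages 3.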

How to contribute

There are lots of ways to help:

Volunteer to test out the distributed crawler when it's ready
Help out with development of the engine itself
Donate some money towards hosting costs and/or founding an official non-profit organisation

If you would like to help in any of these or other ways, thank you! Please email the main author (email address is in the git commit history).
