Build a small, three-domain internet using GitHub Pages and Wikipedia, then construct a crawler to crawl, render, and index it.

Last update: Nov 24, 2022

Overview

TechSEO Crawler

Play with the results here: Simple Search Engine

Please Note: The link above is hosted on a small AWS box, so if you have issues loading, try again later.

Slideshare is here: Building a Simple Crawler on a Toy Internet

Description

Web Folder

To crawl a small internet of sites, we first have to create it. This tool creates three small sites from Wikipedia data and hosts them on GitHub Pages. The sites are linked to each other but not to any other site on the internet.
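Because the three sites only link to each other, link-based scoring stays inside a closed graph. The following is a minimal sketch of PageRank by power iteration on such a graph. The site names and link structure are hypothetical, and this is an illustration rather than the repo's own implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph for three sites that only link to each other.
toy_internet = {
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-c"],
    "site-c": ["site-a"],
}
ranks = pagerank(toy_internet)
for site, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{site}: {score:.3f}")
```

In this toy graph, site-c receives links from both other sites, so it ends up with the highest score.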

Main function

This tool attempts to implement a small ecosystem of three websites, along with a simple crawler, renderer, and indexer. Although the author did research to construct the repo, preferring simplicity over complexity was a deliberate design choice. Items that are part of large crawling infrastructures, most notably disparate systems and highly efficient code and data storage, are not part of this repo. We focus on simple representations of each item so that the repo is more accessible to newer developers.
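As a rough picture of how a crawler and an indexer fit together, here is a minimal sketch that crawls a hypothetical in-memory set of pages breadth-first and builds an inverted index from their text. The URLs, page contents, and function names are assumptions made for illustration. The rendering step is skipped entirely, and the repo's actual code may be organized differently.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "internet": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("python crawler basics", ["b.example/"]),
    "b.example/": ("rendering pages with headless chrome", ["c.example/", "a.example/"]),
    "c.example/": ("indexing text for search", []),
}


def crawl(seed):
    """Breadth-first crawl from the seed; returns URLs in visit order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in PAGES[url][1]:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return order


def build_index(urls):
    """Inverted index: term -> set of URLs whose text contains the term."""
    index = defaultdict(set)
    for url in urls:
        for term in PAGES[url][0].lower().split():
            index[term].add(url)
    return index


visited = crawl("a.example/")
index = build_index(visited)
print(visited)
print(sorted(index["pages"]))
```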

Parts:

PageRank
Chrome Headless Rendering
Text NLP Normalization
BERT Embeddings
Robots
Duplicate Content Shingling
URL Hashing
Document Frequency Functions (BM25 and TF-IDF)
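Two of the parts above, URL hashing and duplicate content shingling, can be sketched with the standard library alone. This is a minimal illustration: the hash function (SHA-256), the shingle size of three words, and the sample URL and sentences are all assumptions, not choices confirmed by the repo.

```python
import hashlib


def url_hash(url):
    """Stable hex digest of a URL, usable as a compact document key."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") for a piece of text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_one = "the quick brown fox jumps over the lazy dog"
doc_two = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc_one), shingles(doc_two))
print(url_hash("https://myorganization.github.io/"))
print(round(similarity, 2))
```

A crawler could treat two pages as near-duplicates when their similarity exceeds a chosen threshold. The threshold is a tuning decision and is not specified here.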

Made for a presentation at Tech SEO Boost

Getting Started

Get the repo

git clone https://github.com/jroakes/tech-seo-crawler.git

Dependencies

Please see the requirements.txt file for a list of dependencies.

We strongly suggest doing the following first, in a new, clean environment.

If you are on Windows, you may need to install Microsoft Build Tools (http://go.microsoft.com/fwlink/?LinkId=691126&fixForIE=.exe.) and upgrade setuptools: pip install --upgrade setuptools
Install PyTorch: pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
See the requirements-libraries.txt file for the remaining library requirements. To install the frozen requirements this was developed with, use pip install -r requirements.txt

Install with:

pip install -r requirements.txt

Executing program

1. Make sure you've created your three sites first. See the README file in the web folder. Alternatively, if you just want to use the crawler/renderer, you can run with the premade sites and skip to step 3.
2. After creating your three sites, go to the config file and add the crawler_seed URL. This will be the organization name you created on github.io. For example: myorganization.github.io/
3. Run streamlit run main.py in the terminal or command prompt. A new browser window should open.
4. The tool can also be run interactively with the Run.ipynb notebook in Jupyter.
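Before starting a crawl, it can help to sanity-check the seed value from the steps above. The sketch below uses a hypothetical helper, normalize_seed, which is not part of the repo. Whether the config file expects a scheme such as https:// is an assumption, so check the config file itself for the exact format.

```python
from urllib.parse import urlparse


def normalize_seed(seed):
    """Return an https URL for a seed such as 'myorganization.github.io/'.

    Hypothetical helper for illustration; not part of the repo.
    """
    seed = seed.strip()
    if "://" not in seed:
        seed = "https://" + seed  # assumption: the sites are served over https
    parsed = urlparse(seed)
    if not parsed.netloc.endswith(".github.io"):
        raise ValueError(f"expected a github.io host, got {parsed.netloc!r}")
    path = parsed.path or "/"
    return f"{parsed.scheme}://{parsed.netloc}{path}"


crawler_seed = normalize_seed("myorganization.github.io/")
print(crawler_seed)
```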

Sharing

If you want to share your search engine with others, you can use Streamlit and Localtunnel.

1. Install Localtunnel: npm install -g localtunnel
2. Start the tunnel: lt --port 80 --subdomain <create a unique sub-domain name>
3. Start the Streamlit server: streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress <the unique subdomain from step 2>.localtunnel.me
4. Navigate to https://<the unique subdomain from step 2>.localtunnel.me in your browser, or share the link with a friend.

Complete example:

In a new terminal:

npm install -g localtunnel
lt --port 80 --subdomain tech-seo-crawler

In another terminal:

cd /tech-seo-crawler/
activate techseo
streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress tech-seo-crawler.localtunnel.me

Troubleshooting

When running in Streamlit, we experienced a few "connection closed" errors during the rendering process. If you hit this error, rerun the script by opening the top-right menu in Streamlit and clicking Rerun.
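If you adapt the code yourself, one alternative to rerunning by hand is to retry the failing call a few times. The sketch below is a generic retry wrapper with a simulated flaky render function. Both with_retries and flaky_render are hypothetical names, and the exception type the real renderer raises is an assumption, so adjust the except clause to match what you actually see.

```python
import asyncio


async def with_retries(make_coro, attempts=3, delay=0.0):
    """Await make_coro() up to `attempts` times, retrying on ConnectionError."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_coro()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)


calls = {"count": 0}


async def flaky_render():
    """Stand-in for a render call: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection closed")
    return "<html>rendered</html>"


html = asyncio.run(with_retries(flaky_render))
print(html, calls["count"])
```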

Contributors

Contributors' names and contact info

JR Oakes @jroakes
Robert Padgett @robertcpadgett

Version History

0.1 - Alpha
- Initial Release

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

Libraries

ghPublish
pandas # What would we all do without Pandas?
gensim
pyppeteer
scikit-learn
streamlit
DIP # I don't know who you are, but thanks for my go-to text normalization pipeline.
