Python programming language Test

Overview

Exercise

You are tasked with creating a data-processing app that pre-processes and enriches the data coming from crawlers, with the following requirements.

  • INPUT: csv-like data submitted by crawlers
  • OUTPUT: clean data saved into mongodb collections
  1. The app is an HTTP API server. Every year, crawlers will submit the data saved in a file, using the API endpoint designed by you.
  2. Examples of data the crawlers will submit every year: see data-2018.txt, data-2019.txt, data-2020.txt. You can't change the format of the data.
  3. As you can see, the data coming from crawlers is not 100% well-structured, the API should parse it correctly.
  4. a repeated submission with the data of the same year should perform an update on the existing yearly data.
  5. if there is any error in the submission or processing, the API should return a proper error message with proper HTTP response status
  6. For each university, enrich the data with a URL and a text description of it using Duckduckgo API e.g. https://api.duckduckgo.com/?q=harvard&format=json&pretty=1
  7. The app inserts or updates clean data in 2 mongodb tables/collections:
  • table 1 - the yearly data table
  • table 2 - the universities info
  1. table 1 contains data from every year, table 2 contains only the latest data.
  2. the data processing and transformations should be covered by tests.
  3. The solution should be in Python programming language, however you may use any 3rd party library you like.

Feel free to clarify the requirements further, if you have any doubts.

Bonus (Optional)

If you have indicated any DevOps skillsets in your resume, please create a Dockerfile, and using docker deploy the web app onto a free cloud platform, such as Heroku.

How the solution is assessed

The criteria are as follows (descending importance)

  1. Your code should perform the functionalities required
  2. Your code should be well-covered by tests
  3. Your code should be modular, readable and maintenable by other engineers.
  4. Your code should be robust, and can handle failure such as missing field, disconnection from DB or external server.
  5. Your code should be efficient and fast.
  6. Your code should be pretty.
Owner
Monirul Islam Khan
Database Engineer | DBA | Data Analyst | Data Scientist | Dev Ops
Monirul Islam Khan
A python package for bitclout.

BitClout.py A python package for bitclout. Developed by ItsAditya Run pip install bitclout to install the module! Examples of How To Use BitClout.py G

ItsAditya 9 Dec 31, 2021
Async Python Circuit Breaker implementation

aiocircuitbreaker This is an async Python implementation of the circuitbreaker library. Installation The project is available on PyPI. Simply run: $ p

5 Sep 05, 2022
Get a link to the web version of a git-tracked file or directory

githyperlink Get a link to the web version of a git-tracked file or directory. Applies to GitHub and GitLab remotes (and maybe others but those are no

Tomas Fiers 2 Nov 08, 2022
LOL英雄联盟云顶之弈挂机刷代币脚本,全自动操作,智能逻辑,功能齐全。

LOL云顶之弈挂机刷代币脚本 这是2019年全球总决赛写的一个云顶挂机脚本,python完成的。 功能: 自动拿牌卖牌 策略是高星策略,非固定阵容 自动登陆账号、打码、异常重启 战利品截图上传百度云 web中控发号,改密码,查看信息等 代码是三天赶出来的,所以有点混乱,WEB中控代码也不知道扔哪去了

77 Oct 10, 2022
Location of public benchmarking; primarily final results

CSL_public_benchmark This repo is intended to provide a periodically-updated, public view into genome sequencing benchmarks managed by HudsonAlpha's C

HudsonAlpha Institute for Biotechnology 15 Jun 13, 2022
JLC2KICAD_lib is a python script that generate a component library for KiCad from the JLCPCB/easyEDA library.

JLC2KiCad_lib is a python script that generate a component library (schematic, footprint and 3D model) for KiCad from the JLCPCB/easyEDA library. This script requires Python 3.6 or higher.

Nicolas Toussaint 73 Dec 26, 2022
BloodCheck enables Red and Blue Teams to manage multiple Neo4j databases and run Cypher queries against a BloodHound dataset.

BloodCheck BloodCheck enables Red and Blue Teams to manage multiple Neo4j databases and run Cypher queries against a BloodHound dataset. Installation

Mr B0b 16 Nov 05, 2021
Sheet2export - FreeCAD macro to export spreadsheet

Description This is FreeCAD macro to export spreadsheet to file.

Darek L 3 Jul 09, 2022
Automated moth pictures for biodiversity research

Automated moth pictures for biodiversity research

Ludwig Kürzinger 1 Dec 16, 2021
Auto check in via GitHub Actions

因为本人毕业离校,本项目交由在校的@hfut-xyc同学接手,请访问hfut-xyc/hfut_auto_check-in获得最新的脚本 本项目遵从GPLv2协定,Copyright (C) 2021, Fw[a]rd 免责声明 根据GPL协定,我、本项目的作者,不会对您使用这个脚本带来的任何后果

Fw[a]rd 3 Jun 27, 2021
A general illumination correction method for optical microscopy.

CIDRE About CIDRE is a retrospective illumination correction method for optical microscopy. It is designed to correct collections of images by buildin

Kevin Smith 31 Sep 07, 2022
A little tool that uses LLVM to extract simple "what does this do" level instruction information from all architectures.

moirai: MOre InstRuctions and Information Backcronym. Anyway, this is a small project to extract useful instruction definitions from LLVM's platform d

2 Jul 30, 2022
A set of tools for ripping music from Konami mobile games

Konami Mobile Ripping Toolset A set of tools for ripping music from Konami mobile games Contents nigger.py for niggering konami's website, ripping all

5 Oct 20, 2022
Always fill your package requirements without the user having to do anything! Simple and easy!

WSL Should now work always-fill-reqs-python3 Always fill your package requirements without the user having to do anything! Simple and easy! Supported

Hashm 7 Jan 19, 2022
Minitel 5 somewhat reverse-engineered

Minitel 5 The Minitel was a french dumb terminal with an embedded modem which had its Golden Age before the rise of Internet. Typically cubic, with an

cLx 10 Dec 28, 2022
Compress .dds file in ggpk to boost fps. This is a python rewrite of PoeTexureResizer.

PoeBooster Compress .dds file in ggpk to boost fps. This is a python rewrite of PoeTexureResizer. Setup Install ImageMagick-7.1.0. Download and unzip

3 Sep 30, 2022
TriOTP, the OTP framework for Python Trio

TriOTP, the OTP framework for Python Trio See documentation for more informations. Introduction This project is a simplified implementation of the Erl

David Delassus 7 Nov 21, 2022
🏆 A ranked list of awesome Python open-source libraries and tools. Updated weekly.

Best-of Python 🏆 A ranked list of awesome Python open-source libraries & tools. Updated weekly. This curated list contains 230 awesome open-source pr

Machine Learning Tooling 2.7k Jan 03, 2023
Simple Wayland HotKey Daemon

swhkd Simple Wayland HotKey Daemon This project is still very new and I'm making new decisions everyday as to where I should drive this project. I'm u

Aakash Sen Sharma 407 Dec 30, 2022
A carrot-based color palette you didn't know you needed.

A package to produce a carrot-inspired color palette for python/matplotlib. Install: pip install carrotColors Update: pip install --upgrade carrotColo

10 Sep 28, 2021