bigdata_analyse: Big Data Analysis Project

Last update: Dec 30, 2022


Overview


wish

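By analyzing datasets from different industries with different technology stacks, this project aims to:

Understand the business analysis metrics used in different domains
Deepen data processing, data analysis, and data visualization skills
Gain hands-on experience with big data batch processing and stream processing
Gain hands-on experience with data mining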

tip

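The project mainly uses Python, SQL, and HQL.
.ipynb files can be opened with Jupyter Notebook; for how to install it, see Jupyter Notebook.

Jupyter Notebook is a web-based interactive Python editor that can be installed directly with pip. It also supports Markdown, which makes it well suited to data analysis and visualization, as well as to writing articles and sample code.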

list

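Topic	Processing mode	Tech stack	Dataset download
Analysis of 100 million Taobao user behavior records	Offline processing	Cleaning: hive + analysis: hive + visualization: echarts	Aliyun or Baidu Netdisk (extraction code: 5ipq)
Real-time analysis of 10 million Taobao user behavior records	Real-time processing	Data source: kafka + real-time analysis: flink + visualization: es + kibana	Baidu Netdisk (extraction code: m4mc)
Analysis of 3 million player records from the game Brutal Age (《野蛮时代》)	Offline processing	Cleaning: pandas + analysis: mysql + visualization: pyecharts	Baidu Netdisk (extraction code: paq4)
Analysis of 1.3 million Shenzhen Tong card-swipe records	Offline processing	Cleaning: pandas + analysis: impala + visualization: dbeaver	Baidu Netdisk (extraction code: t561)
Analysis of 100,000 Xiamen job posting records	Offline processing	Cleaning: pandas + analysis: hive + visualization: hue + pyecharts + prediction: sklearn	Baidu Netdisk (extraction code: 9wx0)
Analysis of 7,000 rental listings	Offline processing	Cleaning: pandas + analysis: sqlite + visualization: matplotlib	Baidu Netdisk (extraction code: 9en3)
Analysis of 6,000 records of failed companies	Offline processing	Cleaning: pandas + analysis: pandas + visualization: jupyter notebook + pyecharts	Baidu Netdisk (extraction code: xvgm)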
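
As a rough illustration of the "clean with pandas, analyze with sqlite" pattern listed for the rental dataset, the sketch below runs the two stages on a tiny in-memory table. It is not taken from the project's notebooks: the column names (`district`, `rent`, `area_m2`), the sample rows, and the helper functions `clean` and `average_rent_by_district` are all hypothetical and exist only to show the shape of such a pipeline.

```python
import sqlite3

import pandas as pd

# Hypothetical rental listings; these rows are made up for illustration only
# and do not come from the project's dataset.
raw = pd.DataFrame(
    {
        "district": ["A", "A", "B", "B", "B", None],
        "rent": ["3000", "3500", "5000", "bad", "5200", "4000"],
        "area_m2": [40, 50, 60, 55, 65, 45],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing district or a non-numeric rent."""
    out = df.copy()
    # Non-numeric rent values become NaN so they can be dropped below.
    out["rent"] = pd.to_numeric(out["rent"], errors="coerce")
    out = out.dropna(subset=["district", "rent"])
    out["rent"] = out["rent"].astype(int)
    return out.reset_index(drop=True)


def average_rent_by_district(df: pd.DataFrame) -> pd.DataFrame:
    """Load the cleaned frame into an in-memory SQLite table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        df.to_sql("listings", conn, index=False)
        query = """
            SELECT district, COUNT(*) AS n, AVG(rent) AS avg_rent
            FROM listings
            GROUP BY district
            ORDER BY district
        """
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()


cleaned = clean(raw)
summary = average_rent_by_district(cleaned)
print(summary)
```

The resulting `summary` frame could then be handed to a plotting library such as matplotlib for the visualization stage. Keeping cleaning and analysis in separate functions mirrors the cleaning / analysis / visualization split in the table above.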

refer

https://tianchi.aliyun.com/dataset/

https://opendata.sz.gov.cn/data/api/toApiDetails/29200_00403601

https://www.kesci.com/home/dataset

Owner

Way

