LinkedIn webscraping - LinkedIn web scraping with Python

Overview

linkedin_webscraping

This is the first step of a larger project called "LinkedIn Job Posting Analysis". It is a data ingestion (Extract and Load) procedure that retrieves information about job requirements in data-related fields (Data Science, Data Engineering, Data Analysis, etc.).

I started by using Selenium to navigate through the LinkedIn jobs page and search for the desired job keyword. Once enough job postings had been found, I used the BeautifulSoup library to inspect the page and get the full link of each posting. This is our first function, get_links.
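The link-extraction step can be illustrated with a minimal sketch. This is not the project's actual get_links: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies, and it works on a hard-coded HTML sample instead of a live Selenium page source. The "/jobs/view/" marker, the base URL default, and the sample markup are assumptions made only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class JobLinkParser(HTMLParser):
    """Collect href values of <a> tags that look like job postings."""

    def __init__(self, base_url, marker="/jobs/view/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed URL fragment that identifies a job post
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or self.marker not in href:
            return
        # Make the link absolute and drop the query string and fragment.
        parts = urlsplit(urljoin(self.base_url, href))
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if clean not in self.links:  # keep page order, skip duplicates
            self.links.append(clean)


def get_links(page_source, base_url="https://www.linkedin.com"):
    """Hypothetical stand-in: return the job links found in page_source."""
    parser = JobLinkParser(base_url)
    parser.feed(page_source)
    return parser.links


# Made-up markup; with Selenium this would be driver.page_source.
sample_html = """
<ul>
  <li><a href="/jobs/view/data-engineer-111?refId=abc">Data Engineer</a></li>
  <li><a href="https://www.linkedin.com/jobs/view/data-scientist-222">Data Scientist</a></li>
  <li><a href="/jobs/view/data-engineer-111?refId=xyz">Data Engineer (again)</a></li>
  <li><a href="/company/example">Company page</a></li>
</ul>
"""

links = get_links(sample_html)
print(links)
```

Dropping the query string is a design choice in this sketch: it lets the same posting reached through different tracking parameters collapse into a single link.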

Then, looping through that list and using BeautifulSoup, I was able to get the Job Title, Company Name, Job Location and Job Description for each job link. After some filtering on the list of descriptions, the retrieved data was put into a dictionary and turned into a Pandas DataFrame. This is our second function, jobs_dataframe, and it returns something like this:

jobs_dataframe
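As a rough sketch of the filtering and dictionary-building step (not the project's jobs_dataframe itself), the snippet below cleans raw description markup and assembles a column-oriented dictionary. The helper names clean_description and build_columns, the column names, and the sample records are all hypothetical. It uses only the standard library; a dictionary of equal-length lists like this one is a shape that pandas.DataFrame accepts directly.

```python
import re
from html import unescape


def clean_description(raw):
    """Strip tags, decode HTML entities and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # crude tag removal, fine for a sketch
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def build_columns(records):
    """Turn per-job dicts into a dict of equal-length column lists."""
    columns = {"title": [], "company": [], "location": [], "description": []}
    for record in records:
        description = clean_description(record.get("description", ""))
        if not description:  # skip postings with no usable text
            continue
        columns["title"].append(record["title"].strip())
        columns["company"].append(record["company"].strip())
        columns["location"].append(record["location"].strip())
        columns["description"].append(description)
    return columns


# Made-up records standing in for what the scraper would collect.
records = [
    {
        "title": " Data Engineer ",
        "company": "Acme",
        "location": "Remote",
        "description": "<p>Build &amp; maintain\n  pipelines</p>",
    },
    {
        "title": "Data Analyst",
        "company": "Globex",
        "location": "Lisbon",
        "description": "<div>   </div>",  # empty after cleaning, so dropped
    },
]

columns = build_columns(records)
print(columns)
# With pandas installed: pandas.DataFrame(columns)
```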

Finally, after some light validation, the data is ready to be stored in a database. For this, I created a SQLite connection and a table using the sqlalchemy library to write SQL in Python. We can see the results in the picture below:

jobs_in_database
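The validate-then-store step might look roughly like the sketch below. It deliberately uses the standard library's sqlite3 module as a stand-in for sqlalchemy, and an in-memory database instead of a file, so it can run anywhere. The table layout, the is_valid rule, and the store_jobs helper are assumptions, not the project's actual code.

```python
import sqlite3

REQUIRED = ("title", "company", "location", "description")


def is_valid(row):
    """A row is valid when every required field is a non-empty string."""
    return all(isinstance(row.get(key), str) and row[key].strip() for key in REQUIRED)


def store_jobs(conn, rows):
    """Create the jobs table if needed and insert only the valid rows."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            company TEXT NOT NULL,
            location TEXT NOT NULL,
            description TEXT NOT NULL
        )
        """
    )
    valid = [row for row in rows if is_valid(row)]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO jobs (title, company, location, description) "
            "VALUES (:title, :company, :location, :description)",
            valid,
        )
    return len(valid)


# Made-up rows; the third one fails validation (empty description).
rows = [
    {"title": "Data Engineer", "company": "Acme", "location": "Remote",
     "description": "Build pipelines"},
    {"title": "Data Scientist", "company": "Globex", "location": "Lisbon",
     "description": "Train models"},
    {"title": "Data Analyst", "company": "Initech", "location": "Austin",
     "description": "   "},
]

conn = sqlite3.connect(":memory:")  # a file path would persist the data
inserted = store_jobs(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(inserted, count)
```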

Although we can already do some data analysis, and maybe some machine learning, with the data we have, I want to stress that this is an ongoing project, for a few reasons:

  • First, I want to migrate this data from SQLite to a PostgreSQL database (so I have more freedom to edit it) and create relational tables, with an efficient way to relate them;
  • Second, it may be possible to refine the description column a bit more and normalize the whole table;
  • Last but not least, this is just the first step of a bigger project, as I said earlier. So we'll probably make a lot of changes along the way, even though we may still use the EtLT pattern for the engineering.
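To make the idea of relational tables in the first two points more concrete, here is one possible shape for a normalized layout, sketched with the standard library's sqlite3 module rather than PostgreSQL. The companies and jobs tables, their columns, and the add_job helper are purely hypothetical; the planned schema is not described in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        company_id INTEGER NOT NULL REFERENCES companies(id)
    );
    """
)


def add_job(conn, title, company):
    """Insert a job, reusing the company row when it already exists."""
    # INSERT OR IGNORE is SQLite syntax; PostgreSQL would use
    # INSERT ... ON CONFLICT DO NOTHING instead.
    conn.execute("INSERT OR IGNORE INTO companies (name) VALUES (?)", (company,))
    company_id = conn.execute(
        "SELECT id FROM companies WHERE name = ?", (company,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO jobs (title, company_id) VALUES (?, ?)", (title, company_id)
    )


# Made-up data: two postings share one company.
add_job(conn, "Data Engineer", "Acme")
add_job(conn, "Data Scientist", "Acme")
add_job(conn, "Data Analyst", "Globex")
conn.commit()

company_count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
joined = conn.execute(
    "SELECT j.title, c.name FROM jobs j "
    "JOIN companies c ON c.id = j.company_id ORDER BY j.id"
).fetchall()
print(company_count, joined)
```

Storing each company name once and referencing it by id is the kind of relation the first point hints at; the same pattern could be applied to locations.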

Dependencies

This project was made using Python 3.10.0.

Executing

To run this project, in addition to Python, you'll need ChromeDriver, SQLite and its Python libraries installed on your computer or in a virtual environment, with chromedriver.exe in your project's folder. Then run the linkedin_scraper.py file in your terminal window. Next, open the scraping_jobs notebook and set the job_keyword variable to the keyword string you're interested in. Finally, run all cells, and you're ready to open the data you've just collected in your database administration tool (mine is DBeaver).

Author

Pedro Dib ([email protected])

Thanks

Thanks a lot to Igor Magalhães for the project idea, and for helping me with tips on writing good code and best practices on documentation.

Owner
Pedro Dib
Complete pipeline for crawling online newspaper article.

Complete pipeline for crawling online newspaper article. The articles are stored to MongoDB. The whole pipeline is dockerized, thus the user does not need to worry about dependencies. Additionally, d

newspipe 4 May 27, 2022
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 03, 2021
tweet random sand cat pictures

sandcatbot setup pip3 install --user -r requirements.txt cp sandcatbot.example.conf sandcatbot.conf vim sandcatbot.conf running the first parameter i

jess 8 Aug 07, 2022
Works very well and you can ask for the type of image you want the scraper to collect.

Works very well and you can ask for the type of image you want the scraper to collect. Also follows a specific URL path depending on keyword selection.

Memo Sim 1 Feb 17, 2022
This repo has the source code for the crawler and data crawled from auto-data.net

This repo contains the source code for crawler and crawled data of cars specifications from autodata. The data has roughly 45k cars

Tô Đức Anh 5 Nov 22, 2022
A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

7 Dec 03, 2022
A Pixiv web crawler module

Pixiv-spider A Pixiv spider module WARNING It's an unfinished work, browsing the code carefully before using it. Features 0004 - Readme.md updated, co

Uzuki 1 Nov 14, 2021
A Python module to bypass Cloudflare's anti-bot page.

cloudscraper A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests.

VeNoMouS 2.6k Dec 31, 2022
fork huanghyw/jd_seckill

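Jd_Seckill Special notice: Any scripts involved in the jd_seckill project published in this repository are for testing, learning, and research only. Commercial use is prohibited, and their legality, accuracy, completeness, and validity cannot be guaranteed; please judge for yourself according to the circumstances. No official account or self-media outlet may repost or publish any resource file in this project in any form.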

512 Jan 03, 2023
A Web Scraping API for MDL (My Drama List) for Python.

PyMDL An API for MyDramaList(MDL) based on webscraping for python. Description An API for MDL to make your life easier in retriving and working on dat

6 Dec 10, 2022
Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python A propos de ce document This is a markdown document about Web scraping images and videos using Selenium

Nafaa BOUGRAINE 3 Jul 01, 2022
UsernameScraperTool - Username Scraper Tool With Python

UsernameScraperTool Username Scraper for 40+ Social sites. How To use git clone

E4crypt3d 1 Dec 20, 2022
This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

crawler_to_visual_gmane Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library. This is a set of tools that al

Saim Zafar 1 Dec 20, 2021
Grab the changelog from releases on Github

release-notes-scraper This simple script can be used to grab the release notes for projects from github that do not keep a CHANGELOG, but publish thei

Dan Čermák 4 Apr 01, 2022
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
Jobinja.ir jobs scraper.

Jobinja.ir Dataset Introduction This project is a simple web scraper that scraps pages of jobinja.ir concurrently and writes and update (if file gets

Iman Kermani 3 Apr 15, 2022
Scrapy-based cyber security news finder

Cyber-Security-News-Scraper Scrapy-based cyber security news finder Goal To keep up to date on the constant barrage of information within the field of

2 Nov 01, 2021
SmartScraper: a simple, automatic, and fast Python web scraper

SmartScraper: a simple, automatic, and fast Python web scraper Note: The origin developer of SmartScraper is Alireza Mika, I only change a little code of AutoScraper. SmartScraper

DaDeng 9 Apr 16, 2022
Parse feeds in Python

feedparser - Parse Atom and RSS feeds in Python. Copyright 2010-2020 Kurt McKee

Kurt McKee 1.5k Dec 30, 2022