A web scraper built with Beautiful Soup that fetches Udemy course information and converts it to a JSON, CSV, or XML file.

Overview

Udemy Scraper

A web scraper built with Beautiful Soup that fetches Udemy course information.

Installation

Virtual Environment

It is recommended to install and run this inside a virtual environment. You can create one with the virtualenv library and then activate it.

pip install virtualenv

virtualenv somerandomname

Activating for *nix

source somerandomname/bin/activate

Activating for Windows

somerandomname\Scripts\activate

Package Installation

pip install -r requirements.txt

Chrome setup

Make sure Chrome is installed, and install the matching version of chromedriver. A Windows binary is already provided. If you are on Linux, you can download the Linux chromedriver binary from its page.

Approach

Scraping most websites is fairly easy, but some sites are not scrape-friendly. Scraping in itself is perfectly legal; however, there have been lawsuits over web scraping, and some companies (*cough* Amazon *cough*) consider scraping their website illegal even though they scrape other websites themselves. Then there are sites like Udemy that actively try to prevent people from scraping them.

Using BS4 by itself doesn't return the required results, so I had to use a browser engine through Selenium to fetch the course information. Initially even that didn't work, but then I realised the courses were being fetched asynchronously, so I had to add a bit of delay. As a result, fetching the data can be a bit slow initially.
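
One way to keep such a delay as short as possible is to poll until the content appears instead of sleeping for a fixed time. The sketch below illustrates the idea without a browser; `wait_until` and `FakePage` are hypothetical names made up for this example and are not part of udemyscraper or Selenium. Selenium offers the same pattern through `WebDriverWait(driver, timeout).until(condition)`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.1):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose course list loads asynchronously."""

    def __init__(self, load_time):
        self._ready_at = time.monotonic() + load_time

    def find_courses(self):
        # Returns an empty list until the simulated load time has passed.
        if time.monotonic() >= self._ready_at:
            return ["course-card"]
        return []


page = FakePage(load_time=0.3)
start = time.monotonic()
courses = wait_until(page.find_courses, timeout=5.0)
elapsed = time.monotonic() - start
print(courses, f"found after {elapsed:.1f}s rather than a fixed 5s sleep")
```

With a fixed sleep the script always pays the full delay; with polling it continues as soon as the data is there and only hits the timeout when something is wrong.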

Functionality

As of this commit, the script can search Udemy for the search term you enter and fetch the course link along with the other overview details such as description, instructor, duration, and rating.

Here is a JSON representation of the data it can currently fetch:

{
  "query": "The Complete Angular Course: Beginner to Advanced",
  "link": "https://udemy.com/course/the-complete-angular-master-class/",
  "title": "The Complete Angular Course: Beginner to Advanced",
  "headline": "The most comprehensive Angular 4 (Angular 2+) course. Build a real e-commerce app with Angular, Firebase and Bootstrap 4",
  "instructor": "Mosh Hamedani",
  "rating": "4.5",
  "duration": "29.5 total hours",
  "no_of_lectures": "376 lectures",
  "tags": ["Development", "Web Development", "Angular"],
  "no_of_rating": "23,910",
  "no_of_students": "96,174",
  "course_language": "English",
  "objectives": [
    "Establish yourself as a skilled professional developer",
    "Build real-world Angular applications on your own",
    "Troubleshoot common Angular errors",
    "Master the best practices",
    "Write clean and elegant code like a professional developer"
  ],
  "Sections": [
    {
      "name": "Introduction",
      "lessons": [{ "name": "Introduction" }, { "name": "What is Angular" }],
      "no_of_lessons": 12
    },
    {
      "name": "TypeScript Fundamentals",
      "lessons": [
        { "name": "Introduction" },
        { "name": "What is TypeScript?" }
      ],
      "no_of_lessons": 18
    },
    {
      "name": "Angular Fundamentals",
      "lessons": [
        { "name": "Introduction" },
        { "name": "Building Blocks of Angular Apps" }
      ],
      "no_of_lessons": 10
    }
  ],
  "requirements": [
    "Basic familiarity with HTML, CSS and JavaScript",
    "NO knowledge of Angular 1 or Angular 2 is required"
  ],
  "description": "\nAngular is one of the most popular frameworks for building client apps with HTML, CSS and TypeScript. If you want to establish yourself as a front-end or a full-stack developer, you need to learn Angular.\n\nIf you've been confused or frustrated jumping from one Angular 4 tutoria...",
  "target_audience": [
    "Developers who want to upgrade their skills and get better job opportunities",
    "Front-end developers who want to stay up-to-date with the latest technology"
  ],
  "banner": "https://foo.com/somepicture.jpg"
}
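
If the course has been saved as JSON in this shape, it can be read back with the standard library. The snippet below is a minimal sketch that uses a trimmed copy of the sample above as an inline string; the variable names are only illustrative. Note that numeric-looking fields such as `rating` and `no_of_students` are strings in the sample, and that the `Sections` key starts with a capital letter.

```python
import json

# A trimmed-down copy of the sample output shown above.
raw = """
{
  "title": "The Complete Angular Course: Beginner to Advanced",
  "instructor": "Mosh Hamedani",
  "rating": "4.5",
  "no_of_students": "96,174",
  "Sections": [
    {"name": "Introduction",
     "lessons": [{"name": "Introduction"}, {"name": "What is Angular"}],
     "no_of_lessons": 12},
    {"name": "TypeScript Fundamentals",
     "lessons": [{"name": "Introduction"}, {"name": "What is TypeScript?"}],
     "no_of_lessons": 18}
  ]
}
"""

course = json.loads(raw)

# These values are stored as strings, so convert them explicitly.
rating = float(course["rating"])
students = int(course["no_of_students"].replace(",", ""))

# Collect section names and add up the lesson counts.
section_names = [section["name"] for section in course["Sections"]]
total_lessons = sum(section["no_of_lessons"] for section in course["Sections"])

print(rating, students, section_names, total_lessons)
```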

Usage

To use the scraper, import it as a module:

from udemyscraper import UdemyCourse

This imports the UdemyCourse class. Create an instance of it and pass it the search query, preferably the exact course name.

from udemyscraper import UdemyCourse

javascript_course = UdemyCourse("Javascript course for beginners")

This will create an empty instance of UdemyCourse. To fetch the data, you need to call the fetch_course function.

javascript_course.fetch_course()

Now that you have the course, you can access all of the course's data as shown here.

print(javascript_course.Sections[2].lessons[1].name)  # Prints the name of the 2nd lesson in the 3rd section
Comments
  • pip install fails

    Describe the bug: Unable to install udemyscraper via pip install.

    To Reproduce: ERROR: Cannot install udemyscraper==0.8.1 and udemyscraper==0.8.2 because these package versions have conflicting dependencies.

    The conflict is caused by:
        udemyscraper 0.8.2 depends on getopt2==0.0.3
        udemyscraper 0.8.1 depends on getopt2==0.0.3

    Desktop (please complete the following information):

    • OS: macOS
    bug 
    opened by nuggetsnetwork 5
  • udemyscraper times out

    Describe the bug: When running the sample code, all I get is a timeout.

    To Reproduce (steps to reproduce the behavior): Run the sample code:

    from udemyscraper import UdemyCourse

    course = UdemyCourse()
    course.fetch_course('learn javascript')
    print(course.title)

    Current behavior: Timed out waiting for page to load or could not find a matching course

    OS: macOS

    bug duplicate 
    opened by nuggetsnetwork 3
  • Switch to browser explicit wait

    EXPERIMENTAL! Needs Testing.

    time.sleep() introduces an unnecessary wait, even if the page has already loaded.

    By using expected_conditions, we can proceed as soon as the element loads. Using the Python time library, I measured the time taken by the search and course pages to load to be approximately 2 seconds.

    In theory, the change should have reduced execution time by 5 seconds (3 + 4 - 2). However, the gain was only 3 seconds instead of the expected 5.

    This behavior seems unexpected for the moment, unless we can find where the missing 2 seconds are. For reference, the original version using time.sleep() took 17 seconds to execute.

    (All times were measured on my internet connection by executing the example given in the readme file.)

    This possibly needs further digging. I haven't yet had time to read the full code.

    bug optimization 
    opened by flyingcakes85 3
  • Use explicit wait for search query

    Here, 4 seconds have been hardcoded. It would be better to wait for the search results to load and then get the source code.

    A basic way to do this would be to check whether the search element is visible; once it's visible, the script can proceed to fetch the source code. This way, a really fast connection doesn't wait longer than needed, and a slow one isn't cut off too early.

    bug optimization 
    opened by sortedcord 3
  • Classes Frequently Keep Changing

    It seems that on the search page, the classes of the elements keep changing. So it would be best to only fetch the course url and then fetch all the other data from the course page itself.

    bug 
    opened by sortedcord 3
  • Serialize to xml

    Experimental!!

    Export the entire dictionary to an XML file using the dict2xml library.

    • [x] Make branch even with refactor base
    • [x] Switch to dicttoxml from dict2xml
    • [x] Object arrays of sections and lessons are not grouped under one root called Sections or Lessons. This is also the case for all of the other arrays.
    • [x] Rename List item
    • [x] Rename root tag to course
    enhancement area: module 
    opened by sortedcord 2
  • Automatically fetch drivers

    Set up a way to automatically fetch browser drivers based on the user's choice (chromium/firefox), matching the installed browser version.

    The hard part will be to find the version of the browser installed.

    enhancement help wanted 
    opened by sortedcord 2
  • Timed out waiting for page to load or could not find a matching course

    Whenever I try to scrape a course from udemy I get this error-

    on 1: Timed out waiting for page to load or could not find a matching course
    Scraping Course |████████████████████████████████████████| 1 in 29.5s (0.03/s)
    

    It worked a couple of times before, but now it doesn't.

    Steps to reproduce the behavior:

    1. This happens both when using the script and the module
    2. I used query argument
    3. Output- image

    Desktop (please complete the following information):

    • OS: Windows 10
    • Browser: Chromium
    • Version: 92

    I checked by manually opening Chromium and searching for the course, and that works. But when I use the scraper, it doesn't.

    bug good first issue wontfix area: module 
    opened by sortedcord 1
  • Optimize element search

    Some tests have shown that CSS selectors are far more efficient than find, especially compared with nested find calls, which tend to be much slower. It would be better to replace all of the find statements with select and use a direct path.

    optimization 
    opened by sortedcord 1
  • 🌐 Added browser selection argument

    Instead of editing the source code to select which browser you would like to use, you can now specify the same while initializing the UdemyCourse class or by simply using an argument when using the standalone way.

        -b  --browser       Allows you to select the browser you would like to use for Scraping
                            Values: "chrome" or "firefox". Defaults to chrome if no argument is passed.
    

    Also provided a geckodriver.exe binary.

    enhancement optimization 
    opened by sortedcord 1
  • Implementation of Command Line Arguments

    I assume that the main udemyScraper.py file will be used as a module, so instead I made another file main.py which can be used for such operations. As of now only some basic arguments have been added. Will add more in the future.

        -h  --help          Displays information about udemyscraper and its usage
        -v  --version       Displays the version of the tool
        -n  --no-warn       Disables the warning when initializing the UdemyCourse class
    
    enhancement 
    opened by sortedcord 1
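
The command line flags described in the last two items could be declared roughly as follows. This is only a sketch that uses argparse from the standard library; the project's own parsing code may be written differently, and the version string, program name, and help texts here are placeholders.

```python
import argparse


def build_parser():
    """Build a parser for the flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="udemyscraper",
        description="Fetch Udemy course information.",
    )
    # -h/--help is added automatically by argparse.
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s 0.0.0",  # placeholder, not the real version
    )
    parser.add_argument(
        "-n", "--no-warn",
        action="store_true",
        help="disable the warning shown when UdemyCourse is initialized",
    )
    parser.add_argument(
        "-b", "--browser",
        choices=["chrome", "firefox"],
        default="chrome",
        help="browser to use for scraping (defaults to chrome)",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-b", "firefox", "--no-warn"])
defaults = build_parser().parse_args([])
print(args.browser, args.no_warn, defaults.browser)
```

Restricting `--browser` with `choices` means an unsupported value is rejected by the parser itself, before any scraping starts.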
Releases(0.8.2)
  • 0.8.2(Oct 2, 2021)

  • Beta(Aug 29, 2021)

    The long-awaited (at least by me) distribution update for udemyscraper. Find this project on PyPI - https://pypi.org/project/udemyscraper/

    Added

    • Udemyscraper can now export multiple courses to csv files!
      • course_to_csv takes an array as an input and dumps each course to a single csv file.
    • Udemyscraper can now export courses to xml files!
      • course_to_xml is a function that can be used to export the course object to an xml file with the appropriate tags and format.
    • udemyscraper.export submodule for exporting scraped courses.
    • Support for Microsoft Edge (Chromium Based) browser.
    • Support for Brave Browser.
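
To illustrate what exporting several courses to CSV involves, here is a minimal sketch using the standard csv module. It is not the package's course_to_csv implementation: `courses_to_csv_text` is a hypothetical helper, the courses are plain dictionaries rather than UdemyCourse objects, and the second course is made-up placeholder data.

```python
import csv
import io


def courses_to_csv_text(courses, fields):
    """Write one CSV row per course dictionary and return the CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips keys that are not listed in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for course in courses:
        # Flatten list values such as "tags" into a single cell.
        row = {
            key: "; ".join(str(item) for item in value)
            if isinstance(value, list) else value
            for key, value in course.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


courses = [
    {
        "title": "The Complete Angular Course: Beginner to Advanced",
        "instructor": "Mosh Hamedani",
        "rating": "4.5",
        "tags": ["Development", "Web Development", "Angular"],
    },
    {
        # Placeholder entry, not real scraped data.
        "title": "Example Course",
        "instructor": "Example Instructor",
        "rating": "4.0",
        "tags": ["Example"],
    },
]

csv_text = courses_to_csv_text(courses, ["title", "instructor", "rating", "tags"])
print(csv_text)
```

Nested data such as sections and lessons does not fit naturally into one row, which is why this sketch only writes flat fields and joins simple lists into one cell.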

    Changes

    • Udemyscraper.py has been refactored into 5 different files:

      • __init__.py - Contains the code which will run when imported as a library
      • metadata.py - Contains metadata of the package such as the name, version, author, etc. Used by setup.py
      • output.py - Contains functions for outputting the course information.
      • udscraperscript.py - The script file that runs when you want to use udemyscraper as a script.
      • utils.py - Contains utility related functions for udemyscraper.
    • Now using udemyscraper.export instead of udemyscraper.output.

      • quick_display function has been replaced with print_course function.
    • Now using setup.py instead of setup.cfg

    • Deleted src folder which is now replaced by udemyscraper folder which is the source directory for all the modules

    • Installation Process

      Since udemyscraper is now meant to be used as a package, the installation process has changed significantly.

      Installation process is documented here

    • Renamed the browser_preference key in Preferences dictionary to browser

    • Relocated browser determination to utils as set_browser function.

    • Removed requirements.txt and pyproject.toml

    Fixed

    • Fixed cache argument bug.
    • Fixed importing preferences bug.
    • Fixed Banner Image scraping.
    • Fixed Progressbar exception handling.
    • Fixed recognition of chrome as a valid browser.
    • Preferences will not be printed while using the script.
    • Fixed browser key error
    udemyscraper-0.8.1-py3-none-any.whl(31.19 KB)
    udemyscraper-0.8.1.tar.gz(4.87 MB)
Owner
Aditya Gupta
🎓 Student🎨 Front end Dev & Part time weeb ϞϞ(๑⚈ ․̫ ⚈๑)∩
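jd_maotai rpa: a Selenium-driven RPA bot for JD flash-sale purchases

jd_maotai rpa: a Selenium-driven RPA bot for JD flash-sale purchases, made for scatterbrains like us so we no longer forget to join the sale. May everyone get to drink Moutai over the New Year. Special notice: the jd_maotai_rpa project published in this repository is defined as an automation RPA project, intended to prevent forgetting to take part in JD's Moutai event (since I often forget), and not for flash-sale sniping and grabbing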

35 Nov 18, 2022
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 07, 2022
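Highly available distributed IP proxy pool, powered by Scrapy and Redis

Highly Available IP Proxy Pool README | Chinese documentation. All IP resources collected by this project come from the internet; the vision is to provide large crawler projects with a highly available, low-latency, high-anonymity IP proxy pool. Project highlights: rich proxy sources; precise proxy crawling and extraction; strict and reasonable proxy validation; complete monitoring and strong robustness; flexible architecture that is easy to extend; distributed deployment of each component. Quick start: note, for the code please use release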

SpiderClub 5.2k Jan 03, 2023
Web scraper for getting price quotes on items

WebScrapper This web scraper is developed in Python 3.10.0 to search the Cyberpuerta website for items in the catalog. The program

Jordan Gaona 1 Oct 27, 2021
Web-Scrapper using Python and Flask

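Web-Scrapper "[Beginner] Building a Web Scraper with Python" course - NomadCoders. Starting with a basic Python syntax lecture, then scraping the desired content from a website's HTML file and printing it, saving it to a csv file, and a simple web page using flask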

윤성도 1 Nov 10, 2021
simple http & https proxy scraper and checker

simple http & https proxy scraper and checker

Neospace 11 Nov 15, 2021
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Documentation Documentation

Gerapy 2.9k Jan 03, 2023
Subscrape - A Python scraper for substrate chains

subscrape A Python scraper for substrate chains that uses Subscan. Usage copy co

ChaosDAO 14 Dec 15, 2022
Grab the changelog from releases on Github

release-notes-scraper This simple script can be used to grab the release notes for projects from github that do not keep a CHANGELOG, but publish thei

Dan Čermák 4 Apr 01, 2022
Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup.

WebScrapperRoBot Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup. Mark your Star ⭐ ⭐ What is Web Scraping ? Web s

Nuhman Pk 53 Dec 21, 2022
script to scrape direct download links (ddls) from google drive index.

bhadoo Google Personal/Shared Drive Index scraper. A small script to scrape direct download links (ddls) of downloadable files from bhadoo google driv

sαɴᴊɪᴛ sɪɴʜα 53 Dec 16, 2022
Scrape Twitter for Tweets

Backers Thank you to all our backers! 🙏 [Become a backer] Sponsors Support this project by becoming a sponsor. Your logo will show up here with a lin

Ahmet Taspinar 2.2k Jan 05, 2023
This is a sport analytics project that combines the knowledge of OOP and Webscraping

This is a sport analytics project that combines the knowledge of Object Oriented Programming (OOP) and Webscraping, the weekly scraping of the English Premier league table is carried out to assess th

Dolamu Oludare 1 Nov 26, 2021
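JD Moutai flash-sale purchase, latest version as of April 2021

Jd_Seckill Special notice: any scripts involved in the jd_seckill project published in this repository are for testing, learning, and research only; commercial use is prohibited. Their legality, accuracy, completeness, and effectiveness cannot be guaranteed; please judge for yourself according to the circumstances. All resource files in this project may not be reposted or published in any form by any WeChat official account or self-media outlet. For any script issues, huanghyw assumes no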

45 Dec 14, 2022
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Scrapy project 859 Dec 29, 2022
Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a bot

Aliexpress to telegram post Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a b

Fernando 6 Dec 06, 2022
Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Toxicity comments crawler Crawler job that scrapes comments from social media posts and saves them in a S3 bucket. Twitter Tweets and replies are scra

Douglas Trajano 2 Jan 24, 2022
A simple flask application to scrape gogoanime website.

gogoanime-api-flask A simple flask application to scrape gogoanime website. Used for demo and learning purposes only. How to use the API The base api

1 Oct 29, 2021
Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022
Scraping script for stats on covid19 pandemic status in Chiba prefecture, Japan

About: A script that converts Chiba Prefecture's detailed per-region infection statistics (Excel file) to CSV and outputs per-region daily infection totals. Requirement: POSIX-compatible shell, e.g. GNU Bash (1) curl (1) python = 3.8 pandas = 1.1.

Conv4Japan 1 Nov 29, 2021