Scrape Financial Data of Vietnamese Listed Companies - Version 2

A standalone package to scrape financial data from listed Vietnamese companies via Vietstock. If you are looking for raw financial data from listed Vietnamese companies, this may help you.

Prerequisites

A computer that can run Docker

The core components of this project run on Docker.

Cloning this project

You will have to build the image from source, because I have not released this project's image on Docker Hub yet.

A Vietstock user cookie string

How to get it:

  • Sign in to finance.vietstock.vn
  • Hover over "Corporate"/"Doanh nghiệp", and choose "Corporate A-Z"/"Doanh nghiệp A-Z"
  • Click on any ticker
  • Open your browser's developer tools by right-clicking on any empty area of the page and choosing Inspect
  • Go to the Network tab and filter for XHR requests only
  • On the page, click "Financials"/"Tài chính"
  • In the list of XHR requests, click on any request, then go to the Cookies tab underneath
  • Take note of the string in the vts_usr_lg cookie, which is your user cookie
  • Done
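
Before starting a scrape, it can help to check that the cookie string was actually supplied. The following is only a minimal sketch: the function name build_cookies is hypothetical, it assumes the string is supplied through a USER_COOKIE environment variable, and it assumes vts_usr_lg is the only cookie that matters, which the project itself may handle differently.

```python
import os


def build_cookies(env=os.environ) -> dict:
    """Return a cookie mapping built from the USER_COOKIE setting.

    Assumption: the value of USER_COOKIE is the string copied from the
    vts_usr_lg cookie in the browser, as described in the steps above.
    """
    cookie = env.get("USER_COOKIE", "").strip()
    # A value still starting with "<" is most likely an unfilled placeholder.
    if not cookie or cookie.startswith("<"):
        raise RuntimeError("USER_COOKIE is missing or still a placeholder")
    return {"vts_usr_lg": cookie}
```

A mapping of this shape is what most Python HTTP clients accept as a cookies argument.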

Some pointers about Vietstock financial API parameters, which will be used when scraping

Financial report types and their meanings:

Report type code    Meaning
CTKH                Financial targets/Chỉ Tiêu Kế Hoạch
CDKT                Balance sheet/Cân Đối Kế Toán
KQKD                Income statement/Kết Quả Kinh Doanh
LC                  Cash flow statement/Lưu Chuyển (Tiền Tệ)
CSTC                Financial ratios/Chỉ Số Tài Chính

Financial report terms and their meanings:

Report term code    Meaning
1                   Annually
2                   Quarterly
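
For use in your own scripts, the two code tables can be written down as plain Python mappings. This is a sketch only: the names REPORT_TYPES, REPORT_TERMS and describe are hypothetical and are not part of this project's code.

```python
# Codes copied from the two tables above; the names are illustrative only.
REPORT_TYPES = {
    "CTKH": "Financial targets",
    "CDKT": "Balance sheet",
    "KQKD": "Income statement",
    "LC": "Cash flow statement",
    "CSTC": "Financial ratios",
}
REPORT_TERMS = {"1": "Annually", "2": "Quarterly"}


def describe(report_type: str, report_term: str) -> str:
    """Return a readable label, raising ValueError on an unknown code."""
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type code: {report_type!r}")
    if report_term not in REPORT_TERMS:
        raise ValueError(f"unknown report term code: {report_term!r}")
    return f"{REPORT_TYPES[report_type]} ({REPORT_TERMS[report_term]})"
```

Validating codes like this before answering the prompts avoids starting a scrape with a mistyped report type.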

Noting the project folder

All core functions live in the functions_vietstock folder, and so do the scraped files. From here on, ./ refers to the functions_vietstock folder.

Run within Docker Compose (recommended)

Configuration

1. Add your Vietstock user cookie to docker-compose.yml

It should be in this area:

...
functions-vietstock:
    build: .
    container_name: functions-vietstock
    command: wait-for-it -s torproxy:8118 -s scraper-redis:6379 -t 600  -- bash
    environment: 
        - REDIS_HOST=scraper-redis
        - PROXY=yes
        - TORPROXY_HOST=torproxy
        - USER_COOKIE=<YOUR_VIETSTOCK_USER_COOKIE>
...

2. Specify whether you want to use a proxy

In the same config area as the user cookie above, remove the environment variables PROXY and TORPROXY_HOST to stop using the proxy. Please note that I have not tested this scraper without a proxy.

Build image and start related services

At the project folder, run:

docker-compose build --no-cache && docker-compose up -d

Next, open the scraper container in another terminal:

docker exec -it functions-vietstock ./userinput.sh

From here, follow the prompts of the userinput script.

Note: To stop the scraping, stop the userinput script terminal, then open another terminal and run:

docker exec -it functions-vietstock ./celery_stop.sh

to clean up everything related to the scraping process (locally scraped files are left intact).

Some questions require you to answer in a specific syntax, as follows:

  • Do you wish to scrape by a specific business type-industry or by tickers? [y for business type-industry/n for tickers]
    • If you enter y, the next prompt is: Enter business type ID and industry ID combination in the form of businesstype_id;industry_id:
      • If you chose to scrape a list of all business types-industries and their respective tickers, you should have the file bizType_ind_tickers.csv in the scrape result folder (./localData/overview).
      • Then you answer this prompt by entering a business type ID and industry ID combination in the form of businesstype_id;industry_id.
    • If you enter n, the next prompts ask for ticker(s), report type(s), report term(s) and page.
      • Again, suppose you have the bizType_ind_tickers.csv file
      • Then you answer the prompts as follows:
        • ticker: a ticker symbol or a list of ticker symbols of your choice. You can enter either ticker_1 or ticker_1,ticker_2

        • report_type and report_term: use the report type codes and report term codes in the following tables (which were already mentioned above). You can enter either report_type_1 or report_type_1,report_type_2. The same goes for report term.

          Report type code    Meaning
          CTKH                Financial targets/Chỉ Tiêu Kế Hoạch
          CDKT                Balance sheet/Cân Đối Kế Toán
          KQKD                Income statement/Kết Quả Kinh Doanh
          LC                  Cash flow statement/Lưu Chuyển (Tiền Tệ)
          CSTC                Financial ratios/Chỉ Số Tài Chính

          Report term code    Meaning
          1                   Annually
          2                   Quarterly
        • page: the page number for the scrape. This is optional; if omitted, the scraper starts from page 1
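
The answer syntax described above can be illustrated with a few small parsers. This is a sketch of the input format only, not the code that userinput.sh actually runs; all function names are hypothetical, and treating the IDs as integers is an assumption based on the sample CSV.

```python
from typing import List, Tuple


def parse_biztype_industry(answer: str) -> Tuple[int, int]:
    """Parse 'businesstype_id;industry_id', e.g. '3;1000'."""
    biz, sep, ind = answer.strip().partition(";")
    if not sep:
        raise ValueError("expected the form businesstype_id;industry_id")
    return int(biz), int(ind)


def parse_list(answer: str) -> List[str]:
    """Parse 'item_1' or 'item_1,item_2' into a list, ignoring blanks."""
    return [item.strip() for item in answer.split(",") if item.strip()]


def parse_page(answer: str) -> int:
    """An empty answer means 'start from page 1'."""
    answer = answer.strip()
    return int(answer) if answer else 1
```

The same parse_list helper covers tickers, report types and report terms, because all three prompts accept either a single value or a comma-separated list.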

Run on Host without Docker Compose

Maybe you do not want to spend time building the image, and just want to play around with the code.

Specify local environment variables

In the functions_vietstock folder, create a file named .env with the following content:

REDIS_HOST=localhost
PROXY=yes
TORPROXY_HOST=localhost
USER_COOKIE=<YOUR_VIETSTOCK_USER_COOKIE>
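
The .env file is a list of KEY=VALUE lines. How the project itself loads the file is not shown here, so the following is only a sketch of reading that format with the standard library; the name parse_env is hypothetical.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


SAMPLE_ENV = """\
REDIS_HOST=localhost
PROXY=yes
TORPROXY_HOST=localhost
USER_COOKIE=abc=123
"""
```

Splitting on the first "=" matters for the cookie string, which may itself contain "=" characters.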

Run Redis and Torproxy

You still need to run these inside containers:

docker run -d -p 6379:6379 --rm --name scraper-redis redis

docker run -d -p 8118:8118 -p 9050:9050 --rm --name torproxy --env TOR_NewCircuitPeriod=10 --env TOR_MaxCircuitDirtiness=60 dperson/torproxy

Clear all previous running files (if any)

Go to the functions_vietstock folder:

cd functions_vietstock

Run the celery_stop.sh script:

./celery_stop.sh

Use the userinput script to scrape

Use the ./userinput.sh script to scrape as in the previous section.

Scrape Results

CorporateAZ Overview

File location

If you chose to scrape a list of all business types, industries and their tickers, the result is stored in the ./localData/overview folder, under the file name bizType_ind_tickers.csv.

File preview (shortened)

ticker,biztype_id,bizType_title,ind_id,ind_name
BID,3,Bank,1000,Finance and Insurance
CTG,3,Bank,1000,Finance and Insurance
VCB,3,Bank,1000,Finance and Insurance
TCB,3,Bank,1000,Finance and Insurance
...
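
To pick a business type-industry combination, it can be convenient to group the tickers in this file. A minimal sketch with the standard csv module, assuming the column names shown in the preview; the function name tickers_by_group is hypothetical.

```python
import csv
import io
from collections import defaultdict


def tickers_by_group(csv_file) -> dict:
    """Map (biztype_id, ind_id) to the list of tickers in that group."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[(row["biztype_id"], row["ind_id"])].append(row["ticker"])
    return dict(groups)


# A two-row excerpt of the preview, used here instead of the real file.
SAMPLE_CSV = (
    "ticker,biztype_id,bizType_title,ind_id,ind_name\n"
    "BID,3,Bank,1000,Finance and Insurance\n"
    "CTG,3,Bank,1000,Finance and Insurance\n"
)
groups = tickers_by_group(io.StringIO(SAMPLE_CSV))
```

With the real file, pass open("./localData/overview/bizType_ind_tickers.csv", newline="", encoding="utf-8") instead of the in-memory sample. Note that csv returns the IDs as strings.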

FinanceInfo

File location

FinanceInfo results are stored in the ./localData/financeInfo folder. Each file name is in the form ticker_reportType_reportTermName_page.json and represents one ticker - report type - report term - page instance.
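
The naming pattern can be split back into its parts when post-processing results. This is a sketch under two assumptions: tickers and report type codes contain no underscore, and the example term name "Annual" is made up, because the actual reportTermName values are not listed here.

```python
from pathlib import Path
from typing import NamedTuple


class FinanceInfoFile(NamedTuple):
    ticker: str
    report_type: str
    report_term_name: str
    page: int


def parse_name(path: str) -> FinanceInfoFile:
    """Split ticker_reportType_reportTermName_page.json into its parts."""
    stem = Path(path).stem
    # The first two fields are split from the left and the page from the
    # right, so the term name may itself contain underscores.
    ticker, report_type, rest = stem.split("_", 2)
    term_name, page = rest.rsplit("_", 1)
    return FinanceInfoFile(ticker, report_type, term_name, int(page))


info = parse_name("./localData/financeInfo/BID_CDKT_Annual_1.json")
```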

File preview (shortened)

[
    [
        {
            "ID": 4,
            "Row": 4,
            "CompanyID": 2541,
            "YearPeriod": 2017,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201701",
            "PeriodEnd": "201712",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 3,
            "Row": 3,
            "CompanyID": 2541,
            "YearPeriod": 2018,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201801",
            "PeriodEnd": "201812",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 2,
            "Row": 2,
            "CompanyID": 2541,
            "YearPeriod": 2019,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "201901",
            "PeriodEnd": "201912",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        },
        {
            "ID": 1,
            "Row": 1,
            "CompanyID": 2541,
            "YearPeriod": 2020,
            "TermCode": "N",
            "TermName": "Năm",
            "TermNameEN": "Year",
            "ReportTermID": 1,
            "DisplayOrdering": 1,
            "United": "HN",
            "AuditedStatus": "KT",
            "PeriodBegin": "202001",
            "PeriodEnd": "202112",
            "TotalRow": 14,
            "BusinessType": 1,
            "ReportNote": null,
            "ReportNoteEn": null
        }
    ],
    {
        "Balance Sheet": [
            {
                "ID": 1,
                "ReportNormID": 2995,
                "Name": "TÀI SẢN ",
                "NameEn": "ASSETS",
                "NameMobile": "TÀI SẢN ",
                "NameMobileEn": "ASSETS",
                "CssStyle": "MaxB",
                "Padding": "Padding1",
                "ParentReportNormID": 2995,
                "ReportComponentName": "Cân đối kế toán",
                "ReportComponentNameEn": "Balance Sheet",
                "Unit": null,
                "UnitEn": null,
                "OrderType": null,
                "OrderingComponent": null,
                "RowNumber": null,
                "ReportComponentTypeID": null,
                "ChildTotal": 0,
                "Levels": 0,
                "Value1": null,
                "Value2": null,
                "Value3": null,
                "Value4": null,
                "Vl": null,
                "IsShowData": true
            },
            {
                "ID": 2,
                "ReportNormID": 3000,
                "Name": "A. TÀI SẢN NGẮN HẠN",
                "NameEn": "A. SHORT-TERM ASSETS",
                "NameMobile": "A. TÀI SẢN NGẮN HẠN",
                "NameMobileEn": "A. SHORT-TERM ASSETS",
                "CssStyle": "LargeB",
                "Padding": "Padding1",
                "ParentReportNormID": 2996,
                "ReportComponentName": "Cân đối kế toán",
                "ReportComponentNameEn": "Balance Sheet",
                "Unit": null,
                "UnitEn": null,
                "OrderType": null,
                "OrderingComponent": null,
                "RowNumber": null,
                "ReportComponentTypeID": null,
                "ChildTotal": 25,
                "Levels": 1,
                "Value1": 4496051.0,
                "Value2": 4971364.0,
                "Value3": 3989369.0,
                "Value4": 2142717.0,
                "Vl": null,
                "IsShowData": true
            },
...

Please note that you have to determine for yourself whether the order of the financial values (Value1 to Value4) matches the order of the periods in the first list.
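
One way to make the ordering question explicit is to pair periods and values in a helper with a switch for the direction. This sketch does not settle which direction is correct: whether Value1 belongs to the first or the last listed period is exactly the open assumption, and the name pair_values is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pair_values(
    periods: List[dict], row: dict, reverse: bool = False
) -> Dict[Tuple[int, str], Optional[float]]:
    """Pair Value1..ValueN of a report row with the listed periods.

    With reverse=False, Value1 is paired with the first listed period;
    with reverse=True, with the last. Check against a known figure to
    decide which one applies to your data.
    """
    ordered = list(reversed(periods)) if reverse else list(periods)
    return {
        (period["YearPeriod"], period["TermCode"]): row.get(f"Value{index}")
        for index, period in enumerate(ordered, start=1)
    }


# Trimmed copies of the objects in the preview above.
periods = [
    {"YearPeriod": 2017, "TermCode": "N"},
    {"YearPeriod": 2018, "TermCode": "N"},
    {"YearPeriod": 2019, "TermCode": "N"},
    {"YearPeriod": 2020, "TermCode": "N"},
]
row = {
    "NameEn": "A. SHORT-TERM ASSETS",
    "Value1": 4496051.0,
    "Value2": 4971364.0,
    "Value3": 3989369.0,
    "Value4": 2142717.0,
}
forward = pair_values(periods, row)
backward = pair_values(periods, row, reverse=True)
```

Comparing one paired figure with the same figure on the Vietstock website should show which of the two mappings is the right one.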

Logs

Logs are stored in the ./logs folder, in the form of scrapySpiderName_log_verbose.log.

Debugging and How This Thing Works

What is Torproxy?

Quick introduction

Torproxy is "Tor and Privoxy (web proxy configured to route through tor) docker container." See: https://github.com/dperson/torproxy. This project uses it to avoid being IP-banned for scraping too much.

Configuration used in this project

The only two configuration variables I used with Torproxy are TOR_MaxCircuitDirtiness and TOR_NewCircuitPeriod, which set the maximum Tor circuit age (in seconds) and the time period between attempts to change the Tor circuit (in seconds), respectively. TOR_MaxCircuitDirtiness is set to 60 seconds, and TOR_NewCircuitPeriod is set to 10 seconds.
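
On the client side, the PROXY and TORPROXY_HOST variables from the configuration sections translate into a proxy address. The sketch below shows one plausible way to build it; the function name proxy_settings is hypothetical, port 8118 is the port published in the commands earlier in this document, and the project's own code may construct the address differently.

```python
import os


def proxy_settings(env=os.environ, port: int = 8118) -> dict:
    """Return a proxies mapping, or an empty dict when PROXY is unset.

    Assumption: removing PROXY disables the proxy, as described in the
    configuration section, and port 8118 is the one published by the
    Torproxy container.
    """
    if env.get("PROXY", "").lower() != "yes":
        return {}
    host = env.get("TORPROXY_HOST", "localhost")
    url = f"http://{host}:{port}"
    return {"http": url, "https": url}
```

A mapping of this shape is accepted by, for example, urllib.request.ProxyHandler in the standard library.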

What is Redis?

"Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker." See: https://redis.io/. In this project, Redis serves as a message broker and an in-memory queue for Scrapy. No non-standard Redis configurations were made for this project.

Debugging

Redis

If the scraper runs in a Docker container:

To open an interactive shell with Redis, you have to enter the container first:

docker exec -it functions-vietstock bash

Then:

redis-cli -h scraper-redis

If the scraper runs on the host:

To open an interactive shell with Redis:

docker exec -it scraper-redis redis-cli

Celery

Look inside each log file.

How This Scraper Works

This scraper uses scrapy-redis and Redis to crawl and scrape tickers' information with a top-down approach: it goes from business types, to industries, to the tickers in each business type-industry combination, passing the necessary information into Redis queues for different Spiders to consume. The scraper also uses Torproxy to avoid IP bans.
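
The top-down flow can be pictured with in-memory queues. The sketch below is not the project's code and does not use scrapy-redis: collections.deque stands in for the Redis queues, the catalogue is a made-up stand-in for what the overview crawl would discover, and all names are hypothetical.

```python
from collections import deque

# Hypothetical discovery result: (biztype_id, ind_id) -> tickers.
CATALOGUE = {("3", "1000"): ["BID", "CTG"]}


def crawl(catalogue: dict, report_types: list, report_terms: list) -> list:
    """Walk groups, then tickers, and emit one job per report combination."""
    group_queue = deque(catalogue)  # stage 1: business type-industry pairs
    ticker_queue = deque()          # stage 2: tickers found in each pair
    jobs = []                       # stage 3: work for the finance spider

    while group_queue:
        group = group_queue.popleft()
        ticker_queue.extend(catalogue[group])

    while ticker_queue:
        ticker = ticker_queue.popleft()
        for report_type in report_types:
            for report_term in report_terms:
                # Every combination starts at page 1.
                jobs.append((ticker, report_type, report_term, 1))
    return jobs


jobs = crawl(CATALOGUE, ["CDKT"], ["1", "2"])
```

In the real setup each stage is a separate Spider and the queues live in Redis, so producers and consumers can run independently of each other.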

Limitations and Lessons Learned

Limitations

  • When talking about a crawler/scraper, one must consider speed, among other things. That said, I haven't run a benchmark for this scraper project.
    • There are about 3000 tickers on the market, each with its own set of available report types, report terms and pages.
    • Scraping all historical financials of all those 3000 tickers will, I believe, be pretty slow, because we have to use Torproxy and there can be many pages for a ticker-report type-report term combination.
    • Scrape results are written to disk, which is also a bottleneck if you want to mass-scrape. This matters much less if you only scrape one or two tickers.
    • To mass-scrape, a distributed scraping architecture is desirable, not only for speed, but also for anonymity (not entirely if you use the same user cookie across machines). However, one should respect the API service provider (i.e., Vietstock) and avoid bombarding them with tons of requests in a short period of time.
  • Possibility of being banned on Vietstock? Yes.
    • Each request carries your unique Vietstock user cookie, and thus you are identifiable when making each request.
    • As of now (May 2021), I still don't know how many concurrent requests the Vietstock server can handle at any given point. While this API is publicly open, it is not documented by Vietstock. Because of this, I recently added a throttling feature to the financeInfo Spider to avoid bombarding Vietstock's server. See financeInfo's configuration file.
  • Constantly changing the Tor circuit may be harmful to the Tor network.
    • According to Tor Metrics, the number of exit nodes is below 2000. By changing circuits as we scrape, we will eventually expose almost all of these available exit nodes to the Vietstock server, which in turn undermines the purpose of avoiding a ban.
    • In addition, in an unlikely circumstance, interested users who want to use Tor network to view a Vietstock page may not be able to do so, because the exit node may have been banned.
  • Scrape results are as-is and not processed.
    • As mentioned, scrape results are currently stored on disk as JSONs, and a unified format for financial statements has not been produced. Thus, to fully integrate this scraping process with an analysis project, you must do a lot of data standardization.
  • There is no user-friendly interface to monitor the Redis queues, and I haven't looked much into this.
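
For the throttling point above, Scrapy has built-in settings for limiting request rates. The values below are purely illustrative and are not taken from financeInfo's configuration file; the setting names themselves are standard Scrapy settings, and the dictionary name THROTTLE_SETTINGS is hypothetical.

```python
# Illustrative values only; tune them to what the server tolerates.
THROTTLE_SETTINGS = {
    # Let Scrapy adapt the delay to the observed response latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 1.0,   # initial delay in seconds
    "AUTOTHROTTLE_MAX_DELAY": 30.0,    # upper bound under high latency
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Hard limits that apply regardless of AutoThrottle.
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
}
```

A dictionary like this can be assigned to a Spider's custom_settings attribute or merged into the project settings module.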

Lessons learned

  • Utilizing Redis creates a nice and smooth workflow for mass scraping data, provided that the paths to data can be logically determined (e.g., in the form of pagination).
  • Using proxies cannot offer the best anonymity while scraping, because you have to use a user cookie to have access to data anyway.
  • Packing inter-dependent services with Docker Compose helps create a cleaner and more professional-looking code base.

Disclaimer

  • This project is completed for educational and non-commercial purposes only.
  • The scrape results are as-is from Vietstock API and without any modification. Thus, you are responsible for your own use of the data scraped using this project.
  • Vietstock has every right to modify or remove access to the API used in this project in their own way, without any notice. I am not responsible for promptly updating access to their API, nor for any consequences to your use of this project resulting from such a change.
Owner
Viet Anh (Vincent) Tran