thailand-budget-pdf2csv

Let's create a tool to convert Thailand's government budget documents from PDF to CSV!

Developers unite to convert the budget from PDF to a machine-readable format

To make auditing the national budget easier

Usage

PDF -> TXT

The results and source code for each approach are under the ./txt-extraction folder. You can also download the output files directly from the shortcut links below:

tee4cute-gcloud-vision: Google Drive folder.

TXT -> CSV

The results and source code for each approach are under the ./csv-extraction folder. You can also download the output files directly from the shortcut links below:

napatswift-coordintes: Google Drive folder.

Translations

English version

napatswift-coordintes (partially translated using Google Translation API): Google Sheet, see @asiripanich's repo for code.

Let's Code!

Download the source budget PDF files from budget-pdf (เล่มขาวคาดแดง, the white-cover budget books with a red band) and work some secret magic to generate output CSV files in the expected format below:

Expected Output Format (V2)

Field Name	Formal Thai Name	Data Type / Format	Description	Since Version
`ITEM_ID`	-	str / [`REF_DOC`].[RUNNING_NO]	Unique ID of each row. `REF_DOC`: see the `REF_DOC` field. RUNNING_NO: the running number of the row within that budget book (PDF) file.	v1
`REF_DOC`	-	str / [FY].[ISSUE].[VOLUME]	Document number of the budget book (PDF). [FY] = fiscal year of the budget book, [ISSUE] = issue number (ฉบับที่), [VOLUME] = volume number (เล่มที่). Some volumes also have a parenthesized suffix.	v1
`REF_PAGE_NO`	-	int	Page number of the budget book document, as shown in the page header, for that row. (Caution: in almost every case the document page number is not the PDF page number.)	v1
`MINISTRY`	กระทรวง/หน่วยงานเทียบเท่ากระทรวง	str		v1
`BUDGETARY_UNIT`	หน่วยรับงบประมาณ	str	Mostly departments or department-equivalent agencies.	v1
`CROSS_FUNC?`		bool	Whether the row (budget item) falls under an integrated plan. An integrated plan is a plan whose name begins with "แผนงานบูรณาการ". See: `BUDGET_PLAN`	v1
`BUDGET_PLAN`	แผนงาน	str	Plan name as defined under the Budget Procedures Act (พ.ร.บ.วิธีการงบประมาณฯ).	v1
`OUTPUT`	ผลผลิต	str	A plan contains `0-n` outputs/projects. Each row falls under either 1 output `XOR` 1 project.	v1
`PROJECT`	โครงการ	str	A plan contains `0-n` outputs/projects. Each row falls under either 1 output `XOR` 1 project.	v1
`CATEGORY_LV1`	งบรายจ่าย	str	Expenditure category `level-1`. Only these values occur: งบบุคลากร (personnel), งบดำเนินงาน (operations), งบลงทุน (investment), งบเงินอุดหนุน (subsidies), งบรายจ่ายอื่น (other expenditures). The exception is "งบกลาง" (central fund), which may contain other items.	v1
`CATEGORY_LV2`	งบรายจ่าย	str	Expenditure category `level-2`. In the PDF it appears in line items prefixed with an ordered-list number in the format `x.y.z`.	v1
`CATEGORY_LV3`	งบรายจ่าย	str	Expenditure category `level-3`. In the PDF it appears in line items prefixed with an ordered-list number in the format `x.y.z`.	v1
`CATEGORY_LV4`	งบรายจ่าย	str	Expenditure category `level-4`. In the PDF it appears in line items prefixed with an ordered-list number in the format `x.y.z`.	v1
`CATEGORY_LV5`	งบรายจ่าย	str	Expenditure category `level-5`. In the PDF it appears in line items prefixed with an ordered-list number in the format `x.y.z`.	v1
`CATEGORY_LV6`	งบรายจ่าย	str	Expenditure category `level-6`. In the PDF it appears in line items prefixed with an ordered-list number in the format `x.y.z`.	v1
`ITEM_DESCRIPTION`	-	str	Item name. In the PDF it appears in line items prefixed with an ordered-list number in the format `(x)`. Some rows may have no `ITEM_DESCRIPTION`.	v1
`FISCAL_YEAR`	ปีงบประมาณ	str / year in C.E.	A single line item may produce several rows if it is a multi-year obligated budget item (งบผูกพัน).	v1
`AMOUNT`	-	float	Budget amount.	v1
`OBLIGED?`	-	bool	TRUE if and only if the line item has rows for more than one `FISCAL_YEAR`.	v1
`DEBUG_LOG`	-	str	Log message reporting an error that occurred while extracting that row.	v2

Note: Please see output example in output_example_vx.xlsx and output_example_vx.csv at repository root.
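
As a rough illustration of how a few of the field rules above fit together, here is a minimal Python sketch. The helper names (`make_item_id`, `is_cross_func`, `check_row`) and the sample values are hypothetical and not part of this repository. The table does not say whether a row may have neither `OUTPUT` nor `PROJECT`, so the sketch only flags rows where both are set.

```python
from typing import Dict, List

# Prefix that marks an integrated plan, as described for CROSS_FUNC?.
INTEGRATED_PLAN_PREFIX = "แผนงานบูรณาการ"


def make_item_id(ref_doc: str, running_no: int) -> str:
    """Build ITEM_ID in the [REF_DOC].[RUNNING_NO] format."""
    return f"{ref_doc}.{running_no}"


def is_cross_func(budget_plan: str) -> bool:
    """CROSS_FUNC? is true when the plan name starts with the prefix."""
    return budget_plan.strip().startswith(INTEGRATED_PLAN_PREFIX)


def check_row(row: Dict[str, object]) -> List[str]:
    """Return a list of problems found in one output row (hypothetical helper)."""
    problems = []

    # ITEM_ID must be REF_DOC followed by "." and a running number.
    ref_doc = str(row.get("REF_DOC", ""))
    item_id = str(row.get("ITEM_ID", ""))
    prefix = ref_doc + "."
    running_no = item_id[len(prefix):]
    if not (item_id.startswith(prefix) and running_no.isdigit()):
        problems.append("ITEM_ID is not [REF_DOC].[RUNNING_NO]")

    # A row sits under an output XOR a project, so both at once is an error.
    if row.get("OUTPUT") and row.get("PROJECT"):
        problems.append("OUTPUT and PROJECT are both set")

    # CROSS_FUNC? should agree with the BUDGET_PLAN name.
    expected = is_cross_func(str(row.get("BUDGET_PLAN", "")))
    if bool(row.get("CROSS_FUNC?")) != expected:
        problems.append("CROSS_FUNC? does not match BUDGET_PLAN")

    return problems
```

Calling `check_row` on each row of a generated CSV (for example via `csv.DictReader`, after converting the `CROSS_FUNC?` text to a bool) gives a quick sanity check before publishing an output file.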

Release Notes

29 Jul 2021

Send messages to DEBUG_LOG to clearly tell the user where an error originated: Syntactic Error or OCR Error.
- Invalid CATEGORY_LV1 values will be reported in DEBUG_LOG as follows: "CATEGORY_LV1 is not as described". issue#15-comment
- Invalid AMOUNT values will be reported in DEBUG_LOG as follows: "AMOUNT FORMAT IS WRONG".
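
The two checks above could look roughly like the following Python sketch. The two message strings are the ones quoted in this release note. Everything else is an assumption for illustration: the function name `debug_log_for`, the amount pattern (digits with optional thousands separators and decimals), and joining several messages with "; ". The "งบกลาง" exception for `CATEGORY_LV1` is not modeled here.

```python
import re

# Level-1 categories listed in the Expected Output Format table.
ALLOWED_CATEGORY_LV1 = {
    "งบบุคลากร",
    "งบดำเนินงาน",
    "งบลงทุน",
    "งบเงินอุดหนุน",
    "งบรายจ่ายอื่น",
}

# Assumed amount format: "30,000,000", "30000000" or "1,234.50".
AMOUNT_PATTERN = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def debug_log_for(category_lv1: str, amount_text: str) -> str:
    """Return the DEBUG_LOG text for one row, or "" when both checks pass."""
    messages = []
    if category_lv1 not in ALLOWED_CATEGORY_LV1:
        messages.append("CATEGORY_LV1 is not as described")
    if AMOUNT_PATTERN.fullmatch(amount_text.strip()) is None:
        messages.append("AMOUNT FORMAT IS WRONG")
    return "; ".join(messages)
```

An OCR slip such as the letter "O" in place of a zero in the amount fails the pattern and is reported, rather than being silently parsed into a wrong number.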

25 Jul 2021

Fix some of the Syntactic Errors reported in issue#15.
Fix the Compiler Error that produced wrong AMOUNT output for obliged items written in the "XXXX - YYYY ZZZZ บาท" format.
- For example, if the obliged entry is written as "2562 - 2564 30,000,000 บาท", the output will be:
```
  2562    10,000,000
  2563    10,000,000
  2564    10,000,000
```
  instead of
```
  2562    30,000,000
  2563    30,000,000
  2564    30,000,000
```
Send OCR Errors reported in issue#11 to DEBUG_LOG to make it clear that the error originated from the OCR tool and needs to be cleaned by hand.
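
The obliged-entry example in this release note can be reproduced with a short Python sketch. The function name `split_obliged_entry` and the regular expression are hypothetical. The even split across years is inferred from the single example above (30,000,000 over three years giving 10,000,000 each); the actual rule in the repository may differ.

```python
import re
from typing import List, Tuple

# Matches entries such as "2562 - 2564 30,000,000 บาท".
OBLIGED_PATTERN = re.compile(
    r"\s*(\d{4})\s*-\s*(\d{4})\s+([\d,]+(?:\.\d+)?)\s*บาท\s*"
)


def split_obliged_entry(text: str) -> List[Tuple[int, float]]:
    """Split a multi-year obliged entry into one (year, amount) pair per year.

    Assumption: the total is divided evenly over the years in the range.
    """
    match = OBLIGED_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an obliged entry: {text!r}")

    start_year = int(match.group(1))
    end_year = int(match.group(2))
    if end_year < start_year:
        raise ValueError(f"year range is reversed: {text!r}")

    # Drop thousands separators before converting to float (AMOUNT is float).
    total = float(match.group(3).replace(",", ""))
    years = range(start_year, end_year + 1)
    share = total / len(years)
    return [(year, share) for year in years]
```

Each returned pair would become one output row with its own `FISCAL_YEAR` and `AMOUNT`, and with `OBLIGED?` set to TRUE because the line item spans more than one year.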

21 Jul 2021

First version release
You can download the first version in CSV format here.

Powered by This Dataset

Budget Overview by korlan rayong

https://public.tableau.com/app/profile/korlan.rayong2953/viz/OverviewBudget65/Dashboard1

2022 Thai Budget Structure by Thanawit Prasongpongchai

Visualization: https://taepras.github.io/thaibudget65
Repository: https://github.com/taepras/thaibudget65

Talk

"ก้าวGeek Community", Line Group: http://line.me/ti/g/STUxfMX87U


Owner

Kao.Geek
