A new version of the CIDACS-RL linkage tool suitable for a cluster computing environment.

Overview

Fully Distributed CIDACS-RL

CIDACS-RL is a Brazilian record linkage tool suitable for integrating large amounts of data with high accuracy. However, its original implementation relies on an Elasticsearch cluster to distribute the queries and on a single node to perform them through the Python multiprocessing library. This implementation of the CIDACS-RL tool can be deployed on a Spark cluster, using all resources available to the Jupyter kernel while still relying on the Elasticsearch cluster, making it a fully distributed, cluster-based solution. It can outperform the legacy version of CIDACS-RL in both multi-node and single-node Spark environments.

config.json

Almost every aspect of the linkage process can be controlled through the config.json file.

| Section | Sub-section | Field (datatype) | Field description |
|---|---|---|---|
| General info | | index_data (str<'yes', 'no'>) | Flags whether the linkage process includes indexing a data set into Elasticsearch. Constraints: string, either "yes" or "no". |
| General info | | es_index_name (str<ES_VALID_INDEX>) | The name of an existing Elasticsearch index (if index_data is 'no') or of a new one (if index_data is 'yes'). Constraints: string, Elasticsearch-valid. |
| General info | | es_connect_string (str<ES_URL:ES_PORT>) | Elasticsearch API address. Constraints: string, URL format. |
| General info | | query_size (int) | Number of candidates returned by each Elasticsearch query. Constraints: int. |
| General info | | cutoff_exact_match (str<0:1 number>) | Cutoff point to determine whether a pair is an exact match or not. Constraints: str, number between 0 and 1. |
| General info | | null_value (str) | Value used to replace missings on both data sets involved. Constraints: string. |
| General info | | temp_dir (str) | Directory used to write checkpoints for the exact match and non-exact match phases. Constraints: string, fully qualified path. |
| General info | | debug (str<'true', 'false'>) | If set to "true", all records found in the exact match phase will be queried again in the non-exact match phase. |
| Datasets info | Indexed dataset | path (str) | Path to the CSV or Parquet folder of the dataset to index. |
| Datasets info | Indexed dataset | extension (str<'csv', 'parquet'>) | Determines how Spark reads the data. |
| Datasets info | Indexed dataset | columns (list) | Python list with the column names involved in the linkage. |
| Datasets info | Indexed dataset | id_column_name (str) | Name of the id column. |
| Datasets info | Indexed dataset | storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>) | Directive for memory allocation on Spark. |
| Datasets info | Indexed dataset | default_paralelism (str<4*N_OF_AVAILABLE_CORES>) | Number of partitions of a given Spark dataframe (typically 4x the number of available cores). |
| Datasets info | tolink dataset | path (str) | Path to the CSV or Parquet folder of the dataset to link. |
| Datasets info | tolink dataset | extension (str<'csv', 'parquet'>) | Determines how Spark reads the data. |
| Datasets info | tolink dataset | columns (list) | Python list with the column names involved in the linkage. |
| Datasets info | tolink dataset | id_column_name (str) | Name of the id column. |
| Datasets info | tolink dataset | storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>) | Directive for memory allocation on Spark. |
| Datasets info | tolink dataset | default_paralelism (str<4*N_OF_AVAILABLE_CORES>) | Number of partitions of a given Spark dataframe (typically 4x the number of available cores). |
| Datasets info | result dataset | path (str) | Path where the resulting dataset will be written. |
| Comparisons | label1 | indexed_col (str) | Name of the first column to be compared on the indexed dataset. |
| Comparisons | label1 | tolink_col (str) | Name of the first column to be compared on the tolink dataset. |
| Comparisons | label1 | must_match (str<'true', 'false'>) | Whether this pair of columns is included in the exact match phase. |
| Comparisons | label1 | should_match (str<'true', 'false'>) | Whether this pair of columns is included in the non-exact match phase. |
| Comparisons | label1 | is_fuzzy (str<'true', 'false'>) | Whether this pair of columns is included in fuzzy queries for the non-exact match phase. |
| Comparisons | label1 | boost (str) | Boost/weight of this pair of columns in queries. |
| Comparisons | label1 | query_type (str<'match', 'term'>) | Type of matching for this pair of columns in the non-exact match phase. |
| Comparisons | label1 | similarity (str<'jaro_winkler', 'overlap', 'hamming'>) | Similarity to be calculated between the values of this pair of columns. |
| Comparisons | label1 | weight (str) | Weight of this pair of columns. |
| Comparisons | label1 | penalty (str) | Penalty applied to the overall similarity in case of missing value(s). |
| Comparisons | label2 | ... | ... |
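
The weight and penalty fields above determine the final pair score. The tool's actual scoring code is not reproduced here, but as a rough, hypothetical sketch of how per-column similarities, weights, and penalties of this kind are commonly combined (all names below are illustrative, not the tool's API):

# Hypothetical sketch only -- not the tool's actual scoring code.
# Combines per-column similarities into a weighted average, then subtracts
# each column's penalty when one of its values is missing.
def overall_similarity(comparisons, null_value='99'):
    weighted_sum = sum(c['score'] * c['weight'] for c in comparisons)
    total_weight = sum(c['weight'] for c in comparisons)
    score = weighted_sum / total_weight
    for c in comparisons:
        if null_value in (c['value_a'], c['value_b']):
            score -= c['penalty']
    return max(score, 0.0)

pairs = [
    {'score': 0.97, 'weight': 5.0, 'penalty': 0.02, 'value_a': 'MARIA', 'value_b': 'MARIA'},
    {'score': 1.0, 'weight': 3.0, 'penalty': 0.02, 'value_a': '1', 'value_b': '1'},
]
print(overall_similarity(pairs))  # 0.98125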

config.json example


{
  "index_data": "no",
  "es_index_name": "fd-cidacs-rl",
  "es_connect_string": "http://localhost:9200",
  "query_size": 100,
  "cutoff_exact_match": "0.95",
  "null_value": "99",
  "temp_dir": "../../../0_global_data/fd-cidacs-rl/temp_dataframe/",
  "debug": "false",

  "datasets_info": {
    "indexed_dataset": {
      "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-dataset-A.parquet",
      "extension": "parquet",
      "columns": ["id_cidacs_a", "nome_a", "nome_mae_a", "dt_nasc_a", "sexo_a"],
      "id_column_name": "id_cidacs_a",
      "storage_level": "MEMORY_ONLY",
      "default_paralelism": "16"
    },
    "tolink_dataset": {
      "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-datasets-b/sinthetic-datasets-b-500000.parquet",
      "extension": "parquet",
      "columns": ["id_cidacs_b", "nome_b", "nome_mae_b", "dt_nasc_b", "sexo_b"],
      "id_column_name": "id_cidacs_b",
      "storage_level": "MEMORY_ONLY",
      "default_paralelism": "16"
    },
    "result_dataset": {
      "path": "../0_global_data/result/500000/"
    }
  },

  "comparisons": {
    "name": {
      "indexed_col": "nome_a",
      "tolink_col": "nome_b",
      "must_match": "true",
      "should_match": "true",
      "is_fuzzy": "true",
      "boost": "3.0",
      "query_type": "match",
      "similarity": "jaro_winkler",
      "weight": 5.0,
      "penalty": 0.02
    },
    "mothers_name": {
      "indexed_col": "nome_mae_a",
      "tolink_col": "nome_mae_b",
      "must_match": "true",
      "should_match": "true",
      "is_fuzzy": "true",
      "boost": "2.0",
      "query_type": "match",
      "similarity": "jaro_winkler",
      "weight": 5.0,
      "penalty": 0.02
    },
    "birthdate": {
      "indexed_col": "dt_nasc_a",
      "tolink_col": "dt_nasc_b",
      "must_match": "false",
      "should_match": "true",
      "is_fuzzy": "false",
      "boost": "",
      "query_type": "term",
      "similarity": "hamming",
      "weight": 1.0,
      "penalty": 0.02
    },
    "sex": {
      "indexed_col": "sexo_a",
      "tolink_col": "sexo_b",
      "must_match": "true",
      "should_match": "true",
      "is_fuzzy": "false",
      "boost": "",
      "query_type": "term",
      "similarity": "overlap",
      "weight": 3.0,
      "penalty": 0.02
    }
  }
}
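
A config like the one above can be loaded and sanity-checked with plain Python before the linkage starts; a minimal sketch (the checks are illustrative, not part of the tool):

import json

# Load the linkage configuration and run a few quick sanity checks.
with open('config.json') as f:
    config = json.load(f)

assert config['index_data'] in ('yes', 'no')
assert 0.0 <= float(config['cutoff_exact_match']) <= 1.0

# Every comparison must name a column on each side of the linkage.
for label, comparison in config['comparisons'].items():
    assert 'indexed_col' in comparison and 'tolink_col' in comparison, label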

Running in a Standalone Spark Cluster

Read more:

  • https://github.com/elastic/elasticsearch-hadoop
  • https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
  • https://search.maven.org/artifact/org.elasticsearch/elasticsearch-spark-30_2.12

If you intend to run this tool in a single-node Spark environment, consider including this in your spark-submit or spark-shell command line:


pyspark --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes="localhost" --conf spark.es.port="9200"
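
Once the connector package is on the classpath, the indexed data can be read back from Elasticsearch as a Spark dataframe. A minimal sketch, assuming a local single-node Elasticsearch and the fd-cidacs-rl index from the example config:

from pyspark.sql import SparkSession

# Build a session pointing the Elasticsearch connector at the local node;
# in a notebook, the kernel's PYSPARK_SUBMIT_ARGS can supply --packages instead.
spark = (SparkSession.builder
         .appName('fd-cidacs-rl')
         .config('spark.es.nodes', 'localhost')
         .config('spark.es.port', '9200')
         .getOrCreate())

# Read the indexed dataset back as a dataframe to verify the index.
df = spark.read.format('org.elasticsearch.spark.sql').load('fd-cidacs-rl')
df.printSchema()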

If you are running in a Spark cluster under JupyterHub kernels, try adding this kernel or editing an existing one:


{
  "display_name": "Spark3.3",
  "language": "python",
  "argv": [
    "/opt/bigdata/anaconda3/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/bigdata/spark",
    "PYTHONPATH": "/opt/bigdata/spark/python:/opt/bigdata/spark/python/lib/py4j-0.10.9.2-src.zip",
    "PYTHONSTARTUP": "/opt/bigdata/spark/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/bigdata/anaconda3/bin/python",
    "PYSPARK_SUBMIT_ARGS": "--master spark://node1.sparkcluster:7077 --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes=node1,node2 --conf spark.es.port=9200 pyspark-shell"
  }
}

Some advice for indexed data and queries

  • Every column should be cast to string (df.withColumn('column', F.col('column').cast('string'))); see the sketch after this list.
  • Date-type columns will not be properly indexed as strings unless a preprocessing step transforms them from yyyy-MM-dd to yyyyMMdd.
  • All the nodes of the Elasticsearch cluster must be included in the spark.es.nodes configuration.
  • Term queries are a good fit for well-structured variables such as CPF, dates, CNPJ, etc.
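
A minimal preprocessing sketch covering the first two points (dt_nasc_a stands in for any date column):

from pyspark.sql import functions as F

# Cast every linkage column to string before indexing.
for column in df.columns:
    df = df.withColumn(column, F.col(column).cast('string'))

# Transform a date column from yyyy-MM-dd to yyyyMMdd so it indexes properly.
df = df.withColumn(
    'dt_nasc_a',
    F.date_format(F.to_date(F.col('dt_nasc_a'), 'yyyy-MM-dd'), 'yyyyMMdd'))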