Official source for Spanish language models and resources made at BSC-TeMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Dec 20, 2022

Overview

Spanish Language Models 💃🏻

Corpora 📃

Corpus	Number of documents	Size (GB)
BNE	201,080,084	570

Models 🤖

RoBERTa-base BNE: https://huggingface.co/BSC-TeMU/roberta-base-bne
RoBERTa-large BNE: https://huggingface.co/BSC-TeMU/roberta-large-bne
Other models: (WIP)

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:

CBOW Word embeddings: https://zenodo.org/record/5044988
Skip-gram Word embeddings: https://zenodo.org/record/5046525
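
As a rough illustration of how such embeddings can be used once downloaded, the sketch below reads vectors from a text-format `.vec` stream and compares words by cosine similarity. It assumes the archive contains a `.vec` file in the usual word2vec/FastText text layout (a `<vocab_size> <dimension>` header followed by one word per line); the Zenodo records may instead ship a binary `.bin` file, which this sketch does not read. The names `load_vec` and `cosine` and the three-dimensional sample vectors are hypothetical and exist only for the example.

```python
import io
import math
from typing import Dict, List, TextIO


def load_vec(handle: TextIO, limit: int = 50000) -> Dict[str, List[float]]:
    """Read up to `limit` vectors from a word2vec/FastText text (.vec) stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header line is "<vocab_size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for a real file (3 dimensions instead of 300).
sample = io.StringIO(
    "3 3\n"
    "rey 1.0 0.0 1.0\n"
    "reina 1.0 0.1 0.9\n"
    "mesa 0.0 1.0 0.0\n"
)
vecs = load_vec(sample)
print(cosine(vecs["rey"], vecs["reina"]) > cosine(vecs["rey"], vecs["mesa"]))  # prints True
```

With a real file, `load_vec(open(path, encoding="utf-8"))` would replace the in-memory sample; the `limit` argument keeps memory use bounded for large vocabularies.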

Evaluation ✅

Dataset	Metric	RoBERTa-b	RoBERTa-l	BETO	mBERT	BERTIN
UD-POS	F1	0.9907	0.9901	0.9900	0.9886	0.9904
Conll-NER	F1	0.8851	0.8772	0.8759	0.8691	0.8627
Capitel-POS	F1	0.9846	0.9851	0.9836	0.9839	0.9826
Capitel-NER	F1	0.8959	0.8998	0.8771	0.8810	0.8741
STS	Combined	0.8423	0.8420	0.8216	0.8249	0.7822
MLDoc	Accuracy	0.9595	0.9600	0.9650	0.9560	0.9673
PAWS-X	F1	0.9035	0.9000	0.8915	0.9020	0.8820
XNLI	Accuracy	0.8016	WiP	0.8130	0.7876	WiP
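
To read the table at a glance, the following sketch picks the highest-scoring model for each dataset. The scores are copied from the table above; the helper name `best_per_dataset` is hypothetical, and entries marked `WiP` are simply skipped.

```python
from typing import Dict

# Scores copied from the evaluation table; "WiP" marks results not yet available.
TABLE = """
Dataset Metric RoBERTa-b RoBERTa-l BETO mBERT BERTIN
UD-POS F1 0.9907 0.9901 0.9900 0.9886 0.9904
Conll-NER F1 0.8851 0.8772 0.8759 0.8691 0.8627
Capitel-POS F1 0.9846 0.9851 0.9836 0.9839 0.9826
Capitel-NER F1 0.8959 0.8998 0.8771 0.8810 0.8741
STS Combined 0.8423 0.8420 0.8216 0.8249 0.7822
MLDoc Accuracy 0.9595 0.9600 0.9650 0.9560 0.9673
PAWS-X F1 0.9035 0.9000 0.8915 0.9020 0.8820
XNLI Accuracy 0.8016 WiP 0.8130 0.7876 WiP
"""


def best_per_dataset(table: str) -> Dict[str, str]:
    """Map each dataset name to the model with the highest reported score."""
    rows = [line.split() for line in table.strip().splitlines()]
    models = rows[0][2:]  # skip the "Dataset" and "Metric" columns
    best: Dict[str, str] = {}
    for dataset, _metric, *scores in rows[1:]:
        available = [(float(s), m) for s, m in zip(scores, models) if s != "WiP"]
        best[dataset] = max(available)[1]
    return best


for dataset, model in best_per_dataset(TABLE).items():
    print(f"{dataset}: {model}")
```

Several of the gaps are in the third or fourth decimal place, so this ranking is only a reading aid and says nothing about statistical significance.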

Usage example ⚗️

For RoBERTa-base:

from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-bne')
model.eval()  # inference mode: disables dropout

# Ask the model to fill in the <mask> token and print the candidate tokens.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For RoBERTa-large:

from pprint import pprint
from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-large-bne')
model.eval()  # inference mode: disables dropout

# Ask the model to fill in the <mask> token and print the candidate tokens.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
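
The examples above print only the predicted tokens, but each prediction also carries a score. As a minimal sketch, assuming the standard fill-mask output of `transformers` (a list of dicts with `token_str`, `score` and `sequence` keys for a single-mask input), the helper below ranks predictions and optionally drops low-scoring ones. The names `rank_predictions` and `demo` and the `min_score` cutoff are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Tuple


def rank_predictions(results: List[Dict], min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return (token, score) pairs sorted by descending score.

    `results` is assumed to be the list of dicts a fill-mask pipeline
    returns for an input containing exactly one <mask> token.
    """
    pairs = [(r["token_str"].strip(), float(r["score"])) for r in results]
    kept = [p for p in pairs if p[1] >= min_score]
    return sorted(kept, key=lambda p: p[1], reverse=True)


def demo() -> None:
    """Run the base model on the example sentence (downloads the model)."""
    # Imported here so rank_predictions can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="BSC-TeMU/roberta-base-bne")
    for token, score in rank_predictions(fill_mask("¡Hola <mask>!", top_k=5)):
        print(f"{token}\t{score:.4f}")
```

Calling `demo()` fetches the model from the Hugging Face Hub on first use; swapping in `BSC-TeMU/roberta-large-bne` should work the same way.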

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Legal Language Model

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Owner

PlanTL-SANIDAD