
This repository contains data and code for our EMNLP 2021 paper "Models and Datasets for Cross-Lingual Summarisation". Please contact me at lperez@ed.ac.uk with any questions.

Please cite this paper if you use our code or data.

@InProceedings{clads-emnlp,
  author =      "Laura Perez-Beltrachini and Mirella Lapata",
  title =       "Models and Datasets for Cross-Lingual Summarisation",
  booktitle =   "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  year =        "2021",
  address =     "Punta Cana, Dominican Republic",
}

The XWikis Corpus

Our XWikis corpus is now on HuggingFace datasets. Follow this link to find all language subsets available for download. Thank you to Ronald Cardenas for helping with the upload to HF, and to Huajian Zhang and Guangyu Li for adding the Chinese subsets.
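As a rough illustration of downloading one language subset, here is a minimal sketch using the `load_dataset` function from the Hugging Face `datasets` library. The dataset id below is a placeholder and the helper `load_xwikis_subset` is hypothetical; neither comes from this repository. Take the real id and the subset names from the dataset page linked above.

```python
from typing import Optional

# ASSUMPTION: placeholder only. Replace it with the dataset id shown on the
# Hugging Face page linked above; the real id is not stated in this README.
HUB_DATASET_ID = "<hub-user>/<xwikis-dataset>"


def load_xwikis_subset(dataset_id: str, subset: Optional[str] = None, split: str = "train"):
    """Hypothetical helper: load one subset of the corpus from the Hub.

    `subset` is the configuration name listed on the dataset page, and
    `split` is one of the splits that the page lists for that subset.
    """
    if "<" in dataset_id or ">" in dataset_id:
        raise ValueError("Set dataset_id to the real id from the dataset page.")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, name=subset, split=split)
```

With the placeholder replaced, a call such as `load_xwikis_subset(real_id, subset=some_subset)` downloads that subset on first use and reads it from the local cache afterwards.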

The original XWikis corpus is available at XWikis Corpus.

Instructions to re-create our corpus and extract different languages are available here.

Cross-lingual Summarisation Code

Our code is based on Fairseq and mBART/mBART50. Our clone of Fairseq, together with the code extensions that implement our models, is available here; instructions to pre-process the data and to train and evaluate our models are available here.
