Active filters:

  • Language: English
  • Language: Polish
19 records found

Search results

  • WiKNN Text Classifier

    WiKNN is an online text classifier service for Polish and English texts. It supports hierarchical labelled classification of user-submitted texts with Wikipedia categories. WiKNN is available through a web-based interface (http://pelcra.clarin-pl.eu/tools/classifier/) and as a REST service with interactive documentation available at http://clarin.pelcra.pl/apidocs/wiknn.
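
    Since the record mentions a REST interface, a minimal Python sketch of how such a call might look is given below; the endpoint path and the request/response fields are assumptions, so consult the interactive documentation at http://clarin.pelcra.pl/apidocs/wiknn for the actual API.

      import requests

      # NOTE: the endpoint path and field names below are assumptions, not the
      # documented WiKNN API -- see http://clarin.pelcra.pl/apidocs/wiknn.
      API_URL = "http://clarin.pelcra.pl/api/wiknn/classify"  # hypothetical path

      def classify(text, lang="pl"):
          """Submit a text and return the service's JSON response
          (expected to contain Wikipedia category labels)."""
          response = requests.post(API_URL, json={"text": text, "lang": lang})
          response.raise_for_status()
          return response.json()

      if __name__ == "__main__":
          print(classify("Przykładowy tekst do klasyfikacji.", lang="pl"))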
  • Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
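
    For orientation, the sketch below runs one of these models through the public UDPipe REST service at https://lindat.mff.cuni.cz/services/udpipe/ rather than a local installation; the exact model identifier is an assumption and should be taken from the model documentation page above.

      import requests

      # The model identifier is an assumption -- check
      # https://ufal.mff.cuni.cz/udpipe/2/models for the exact name.
      URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
      params = {
          "model": "english-ewt-ud-2.10-220711",  # assumed identifier
          "tokenizer": "",  # run the tokenizer
          "tagger": "",     # run the POS tagger and lemmatizer
          "parser": "",     # run the dependency parser
          "data": "UDPipe is a tool for tokenization, tagging, lemmatization and parsing.",
      }
      response = requests.post(URL, data=params)
      response.raise_for_status()
      print(response.json()["result"])  # annotated text in CoNLL-U format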
  • Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • CUBBITT Translation Models (en-pl) (v1.0)

    CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl 12.3, pl->en 20.0 (evaluated using multeval: https://github.com/jhclark/multeval).
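
    As a rough illustration of using the models through the Lindat translation service (rather than through TensorFlow Serving directly), a small Python sketch follows; the endpoint URL and parameter names are assumptions, so check the service documentation at https://lindat.mff.cuni.cz/services/translation/ before relying on them.

      import requests

      # Endpoint and field names are assumptions -- see the API documentation
      # linked from https://lindat.mff.cuni.cz/services/translation/.
      URL = "https://lindat.mff.cuni.cz/services/translation/api/v2/languages/"
      payload = {"src": "en", "tgt": "pl", "input_text": "The weather is nice today."}

      response = requests.post(URL, data=payload)
      response.raise_for_status()
      print(response.text)  # the Polish translation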
  • CorpoGrabber

    CorpoGrabber: The Toolchain for Automatic Acquisition and Extraction of Website Content. Jan Kocoń, Wroclaw University of Technology. CorpoGrabber is a pipeline of tools that retrieves the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents; it requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language-independent. The whole process can be run in parallel on a single machine and includes the following tasks (the last two are available only for Polish); a rough orchestration sketch follows the reference list below:

      • downloading the HTML subpages of each input page URL [1],
      • extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers and advertisements) [2],
      • deduplication of the plain text [2],
      • removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3],
      • tagging the documents using the Wrocław CRF Tagger (WCRFT) [4].

    The result is a corpus: a set of tagged documents for each website.

    References:
    [1] https://www.httrack.com/html/faq.html
    [2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics, Brno.
    [3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain.
    [4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
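
    The toolchain itself is distributed as a set of external tools, so only a rough orchestration sketch can be given here. The sketch below wires up the first two steps (website mirroring with HTTrack [1] and boilerplate removal) using the httrack command-line tool and the jusText library, which implements the approach of [2]; the HTTrack flags and the choice of jusText are assumptions, not part of CorpoGrabber, and the MACA and WCRFT steps are only indicated as comments.

      import subprocess
      from pathlib import Path

      import justext  # boilerplate removal in the spirit of [2] (pip install justext)

      def download_site(url, out_dir, depth=2):
          # Mirror a website with HTTrack [1]; the flags are assumptions,
          # see `httrack --help` for the authoritative options.
          subprocess.run(["httrack", url, "-O", str(out_dir), f"-r{depth}"], check=True)

      def extract_plain_text(html_path, stoplist="Polish"):
          # Strip navigation links, headers, footers and advertisements
          # from a single downloaded HTML page.
          html = Path(html_path).read_bytes()
          paragraphs = justext.justext(html, justext.get_stoplist(stoplist))
          return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

      if __name__ == "__main__":
          download_site("http://example.com", Path("mirror"))
          for page in Path("mirror").rglob("*.html"):
              text = extract_plain_text(page)
              # Deduplication, MACA-based filtering and WCRFT tagging (the
              # remaining CorpoGrabber steps) would follow here; they require
              # the external MACA and WCRFT tools.
              print(page, len(text.split()))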
  • CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

    `corpipe23-corefud1.2-240906` is an `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 <https://github.com/ufal/crac2023-corpipe>. It is released under the CC BY-NC-SA 4.0 license. The model is language-agnostic (no corpus id on input), so it can in theory be used to predict coreference in any `mT5` language. However, the model expects the empty nodes to be already present on input, predicted by the baseline at https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/. This model was presented in the CorPipe 24 paper as an alternative to the single-stage approach, where the empty nodes are predicted jointly with coreference resolution (via http://hdl.handle.net/11234/1-5672); that approach is roughly twice as fast but of slightly worse quality.
  • Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • CorPipe 24 Multilingual CorefUD 1.2 Model (corpipe24-corefud1.2-240906)

    `corpipe24-corefud1.2-240906` is an `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 24 (https://github.com/ufal/crac2024-corpipe). It is released under the CC BY-NC-SA 4.0 license. The model is language-agnostic (no corpus id on input), so it can in theory be used to predict coreference in any `mT5` language. This model also jointly predicts the empty nodes needed for zero coreference. The paper introducing this model also presents an alternative two-stage approach that first predicts empty nodes (via https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/) and then performs coreference resolution (via http://hdl.handle.net/11234/1-5673), which is roughly twice as slow but slightly better.
  • Parallel Corpora from Comparable Corpora tool

    The script consists of two parts: an article parser and an aligner. Required software (install before using the script): yalign; additional Ubuntu packages: mongodb, ipython, python-nose, python-werkzeug.

    Wiki article parser. The article parser works in two steps: it extracts articles from wiki dumps and saves the extracted articles to a local DB (MongoDB). Before using the parser, the wiki dumps should be downloaded and extracted to some directory (the directory should contain the *.xml and *.sql files). For each language two dump files should be downloaded - the articles dump and the language-link dump, for example:
    PL: http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2 and http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-langlinks.sql.gz
    EN: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 and http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-langlinks.sql.gz
    IMPORTANT NOTE: the English dumps require about 50 GB of free space after extraction, and during parsing the parser can require up to 8 GB of RAM. The article parser has a "main language" option: articles in other languages are extracted only if they also exist in the main language. E.g. if the main language is PL, the extractor first extracts all PL articles and then articles in other languages, but only those that have a PL counterpart. This reduces space requirements.
    For help use: $ python parse_wiki_dumps.py -h
    Example command: $ python parse_wiki_dumps.py -d ~/temp/wikipedia_dump/ -l pl -v

    Wikipedia aligner. The aligner can be used once articles have been extracted from the dumps. It takes article pairs for a given language pair, aligns the text and saves the parallel corpora to two files. The "-s" option can be used to limit the number of symbols per file (by default 50000000 symbols, that is around 50-60 MB). By default the aligner tries to continue aligning where it was stopped; to force aligning from the beginning use the "--restart" option.
    For help use: $ python align.py -h
    Example command: $ python align.py -o wikipedia -l en-pl -v

    Euronews crawler. The crawler finds links to articles using the euronews archive http://euronews.com/2004/ and in parallel extracts and saves the article texts to the DB.
    For help use: $ python parse_euronews.py -h
    Example command: $ python parse_euronews.py -l en,pl -v

    Euronews aligner. Starting the aligner for euronews articles: $ python align.py -o euronews -l en-pl -v

    Saving articles in plain text. The script "save_plain_text.py" can be used to save all articles in plain-text format; it accepts the path for saving the articles, the languages of the articles to be saved, and the source of the articles (euronews, wikipedia).
    For help use: $ python save_plain_text.py -h
    Example command: $ python save_plain_text.py -l en,pl -r [path] -o euronews

    Yalign selection. This script tries random parameters for the yalign model in order to find the best parameters for aligning the provided text samples. Before using the yalign_selection script, prepare article samples using the prepare_random_sampling.py script. Creating a folder with article samples can be done with this command:
    $ python prepare_random_sampling.py -o wikipedia -c 10 -l ru-en -v
    where -o wikipedia is the source of the articles (wikipedia or euronews), -c 10 is the number of articles to extract, and -l ru-en gives the languages to extract. This script creates an "article_samples" folder with article files; you then create the manually aligned files (you need to align the articles of the second language - in this example you need to align the "en" file; files named "_orig" should be left unmodified). When the manual aligning is ready you can run the selection script, for example:
    $ python yalign_selection.py --samples article_samples/ --lang1 ru --lang2 en --threshold 0.1536422609112349e-6 --threshold_step 0.0000001 --threshold_step_count 10 --penalty 0.014928930455303857 --penalty_step 0.0001 --penalty_step_count 1 -m ru-en
    Here is what each parameter means: --samples article_samples/ is the path to the article-samples folder; --lang1 ru --lang2 en are the languages to align (the articles of the second language should be aligned manually; the script uses the "??_orig" files, aligns them automatically and compares the result with the manual alignment); --threshold 0.1536422609112349e-6 is the threshold value of the model, and the selection is made around this value; --threshold_step 0.0000001 is the step for changing the value; --threshold_step_count 10 is the number of steps to check below and above the value (e.g. with value 10, step 1 and count 2 the script checks 8, 9, 10, 11, 12); the penalty parameters work the same way; -m ru-en is the path to the yalign model. You can also use --length and --similarity to tweak the comparison of text lines in the files: --length is the minimum difference in length needed to mark lines as similar (1 - same length, 0.5 - at least half of the length); --similarity is the similarity of the text in the lines (1 - exactly the same, 0 - completely different; for the similarity check, sentences are compared as sequences of characters). The script already supports multiprocessing; use the -t option to set the number of threads (by default it uses as many threads as CPUs). For additional parameters use the '-h' option. When yalign_selection.py finishes, it produces a CSV file whose first column is the threshold, second column the penalty, and third column the similarity for those parameters (a short sketch that reads this file follows below).

    Align with the hunalign method. To use hunalign, add the "--hunalign" option to the align.py script, for example: $ python align.py -l li-hu -r align_result -o wikipedia --hunalign In my empirical study it gives better results when the articles are translations of each other or similar in length and content.

    Align from folder. For aligning already aligned texts using hunalign, an example command is: $ python align_aligned_using_hunalign.py source/ target/

    Final info. Wołk, K., & Marasek, K. (2015, September). Tuned and GPU-accelerated parallel data mining from comparable corpora. In International Conference on Text, Speech, and Dialogue (pp. 32-40). Springer International Publishing. http://arxiv.org/pdf/1509.08639 For more detailed usage instructions see howto.pdf. For any questions: Krzysztof Wolk, krzysztof@wolk.pl.
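
    As a small illustration of how the yalign_selection output could be consumed, the sketch below reads the produced CSV (threshold, penalty and similarity columns, as described above) and prints the parameter pair with the highest similarity; the file name and the absence of a header row are assumptions.

      import csv

      # Assumed file name and layout: three columns (threshold, penalty,
      # similarity) with no header row, as described in the record above.
      best = None
      with open("yalign_selection_results.csv", newline="") as f:
          for threshold, penalty, similarity in csv.reader(f):
              row = (float(threshold), float(penalty), float(similarity))
              if best is None or row[2] > best[2]:
                  best = row

      if best is not None:
          print(f"best threshold={best[0]}, penalty={best[1]}, similarity={best[2]}")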