CLARIN Tool Portal

PELCRA for National Corpus of Polish Search Engine 2

2 resources

The PELCRA for NKJP search engine 2 provides access to the full National Corpus of Polish dataset (over 1.5 billion word tokens). In addition to linguistically motivated corpus queries, it supports a number of data exploration and visualisation features. Most of the functionality of the search engine is available through a REST web service. Access to the API is available upon request.

The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.2

2 resources

The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). The estimated F1 of the lemma annotations is ~97.6. The difference to the previous version is that now it relies solely on XPOS annotations, and not on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.

PyTorch model for Slovenian Coreference Resolution

2 resources

Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with the code published on https://github.com/matejklemen/slovene-coreference-resolution. The model is based on the Slovenian CroSloEngual BERT 1.1 model (http://hdl.handle.net/11356/1330). It was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747), specifically the SentiCoref subcorpus. Using the evaluation setting where entity mentions are assumed to be correctly pre-detected, the model achieves the following metric values:
MUC: precision = 0.931, recall = 0.957, F1 = 0.943
BCubed: precision = 0.887, recall = 0.947, F1 = 0.914
CEAFe: precision = 0.945, recall = 0.893, F1 = 0.916
CoNLL-12: precision = 0.921, recall = 0.932, F1 = 0.924
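The reported CoNLL-12 values appear to be the unweighted mean of the MUC, BCubed and CEAFe values. The following is a minimal sketch that checks this arithmetic; the dictionary layout and the function name `conll_average` are illustrative and are not part of the published code.

```python
from statistics import mean

# Scores copied from the description above: (precision, recall, F1).
scores = {
    "MUC": (0.931, 0.957, 0.943),
    "BCubed": (0.887, 0.947, 0.914),
    "CEAFe": (0.945, 0.893, 0.916),
}


def conll_average(metric_scores):
    """Average precision, recall and F1 column-wise over all metrics."""
    columns = zip(*metric_scores.values())
    return tuple(round(mean(column), 3) for column in columns)


conll = conll_average(scores)
print(conll)
```

Rounded to three decimals, the result matches the CoNLL-12 line in the listing.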

The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

3 resources

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1790). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87. The difference to the previous version of the model is that this version was trained using the new version of the hr500k corpus and the new version of the Croatian word embeddings.

KPWr n82 NER model (on Polish RoBERTa base)

2 resources

The named entity recognition model for fine-grained categories of entities (82 types) was trained on the KPWr corpus using the Polish RoBERTa base language model. Details can be found on the following page: https://github.com/mczuk/xlm-roberta-ner

MCSQ Translation Models (en-ru) (v1.0)

2 resources

En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main use should be in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on the MCSQ test set (BLEU, evaluated using multeval: https://github.com/jhclark/multeval):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serbian 1.2

3 resources

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) and using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.2. The difference to the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .

Corpus2MWE

2 resources

A CCL reader (Corpus2) with MWE detection.

CORDEX inflectional lookup data 1.0

2 resources

The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas, and maps these lemmas to their corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. finding the most frequent singular word form of the lemma "centralen", i.e. "centralni").
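The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the feature names, the function `most_frequent_form` and the frequency counts are all hypothetical, and the real pickled dictionary may be organised differently.

```python
# Hypothetical toy structure: lemma -> list of word-form records.
# Feature names and frequency counts are invented for illustration only;
# they are not taken from the CORDEX data or from Gigafida.
lookup = {
    "centralen": [
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 900},
        {"form": "centralen", "features": {"number": "singular"}, "frequency": 300},
        {"form": "centralne", "features": {"number": "plural"}, "frequency": 500},
    ]
}


def most_frequent_form(lookup, lemma, **required):
    """Return the most frequent word form of `lemma` whose features
    match every key/value pair in `required`, or None if none match."""
    candidates = [
        record
        for record in lookup.get(lemma, [])
        if all(record["features"].get(key) == value for key, value in required.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda record: record["frequency"])["form"]


print(most_frequent_form(lookup, "centralen", number="singular"))
```

With the toy counts above, the singular filter excludes "centralne" and the higher-frequency singular form is returned.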
