CLARIN Tool Portal

MCSQ Translation Models (en-ru) (v1.0)

2 resources

En-Ru translation models, exported via TensorFlow Serving and available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained on the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip) and are intended mainly for in-domain translation of social surveys. The models are compatible with Tensor2Tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.

Evaluation on the MCSQ test set (BLEU, evaluated using multeval: https://github.com/jhclark/multeval):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serbian 1.2

3 resources

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) and using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.2. The difference from the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.

Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

2 resources

Tokenizer, POS tagger, lemmatizer and parser models for 99 treebanks of 63 languages of the Universal Dependencies 2.6 treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .

Corpus2MWE

2 resources

A CCL reader (Corpus2) with MWE detection.

CORDEX inflectional lookup data 1.0

2 resources

The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas and maps these lemmas to their corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic description, relevant features (custom features extracted from morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. finding the most frequently used singular word form of the lemma "centralen", i.e. "centralni").
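
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the cordex implementation: the in-memory layout (`LOOKUP`), the field names, the helper `most_frequent_form` and the frequency values are all hypothetical placeholders chosen for the example, and the real pickled dictionary may be structured differently.

```python
from typing import Optional

# Hypothetical layout: lemma -> list of word-form entries.
# Frequencies are invented placeholders, not real Gigafida counts.
LOOKUP = {
    "centralen": [
        {"form": "centralni", "msd": "Agpmsny",
         "features": {"number": "singular"}, "frequency": 300},
        {"form": "centralen", "msd": "Agpmsnn",
         "features": {"number": "singular"}, "frequency": 100},
        {"form": "centralne", "msd": "Agpfpn",
         "features": {"number": "plural"}, "frequency": 200},
    ]
}


def most_frequent_form(lookup: dict, lemma: str, **required: str) -> Optional[str]:
    """Return the most frequent word form of `lemma` whose features
    match every key/value pair in `required`, or None if none match."""
    candidates = [
        entry for entry in lookup.get(lemma, [])
        if all(entry["features"].get(key) == value
               for key, value in required.items())
    ]
    if not candidates:
        return None
    # Pick the candidate with the highest corpus frequency.
    return max(candidates, key=lambda entry: entry["frequency"])["form"]


print(most_frequent_form(LOOKUP, "centralen", number="singular"))
```

With the placeholder data above, the filter keeps only the singular forms and the frequency comparison then picks "centralni", mirroring the example in the description.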

Grafon

1 resource

Representation of sentence semantics with deepened semantic graphs. The graphs are composed from the output of the saper tool (https://clarin-pl.eu/dspace/handle/11321/278).

SuperMatrix

2 resources

SuperMatrix is a system that supports the automatic extraction of semantic relations, based on the analysis of large text corpora. The system was developed as a tool for expanding the Polish wordnet (Słowosieć). Expansion consists of two steps: the system suggests potential links between lexical units, and linguists then verify these suggestions and decide which form will go into the wordnet. This sped up the work and preserved the integrity of data entry.

Malach Center User Interface 1.0

2 resources

Source code of the first full and running version of the Malach Center User Interface. It does not contain data or metadata of the digital objects and resources.

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian

2 resources

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). The estimated F1 of the lemma annotations is ~99.0.

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Croatian

3 resources

The model for UD dependency parsing of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The estimated LAS of the parser is ~85.9.
