CLARIN Tool Portal

Toposław

2 resources

Toposław is an editor of multi-word unit inflection lexicons.

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Serbian 1.1

2 resources

The model for lemmatisation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200), the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the hr500k training corpus (http://hdl.handle.net/11356/1183) and the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/), using the srLex inflectional lexicon (http://hdl.handle.net/11356/1233). To handle text with missing diacritics, the training data were augmented by repeating parts of the corpora with the diacritics removed. The estimated F1 of the lemma annotations is ~97.62. Unlike the previous version, the lemmatizer now relies solely on XPOS annotations rather than on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.
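
The diacritic-removal augmentation mentioned above can be illustrated with a minimal Python sketch. It is not the authors' actual preprocessing code: the function names are hypothetical, the sketch targets Latin-script text only, and mapping "đ" to plain "d" is an assumption made for illustration.

```python
import unicodedata

# Assumption: "đ"/"Đ" have no Unicode decomposition, so they are mapped by hand.
# Mapping them to plain "d"/"D" is an illustrative choice, not necessarily the
# one used when building the CLASSLA training data.
EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text: str) -> str:
    """Remove diacritics from Latin-script text (č -> c, š -> s, ž -> z, ...)."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA))
    # Drop the combining marks that NFD split off from their base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def augment(sentences):
    """Return the original sentences plus diacritic-free copies of those that change."""
    stripped = [strip_diacritics(s) for s in sentences]
    return list(sentences) + [t for s, t in zip(sentences, stripped) if t != s]
```

In this sketch only sentences that actually contain diacritics are repeated, so the augmented data does not gain exact duplicates.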

The CLASSLA-Stanza model for morphosyntactic annotation of standard Macedonian 2.1

3 resources

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published) and using the Macedonian CLARIN.SI word embeddings (http://hdl.handle.net/11356/1788). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.14. Compared with the previous version, this version was trained on a larger training dataset and with the new version of the Macedonian word embeddings.
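
To show how the three label types relate, here is a small Python sketch that reads them from one token line. It assumes the annotations have been exported in the tab-separated, ten-column CoNLL-U format; the description above does not state the output format, and the helper names are hypothetical.

```python
from typing import Dict, NamedTuple


class Morph(NamedTuple):
    form: str
    upos: str              # universal part-of-speech tag
    xpos: str              # language-specific (here MULTEXT-East) tag
    feats: Dict[str, str]  # universal morphological features


def parse_token_line(line: str) -> Morph:
    """Extract FORM, UPOS, XPOS and FEATS from one CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    form, upos, xpos, feats_raw = cols[1], cols[3], cols[4], cols[5]
    # FEATS is "_" when empty, otherwise "Key=Value" pairs joined by "|".
    if feats_raw == "_":
        feats = {}
    else:
        feats = dict(pair.split("=", 1) for pair in feats_raw.split("|"))
    return Morph(form, upos, xpos, feats)


# Hand-written line for illustration only; it is not actual model output.
example = "1\tкуќа\tкуќа\tNOUN\tNcfsnn\tGender=Fem|Number=Sing\t_\t_\t_\t_"
token = parse_token_line(example)
```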

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.0

3 resources

This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training corpus (to be published) and using the Macedonian CLARIN.SI word embeddings (http://hdl.handle.net/11356/1359). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.6.

Tagger SentiOne - version 1

2 resources

The SentiOne tagger is a tagger for Polish adapted to processing user-generated content. It was trained on the Polish UGC-corpus (prepared within the same research project and soon to become available in the CLARIN repository).

A lexicographical browser for DjVu

4 resources

The program is an indexer and browser for scans of lexicographical paper slips. The slips are presented in DjVu format, and a relational database stores the information about them. The program integrates three approaches: incremental search, binary search, and so-called occasional indexing, which refines the stored information while searching. Together they make browsing easy and convenient.

Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

9 resources

This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd

In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained on the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf) and the English models on the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf), using the standard recurrent sequence-to-sequence architecture.

There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.

To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure that the model works correctly, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.

Each model directory contains a 'run.ini' Neural Monkey configuration file, which is used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to match your filesystem layout). The 'experiment.ini' file, which was used to train the model, is also included. The remaining files contain the model itself, the input and output vocabularies, etc.

For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/

For machine translation, you do not need to tokenize the data, as this is done by the model.

For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py); you need to pass this script the path to ResNet and the path to the TensorFlow models

The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.

Feel free to contact the authors of this submission in case you run into problems!
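
Pinning Neural Monkey to the commit recorded in a model's 'git_commit' file could look roughly like the Python sketch below. It assumes the file contains nothing but the commit hash, which the description does not confirm; the function name and the directory names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import List

HEX_DIGITS = set("0123456789abcdefABCDEF")


def checkout_command(model_dir: str) -> List[str]:
    """Build the `git checkout` command for the commit named in <model_dir>/git_commit."""
    # Assumption: the file holds only the commit hash, possibly with a trailing newline.
    commit = (Path(model_dir) / "git_commit").read_text(encoding="utf-8").strip()
    if not commit or not set(commit) <= HEX_DIGITS:
        raise ValueError(f"unexpected content in git_commit: {commit!r}")
    return ["git", "checkout", commit]
```

The returned list can be passed to subprocess.run(..., cwd="neuralmonkey", check=True), where "neuralmonkey" stands for wherever the repository was cloned.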

EVALD 1.0

3 resources

EVALD 1.0 serves for automatic evaluation of surface coherence (cohesion) in Czech texts written by native speakers of Czech.

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .

Depfix: Automatic Post-editing of SMT

4 resources

Depfix is a tool for automatic post-editing of SMT output. See the project website for more information.
