Search results

703 record(s) found

  • Web-based Annotation Explorer

    Annex (Annotation Explorer) is a web-based tool for exploring and viewing annotated multimedia recordings in an archive. ANNEX can play audio and video files in a web browser alongside annotations in a variety of formats: ELAN (EAF), Shoebox/Toolbox text, CHAT (the CHILDES annotation format), plain text, CSV, PDF, SubRip, Praat TextGrid, HTML and XML. ANNEX visualises the annotations in synchrony with the media files as long as time-alignment information is available; if no time-alignment information is available, a default segment duration is assumed. Its graphical interface resembles that of the ELAN annotation tool to some extent, offering several view modes such as subtitle view, timeline view and grid view. ANNEX runs in any modern web browser with the Adobe Flash plugin (version 10 or later) installed. ANNEX has been functionally extended with the help of the following CLARIN-NL-funded projects:
    - ColTime: Collaboration on Time-Based Resources
    - EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX
    - MultiCon: Multilayer Concordance Functions in ELAN and ANNEX
    - SignLinC: Linking lexical databases and annotated corpora of signed languages
    Over the years, many funders have contributed to the development of ANNEX in several projects, including the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research and the Max Planck Society.
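    The time-synchronised display described above amounts to an interval lookup: given the current media playback time, show the annotations whose interval covers it, falling back to a default segment duration when no alignment exists. A minimal sketch of that idea; the data layout, names, and default duration are illustrative and do not reflect ANNEX's internal model:

    ```python
    # Sketch of time-synchronised annotation lookup, as ANNEX-style viewers
    # perform it. Annotations are (start, end, text) tuples in seconds.

    DEFAULT_SEGMENT = 2.0  # assumed fallback duration when no alignment exists

    def align(annotations):
        """Fill in missing end times with a default segment duration."""
        aligned = []
        for start, end, text in annotations:
            if end is None:
                end = start + DEFAULT_SEGMENT
            aligned.append((start, end, text))
        return aligned

    def active_at(annotations, t):
        """Return the texts of all annotations whose interval covers time t."""
        return [text for start, end, text in annotations if start <= t < end]

    tier = align([(0.0, 1.5, "hello"), (1.5, None, "world")])
    ```

    A viewer would call `active_at(tier, t)` on every playback tick to refresh the subtitle or timeline view.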
  • Ucto Tokeniser Engine

    Ucto is a language-independent tokenisation engine: given an external configuration file with tokenisation rules for a specific language, it yields a tokeniser for that language. The tokeniser separates words from punctuation and splits sentences, one of the first tasks in almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as case conversion, to prepare text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output, optionally accepts other encodings as input, and can optionally convert text to all lowercase or uppercase. Ucto supports FoLiA XML.
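    The configuration-driven design can be illustrated with a toy sketch: an ordered set of named regular-expression rules, loaded from an external source, drives a generic tokeniser. The rule syntax below is invented for illustration and is not Ucto's actual configuration format:

    ```python
    import re

    # Toy language-independent tokeniser driven by external rules, in the
    # spirit of Ucto's design. In a real setup the rules would live in one
    # configuration file per language; this inline string stands in for it.
    RULES_NL = r"""
    ABBREVIATION  \b(?:dhr|mevr|bijv)\.
    WORD          \w+
    PUNCTUATION   [^\w\s]
    """

    def load_rules(config_text):
        rules = []
        for line in config_text.splitlines():
            if not line.strip():
                continue
            name, pattern = line.split(None, 1)
            rules.append((name, re.compile(pattern)))
        return rules

    def tokenise(text, rules):
        """Scan left to right; the first matching rule wins (ordered rules)."""
        tokens, pos = [], 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            for name, pattern in rules:
                m = pattern.match(text, pos)
                if m:
                    tokens.append((name, m.group()))
                    pos = m.end()
                    break
            else:
                pos += 1  # skip characters no rule matches
        return tokens
    ```

    Rule order matters: because ABBREVIATION precedes WORD, "dhr." is kept as one token with its period instead of being split into a word plus sentence-final punctuation.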
  • Saper

    Saper is a shallow semantic parser for processing Polish texts. It performs word sense disambiguation, mapping to SUMO concepts, and semantic role labelling.
  • The CLASSLA-Stanza model for semantic role labeling of standard Slovenian 2.0

    The model for semantic role labeling of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) extended with the MaCoCu-sl Slovenian web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the semantic role annotations is ~76.24. The difference from the previous version is that this model was trained on the SUK training corpus with the updated word embeddings.
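    The F1 figure quoted for such models is the harmonic mean of precision and recall over the predicted role annotations. A minimal sketch of that computation; the (token id, role) pairs below are invented examples, not SUK data:

    ```python
    def f1_score(gold, predicted):
        """Micro F1 over labelled items, e.g. (token_id, role) pairs."""
        gold, predicted = set(gold), set(predicted)
        tp = len(gold & predicted)  # true positives: exact label matches
        if tp == 0:
            return 0.0
        precision = tp / len(predicted)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)

    # Invented toy annotations: two of three predictions match the gold set.
    gold = {(1, "ACT"), (3, "PAT"), (5, "LOC")}
    pred = {(1, "ACT"), (3, "PAT"), (6, "TIME")}
    ```

    With two of three predictions correct against three gold roles, precision and recall are both 2/3, so F1 is 2/3 as well.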
  • The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0

    This model for named entity recognition of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.
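    The diacritic-removal augmentation described above can be sketched with Unicode decomposition: č, ć, š and ž lose their marks under NFD, while Croatian đ/Đ carries a stroke rather than a combining mark and needs an explicit mapping. A minimal illustration, not the actual CLASSLA training code:

    ```python
    import unicodedata

    # đ/Đ do not decompose into base letter + combining mark, so map them
    # explicitly; the other Croatian diacritics are handled via NFD.
    EXPLICIT = str.maketrans({"đ": "d", "Đ": "D"})

    def strip_diacritics(text):
        """Remove diacritics: č/ć -> c, š -> s, ž -> z, đ -> d."""
        text = text.translate(EXPLICIT)
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))
        return unicodedata.normalize("NFC", stripped)

    def augment(sentences):
        """Repeat each sentence with diacritics removed, keeping the original."""
        out = list(sentences)
        for s in sentences:
            bare = strip_diacritics(s)
            if bare != s:
                out.append(bare)
        return out
    ```

    Training on both variants teaches the tagger to handle the diacritic-less spellings common in non-standard web and social-media text.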
  • SuperMatrix

    SuperMatrix is a system that supports the automatic extraction of semantic relations, based on the analysis of large text corpora. It was developed as a tool for expanding the Polish wordnet (Słowosieć). Expansion consists of two steps: the system suggests potential links between lexical units, and linguists then verify these suggestions and decide which ones enter the wordnet. This speeds up the work while preserving the integrity of data entry.
  • The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2

    This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.
  • The CLASSLA-StanfordNLP model for UD dependency parsing of standard Serbian

    The model for UD dependency parsing of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) and using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). The estimated LAS of the parser is ~89.0.
  • The CLASSLA-StanfordNLP model for UD dependency parsing of standard Bulgarian 1.0

    The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the BulTreeBank training corpus (http://hdl.handle.net/11495/D93F-C6E9-65D9-2) and using the CoNLL2017 word embeddings (http://hdl.handle.net/11234/1-1989). The estimated LAS of the parser is ~91.5.
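    The LAS (labelled attachment score) quoted for these parsers is the share of tokens that receive both the correct head and the correct dependency label. A minimal sketch over toy trees; the (head, label) pairs below are invented, not taken from either treebank:

    ```python
    def las(gold, predicted):
        """Labelled attachment score: fraction of tokens whose predicted
        (head, label) pair matches the gold annotation exactly."""
        assert len(gold) == len(predicted)
        correct = sum(1 for g, p in zip(gold, predicted) if g == p)
        return correct / len(gold)

    # One (head index, dependency label) pair per token; head 0 is the root.
    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
    ```

    Here the third token has the right label but the wrong head, so only two of three tokens count as correct and LAS is 2/3; unlabelled attachment score (UAS) would check heads alone.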