VIADAT-STAT is a VIADAT module for statistical analysis of recordings stored by the platform.
Developed in cooperation with ÚSD AV ČR and NFA.
ZRCalo is an open font meant to gradually phase out the ZRCola font as one of the components of the ZRCola 2 input system (http://hdl.handle.net/11356/1090). The current version is a baseline variant covering the basic Latin Unicode blocks. Future versions will aim to build on Unicode's combining character mechanism to replace ZRCola's extensive use of the Private Use Area.
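The difference between a combining sequence and a Private Use Area code point can be illustrated with Python's standard `unicodedata` module. This is only a minimal sketch: the schwa-plus-acute pair and the helper `is_private_use` are illustrative assumptions, not taken from the ZRCola or ZRCalo character tables.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if the code point lies in a Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"


# Illustrative pair (an assumption, not a ZRCola mapping):
# LATIN SMALL LETTER SCHWA followed by COMBINING ACUTE ACCENT.
combined = "\u0259\u0301"

# Unicode has no precomposed character for this pair, so NFC
# normalization leaves it as two standard code points.
nfc = unicodedata.normalize("NFC", combined)

for ch in nfc:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "private use:", is_private_use(ch))

# A Private Use Area code point carries no standard name or meaning.
print("U+E000 private use:", is_private_use("\ue000"))
```

A combining sequence stays interpretable by any Unicode-aware software, whereas a Private Use Area code point depends on the font that assigns it a glyph.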
This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~90.42.
The difference from the previous version is that this model was trained on the improved SUK 1.1 version of the training corpus.
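The LAS (labeled attachment score) reported above is the percentage of tokens that receive both the correct head and the correct dependency relation. The following is a simplified sketch that assumes gold and predicted tokenization are identical; the function name `attachment_scores` and the toy data are hypothetical, and the evaluation behind the reported figure may differ in its details.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for token-aligned gold and predicted arcs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS counts tokens with the correct head; LAS also requires the correct label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy four-token sentence (invented data, not model output).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 75.00, LAS = 50.00
```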
The "Samrómur DeepSpeech Recipe 22.06" is a code recipe that shows how to combine the "Samromur 21.05" corpus [1] and the "DeepSpeech Scorer for Icelandic 22.06" [2] to create automatic speech recognition systems using Mozilla's DeepSpeech recognizer [3].
This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and the Janes-Tag corpus (http://hdl.handle.net/11356/1732), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). These corpora were additionally augmented to handle missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.17.
The difference from the previous version is that this model was trained on the SUK training corpus and the 3.0 version of Janes-Tag, and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
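The diacritic augmentation mentioned above (repeating parts of the corpora with diacritics removed) can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the model: the functions `strip_diacritics` and `augment`, the `fraction` parameter, the random selection of sentences and the toy corpus are all assumptions.

```python
import random
import unicodedata
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, tag) pairs


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'č' -> 'c', 'š' -> 's', 'ž' -> 'z'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def augment(corpus: List[Sentence], fraction: float, seed: int = 0) -> List[Sentence]:
    """Append diacritic-free copies of a random share of sentences (assumed scheme)."""
    rng = random.Random(seed)
    extra = []
    for sentence in corpus:
        if rng.random() < fraction:
            # Only the word forms change; the annotation is kept as-is.
            extra.append([(strip_diacritics(form), tag) for form, tag in sentence])
    return corpus + extra


# Toy annotated sentence (hand-written, UPOS tags only).
corpus = [[("Žabe", "NOUN"), ("so", "AUX"), ("šle", "VERB"), ("čez", "ADP"), ("cesto", "NOUN")]]
augmented = augment(corpus, fraction=1.0)
print(augmented[1])
```

Letters without a canonical decomposition, such as "đ", are not affected by this approach and would need an explicit mapping.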
This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~98.27.
The difference from the previous version is that this model was trained on the SUK training corpus and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
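The three label layers named above (UPOS, FEATS and XPOS) occupy separate columns of the CoNLL-U format. The sketch below reads them from a single token line; the helper `parse_token_line` is hypothetical, and the example line is hand-written for illustration rather than actual model output.

```python
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into named fields, with FEATS as a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("a CoNLL-U token line has exactly 10 tab-separated fields")
    token = dict(zip(CONLLU_FIELDS, values))
    feats = token["feats"]
    # "_" marks an empty FEATS column; otherwise split "Name=Value|Name=Value".
    token["feats"] = {} if feats == "_" else dict(
        pair.split("=", 1) for pair in feats.split("|"))
    return token


# Hand-written example: "hiša" (house) as a feminine singular nominative noun.
line = "1\thiša\thiša\tNOUN\tNcfsn\tCase=Nom|Gender=Fem|Number=Sing\t0\troot\t_\t_"
token = parse_token_line(line)
print(token["upos"], token["xpos"], token["feats"])
```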
The Icelandic Error Corpus (IEC) was used to fine-tune the Icelandic language model IceBERT for sentence classification. The objective was to train grammatical error detection models that can classify whether a sentence contains a particular error type. The model can mark sentences as including one or more of the following issues: coherence, grammar, orthography, other, style and vocabulary. The overall F1 score is a modest 64%.
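Because a sentence can carry several of the six labels at once, an overall F1 score has to aggregate over label sets. The sketch below computes a micro-averaged F1; the record does not say which averaging was used, so micro-averaging is an assumption, and the function `micro_f1` and the toy data are hypothetical.

```python
from typing import List, Set

LABELS = ("coherence", "grammar", "orthography", "other", "style", "vocabulary")


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets (assumed averaging scheme)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for three sentences (invented, not from the IEC).
gold = [{"grammar"}, {"orthography", "style"}, set()]
predicted = [{"grammar"}, {"orthography"}, {"vocabulary"}]
print(f"micro-F1 = {micro_f1(gold, predicted):.3f}")
```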
This model for named entity recognition of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200), the hr500k training corpus (http://hdl.handle.net/11356/1183), the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240) and the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.
EVALD 1.0 for Foreigners is software for automatic evaluation of surface coherence (cohesion) in Czech texts written by non-native speakers of Czech.