Active filters:

  • Project: Development of Slovene in a Digital Environment
33 records found

Search results

  • The CLASSLA-Stanza model for semantic role labeling of standard Slovenian 2.0

    The model for semantic role labeling of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) extended with the MaCoCu-sl Slovenian web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the semantic role annotations is ~76.24. The difference from the previous version is that the model was trained on the SUK training corpus and uses the updated word embeddings.
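
    A minimal usage sketch with the CLASSLA-Stanza Python package (the srl processor name and the to_conll() output follow the CLASSLA-Stanza documentation; the example sentence and download defaults are illustrative assumptions, so check the repository for the exact interface):

      import classla

      # Download the standard Slovenian models, including semantic role labeling.
      classla.download('sl')

      # Build a pipeline ending in the srl processor; SRL depends on
      # tokenization, tagging, lemmatization and dependency parsing.
      nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse,srl')

      doc = nlp('France Prešeren je rojen v Vrbi.')
      print(doc.to_conll())  # SRL labels appear in the CoNLL output
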
  • The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2

    This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.
  • Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

    The monolingual Slovene RoBERTa (A Robustly Optimized BERT Pretraining Approach) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. Compared with the previous version (1.0), this version was trained for a further 61 epochs (v1.0: 37 epochs, v2.0: 98 epochs), for a total of 200,000 iterations/updates. The released model is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers; sloberta.2.0.transformers.tar.gz) or the fairseq library (https://github.com/pytorch/fairseq; sloberta.2.0.fairseq.tar.gz).
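
    A short sketch of loading the model with the transformers library (assuming the checkpoint from sloberta.2.0.transformers.tar.gz has been unpacked locally; the local path is a placeholder, and the EMBEDDIA/sloberta Hub checkpoint can be substituted for it):

      from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

      # Point this at the unpacked sloberta.2.0.transformers.tar.gz directory
      # (hypothetical path), or use the 'EMBEDDIA/sloberta' Hub id instead.
      model_path = './sloberta-2.0'
      tokenizer = AutoTokenizer.from_pretrained(model_path)
      model = AutoModelForMaskedLM.from_pretrained(model_path)

      # SloBERTa is a masked language model, so fill-mask is a quick sanity check.
      fill = pipeline('fill-mask', model=model, tokenizer=tokenizer)
      for pred in fill('Ljubljana je glavno mesto <mask>.'):
          print(pred['token_str'], round(pred['score'], 3))
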
  • CORDEX inflectional lookup data 1.0

    The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas and maps these lemmas to their corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from the morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. find the most frequent word form of the lemma "centralen" in the singular, i.e. "centralni").
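
    A minimal sketch of how such a pickled lookup might be queried (the file name and the per-entry layout of form/MSD/features/frequency are assumptions here; see https://github.com/clarinsi/cordex/ for the actual interface):

      import pickle

      # Load the pickled lemma -> word-form dictionary (hypothetical file name).
      with open('inflectional_lookup.pickle', 'rb') as f:
          lookup = pickle.load(f)

      # Assume each entry carries the form, its MULTEXT-East MSD, extracted
      # features, and its Gigafida frequency.
      forms = lookup.get('centralen', [])
      singular = [e for e in forms if 'singular' in e['features']]
      best = max(singular, key=lambda e: e['frequency'], default=None)
      print(best['form'] if best else 'no matching form')  # expected: 'centralni'
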
  • The CLASSLA-Stanza model for JOS dependency parsing of standard Slovenian 2.0

    This model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~93.89. The difference from the previous version is that the model was trained on the SUK training corpus and uses the updated embeddings.
  • PyTorch model for Slovenian Coreference Resolution

    Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with the code published at https://github.com/matejklemen/slovene-coreference-resolution. The model is based on the Slovenian CroSloEngual BERT 1.1 model (http://hdl.handle.net/11356/1330). It was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747), specifically the SentiCoref subcorpus. Using the evaluation setting where entity mentions are assumed to be correctly pre-detected, the model achieves the following metric values: MUC: precision = 0.931, recall = 0.957, F1 = 0.943; BCubed: precision = 0.887, recall = 0.947, F1 = 0.914; CEAFe: precision = 0.945, recall = 0.893, F1 = 0.916; CoNLL-12: precision = 0.921, recall = 0.932, F1 = 0.924.
  • The CLASSLA-Stanza model for morphosyntactic annotation of non-standard Slovenian 2.1

    This model for morphosyntactic annotation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and the Janes-Tag corpus (http://hdl.handle.net/11356/1732), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~92.17. The difference from the previous version is that the model was trained on the SUK training corpus and version 3.0 of Janes-Tag, and uses the new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
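
    A brief sketch of selecting the non-standard models in CLASSLA-Stanza (the type='nonstandard' argument follows the CLASSLA documentation as I understand it; the example text is illustrative):

      import classla

      # Download and build a pipeline with the non-standard Slovenian models.
      classla.download('sl', type='nonstandard')
      nlp = classla.Pipeline('sl', type='nonstandard', processors='tokenize,pos')

      doc = nlp('kva dogaja, pridemo jutr zvečer')
      for word in doc.iter_words():
          # UPOS, XPOS (MULTEXT-East) and FEATS are produced simultaneously.
          print(word.text, word.upos, word.xpos, word.feats)
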
  • PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

    SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library (https://github.com/huggingface/transformers). The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397). The model was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.
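
    A minimal sketch of running such a checkpoint with transformers (the local path is a placeholder; whether the model is also published on the HuggingFace Hub is not stated in the record):

      from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                                pipeline)

      model_dir = './sloner'  # hypothetical path to the unpacked SloNER model
      tokenizer = AutoTokenizer.from_pretrained(model_dir)
      model = AutoModelForTokenClassification.from_pretrained(model_dir)

      # Aggregate word pieces into whole named-entity spans.
      ner = pipeline('token-classification', model=model, tokenizer=tokenizer,
                     aggregation_strategy='simple')
      print(ner('France Prešeren se je rodil v Vrbi na Gorenjskem.'))
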
  • The CLASSLA-Stanza model for lemmatisation of standard Slovenian 2.0

    This model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.11. The difference from the previous version is that the model was trained on the SUK training corpus and uses the new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).