Active filters:

  • Project: Development of Slovene in a Digital Environment
  • Keywords: lemmatisation
3 records found

Search results

  • The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2

    This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.
  • The CLASSLA-Stanza model for lemmatisation of standard Slovenian 2.0

    This model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.11. Compared with the previous version, this model was trained on the SUK training corpus and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
  • The CLASSLA-Stanza model for lemmatisation of non-standard Slovenian 2.1

    This model for lemmatisation of non-standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and the Janes-Tag corpus (http://hdl.handle.net/11356/1732), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). To handle missing diacritics, the training corpora were augmented by repeating parts of them with the diacritics removed. The estimated F1 of the lemma annotations is ~91.45. Compared with the previous version, this model was trained on the SUK training corpus and version 3.0 of Janes-Tag, and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
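
    The diacritic augmentation mentioned for the non-standard model can be illustrated with a minimal sketch. This is not the CLASSLA-Stanza training code: the function names strip_diacritics and augment, the (form, lemma) data layout, the sampling fraction, and the choice to leave lemmas unchanged are all assumptions made for illustration only.

```python
import random
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'č' -> 'c', 'š' -> 's', 'ž' -> 'z'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def augment(sentences, fraction=0.5, seed=0):
    """Return the corpus plus diacritic-free copies of a random sample.

    `sentences` is a list of sentences, each a list of (form, lemma) pairs.
    Assumption: only the surface forms are stripped; lemmas keep their
    diacritics so the model learns to map 'case' back to 'čaša'.
    """
    rng = random.Random(seed)
    sample_size = round(len(sentences) * fraction)
    sampled = rng.sample(sentences, sample_size)
    copies = [
        [(strip_diacritics(form), lemma) for form, lemma in sentence]
        for sentence in sampled
    ]
    return sentences + copies


if __name__ == "__main__":
    corpus = [[("čaše", "čaša")], [("miza", "miza")]]
    for sentence in augment(corpus, fraction=1.0):
        print(sentence)
```

    With fraction=1.0 every sentence is repeated once without diacritics; a smaller fraction repeats only part of the corpus, which is closer to the "repeating parts of the corpora" wording in the record above.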