Active filters:

  • Organisation: Jožef Stefan Institute
  • Project: Treebank-Driven Approach to the Study of Spoken Slovenian
6 records found

Search results

  • The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2

    This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.
  • The CLASSLA-Stanza model for UD dependency parsing of spoken Slovenian 2.2

    This model for UD dependency parsing of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~81.91.
  • The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.2

    This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~90.42. Unlike the previous version, this model was trained on the improved SUK 1.1 version of the training corpus.
  • The CLASSLA-Stanza model for morphosyntactic annotation of spoken Slovenian 2.2

    This model for morphosyntactic annotation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.76.
  • The CLASSLA-Stanza model for named entity recognition of standard Slovenian 2.2

    This model for named entity recognition of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl 2.0 word embeddings (http://hdl.handle.net/11356/1791). Unlike the previous version, this model was trained on the SUK training corpus and uses new embeddings.
  • Q-CAT Corpus Annotation Tool 1.5

    Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora that also supports advanced queries over these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), covering named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used to add new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application that runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CoNLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command line mode (filtering by sentence ID, multiple link type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CoNLL-U).
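
The Q-CAT 1.5 description says the audio location is given in a # sound_url comment line of the CoNLL-U file. As a rough illustration of what a script sees in such a file, the Python sketch below collects the sentence-level comment lines of each sentence and looks up sound_url. It assumes the usual CoNLL-U "# key = value" comment convention. The function name, the two sample sentences and the example.org URL are invented for illustration, and this is not Q-CAT's own code.

```python
from typing import Dict, List


def read_sentence_comments(conllu_text: str) -> List[Dict[str, str]]:
    """Return one dict of `# key = value` comments per sentence block."""
    sentences: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    has_content = False
    for raw in conllu_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence block.
            if has_content:
                sentences.append(current)
            current, has_content = {}, False
            continue
        has_content = True
        if line.startswith("#") and "=" in line:
            # Split on the first "=" only, so URLs containing "=" survive.
            key, value = line[1:].split("=", 1)
            current[key.strip()] = value.strip()
    if has_content:
        sentences.append(current)
    return sentences


# Made-up sample: only the first sentence carries a sound_url comment.
SAMPLE = """\
# sent_id = demo.1
# sound_url = https://example.org/audio/demo1.wav
# text = dober dan
1\tdober\tdober\tADJ\t_\t_\t2\tamod\t_\t_
2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\t_

# sent_id = demo.2
# text = hvala
1\thvala\thvala\tNOUN\t_\t_\t0\troot\t_\t_
"""

for comments in read_sentence_comments(SAMPLE):
    # .get() returns None for sentences without an audio link.
    print(comments["sent_id"], comments.get("sound_url"))
```

Sentences without a sound_url comment simply yield None here, so a caller can decide whether a missing recording is an error or just an unlinked sentence.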