Active filters:

  • Language: Slovenian
  • Project: Network of Research Infrastructure Centres (MRIC)
3 record(s) found

Search results

  • CORDEX inflectional lookup data 1.0

    The inflectional data lookup module is an optional component of the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas that maps each lemma to its corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic description, relevant features (custom features extracted from the morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. to find the most frequently used singular word form of the lemma "centralen", which is "centralni").
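The selection step described above can be illustrated with a minimal Python sketch. The dictionary layout, the feature names, the frequency counts, and the function `most_frequent_form` are all hypothetical and chosen for illustration only; the actual pickled dictionary and the cordex API may be organised differently.

```python
from typing import Optional

# Hypothetical toy layout: lemma -> list of word-form entries.
# The real pickled dictionary may use a different structure; the
# frequencies below are invented for illustration.
LOOKUP = {
    "centralen": [
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 500},
        {"form": "centralna", "features": {"number": "singular"}, "frequency": 300},
        {"form": "centralne", "features": {"number": "plural"}, "frequency": 900},
    ],
}


def most_frequent_form(lookup: dict, lemma: str, **required: str) -> Optional[str]:
    """Return the most frequent word form of `lemma` whose features
    match every key/value pair in `required`, or None if none match."""
    candidates = [
        entry
        for entry in lookup.get(lemma, [])
        if all(entry["features"].get(k) == v for k, v in required.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["frequency"])["form"]


singular_form = most_frequent_form(LOOKUP, "centralen", number="singular")
any_form = most_frequent_form(LOOKUP, "centralen")
```

With this toy data, the singular filter selects "centralni", while the unfiltered call returns the plural form with the highest (invented) count.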
  • Text classification model SloBERTa-Trendi-Topics 1.0

    The SloBERTa-Trendi-Topics model is a text classification model that assigns news texts one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. their URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
    The model was trained on the labeled texts using the SloBERTa 2.0 contextual embeddings model (http://hdl.handle.net/11356/1397; also available at HuggingFace: https://huggingface.co/EMBEDDIA/sloberta) and validated on a development set of 1,293 texts using the simpletransformers library with the following hyperparameters: train batch size 8, learning rate 1e-5, maximum sequence length 512, and 2 epochs. The model achieves a macro-F1 score of 0.94 on a test set of 1,295 texts (best for "črna kronika", "politika", "šport", and "vreme" at 0.98, worst for "prosti čas" at 0.83).
    Please note that the fastText-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1710) is also available; it is faster and computationally less demanding, but achieves lower classification accuracy.
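As a rough illustration, the hyperparameters listed above could be passed to simpletransformers as follows. This is a sketch under assumptions, not the authors' training script: the model type "camembert" is assumed because SloBERTa follows the CamemBERT architecture, the helper name `build_model` is hypothetical, and the library import is deferred so that the configuration can be inspected without simpletransformers installed.

```python
# Hyperparameters as listed in the record description.
TRAIN_ARGS = {
    "train_batch_size": 8,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "num_train_epochs": 2,
}

# The model distinguishes 13 topic labels.
NUM_LABELS = 13


def build_model(use_cuda: bool = False):
    """Create an untrained classifier on top of SloBERTa (hypothetical helper).

    Assumption: "camembert" is the matching simpletransformers model type
    for the EMBEDDIA/sloberta checkpoint.
    """
    # Imported lazily: requires `pip install simpletransformers`.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(
        "camembert",
        "EMBEDDIA/sloberta",
        num_labels=NUM_LABELS,
        args=TRAIN_ARGS,
        use_cuda=use_cuda,
    )
```

Calling `build_model().train_model(train_df)` would then fine-tune the classifier on a pandas DataFrame with a text column and an integer label column, which is the input format simpletransformers expects.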
  • Text classification model fastText-Trendi-Topics 1.0

    The fastText-Trendi-Topics model is a text classification model that assigns news texts one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. their URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
    The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library with 1,000 epochs and default values for the remaining hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code). The model achieves a macro-F1 score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67).
    Please note that the SloBERTa-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1709) is also available; it achieves higher classification accuracy, but is slower and computationally more demanding.
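The macro-F1 score reported for these models is the unweighted mean of the per-label F1 scores, so a weak label such as "prosti čas" pulls the average down regardless of how many texts carry it. The following pure-Python sketch shows the computation on invented toy labels; it is not the evaluation code used by the authors, and the function name `macro_f1` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or predicted."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with invented predictions.
gold = ["šport", "šport", "vreme", "vreme"]
pred = ["šport", "vreme", "vreme", "vreme"]
score = macro_f1(gold, pred)
```

In the toy example, "šport" has an F1 of 2/3 and "vreme" an F1 of 0.8, so the macro average is their mean.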