Result filters

Metadata provider

Language

Resource type

Availability

Organisation

  • Jožef Stefan Institute

Active filters:

  • Organisation: Jožef Stefan Institute
  • Language: Croatian
Loading...
18 record(s) found

Search results

  • The CLASSLA-StanfordNLP model for named entity recognition of non-standard Croatian 1.0

    This model for named entity recognition of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard Croatian 1.0

    This model for morphosyntactic annotation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1210), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.11.
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croatian

    The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.1.
  • The CLASSLA-StanfordNLP model for lemmatisation of non-standard Croatian 1.0

    The model for lemmatisation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1210), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~97.54.
  • The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

    The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1790). The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87. The difference to the previous version of the model is that this version was trained using the new version of the hr500k corpus and the new version of the Croatian word embeddings.
  • Multilingual text genre classification model X-GENRE

    The X-GENRE classifier is a text classification model that can be used for automatic genre identification. The model classifies texts to one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the provided README file for the details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and the zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages that are supported by the XLM-RoBERTa model (see Appendix in the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116). The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of an English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising of around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) and the following hyperparameters were used: Train batch size: 8 Learning rate: 1e-5 Max. sequence length: 512 Number of epochs: 15 For the optimum performance, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words), the predictions of label "Other" should be disregarded, and only predictions, predicted with confidence higher than 0.8, should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek, and Icelandic (zero-shot cross-lingual scenario). Refer to the provided README file for instructions with code examples on how to use the model.
  • The CLASSLA-StanfordNLP model for lemmatisation of non-standard Croatian 1.1

    The model for lemmatisation of non-standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/) and the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~97.54. The difference to the previous version of the lemmatizer is that now it relies solely on XPOS annotations, and not on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.
  • The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian

    The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). The estimated F1 of the lemma annotations is ~97.6.
  • The CLASSLA-Stanza model for UD dependency parsing of standard Croatian 2.1

    The model for UD dependency parsing of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of the hr500k training corpus (http://hdl.handle.net/11356/1792) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1790). The estimated LAS of the parser is ~87.46. The difference to the previous version of the model is that this version was trained using the new version of the hr500k corpus and the new version of the Croatian word embeddings.
  • The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.1

    The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). The estimated F1 of the lemma annotations is ~97.6. The difference to the previous version of the model is that it is trained with the lemmatiser padding bug removed, cf. https://github.com/stanfordnlp/stanfordnlp/issues/143.