703 record(s) found

Search results

  • ItAntDSL

    The bundle contains: 1. an ANTLR Lexer and Parser for a Domain-Specific Language named ItAntDSL, compliant with the EpiDoc conceptual model, to describe inscriptions in the languages of ancient Italy (in particular Venetic and Faliscan); 2. a Visitor to convert ItAntDSL into XML-ItAnt. The development of XSL(T) stylesheets to convert XML-ItAnt to XML-TEI/EpiDoc is in progress.
  • Models for automatic g2p for Icelandic (20.10)

    Grapheme-to-phoneme (g2p) models for Icelandic, trained with an encoder-decoder LSTM neural network. The models are delivered with scripts for automatic transcription of Icelandic in the standard pronunciation variation, the northern variation (harðmæli), the north-east variation (raddaður framburður), and the south variation (hv-framburður). To run the scripts, the user needs to install Fairseq (see the Readme in the project repository).
  • Text classification model SloBERTa-Trendi-Topics 1.0

    The SloBERTa-Trendi-Topics model is a text classification model that assigns news texts one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf The model was trained on the labeled texts using the SloBERTa 2.0 contextual embeddings model (http://hdl.handle.net/11356/1397; also available at HuggingFace: https://huggingface.co/EMBEDDIA/sloberta) and validated on a development set of 1,293 texts using the simpletransformers library with the following hyperparameters: train batch size: 8; learning rate: 1e-5; max. sequence length: 512; number of epochs: 2. The model achieves a macro-F1 score of 0.94 on a test set of 1,295 texts (best for "črna kronika", "politika", "šport", and "vreme" at 0.98, worst for "prosti čas" at 0.83).
    Please note that the fastText-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1710) is also available; it is faster and computationally less demanding, but achieves lower classification accuracy.
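    The macro-F1 score reported for this model is the unweighted mean of the per-label F1 scores, so a weak label such as "prosti čas" lowers the average as much as a frequent one would. The following is a minimal sketch of that computation in general, not the authors' evaluation script; the function name `macro_f1` and the toy label lists are illustrative assumptions.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Return the unweighted mean of per-label F1 scores.

    Hypothetical helper for illustration; labels are taken from the
    union of gold and predicted labels.
    """
    labels = sorted(set(gold) | set(predicted))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, not taken from the Trendi test set
gold = ["šport", "vreme", "šport", "politika"]
pred = ["šport", "vreme", "politika", "politika"]
score = macro_f1(gold, pred)
```

    In this toy case "vreme" scores 1.0 while "šport" and "politika" each score 2/3, which gives a macro average of 7/9.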
  • Corpus extraction tool LIST 1.2

    The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software. Version 1.2 adds support for Gigafida 2.0 in XML format and fixes a bug which disabled the extraction of character-level n-grams from normalized forms in the GOS 1.0 corpus.
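    To give an idea of what a character-level list looks like, here is a minimal sketch that counts character n-grams in a word list and writes them as CSV. It is written in Python for brevity and is not the LIST implementation (which is a Java program); the function names and the column headers `ngram` and `frequency` are assumptions.

```python
import csv
import io
from collections import Counter


def char_ngrams(words, n):
    """Count all character n-grams of length n inside each word."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def to_csv(counts):
    """Serialize counts as CSV text, most frequent n-grams first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ngram", "frequency"])  # hypothetical header names
    for ngram, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([ngram, freq])
    return buffer.getvalue()


counts = char_ngrams(["beseda", "besede"], 2)
csv_text = to_csv(counts)
```

    The resulting text can be saved with a .csv extension and opened in a spreadsheet program.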
  • Q-CAT Corpus Annotation Tool 1.5

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application that runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CONLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CONLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command line mode (filtering by sentence ID, multiple link type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CONLL-U).
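    As an illustration of where the audio link lives, the sketch below reads sentence-level comment lines from CONLL-U text and picks out `sound_url`. It assumes the comment follows the usual CONLL-U `# key = value` convention; the exact value format Q-CAT expects is not stated here, and the function name and example URL are hypothetical.

```python
def read_sentences(conllu_text):
    """Split CONLL-U text into (metadata, token rows) pairs.

    Comment lines of the form '# key = value' become metadata entries;
    all other non-empty lines are treated as tab-separated token rows.
    """
    sentences = []
    for block in conllu_text.strip().split("\n\n"):
        meta, tokens = {}, []
        for line in block.splitlines():
            if line.startswith("#"):
                key, sep, value = line[1:].partition("=")
                if sep:  # ignore comments that are not key = value pairs
                    meta[key.strip()] = value.strip()
            elif line.strip():
                tokens.append(line.split("\t"))
        sentences.append((meta, tokens))
    return sentences


# Made-up sentence with a hypothetical audio URL
sample = (
    "# sent_id = s1\n"
    "# sound_url = https://example.org/audio/s1.wav\n"
    "1\tDober\tdober\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
meta, tokens = read_sentences(sample)[0]
```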
  • ELMo embeddings model, Slovenian

    ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on the entire Gigafida 2.0 corpus (https://viri.cjvt.si/gigafida/System/Impressum) for 10 epochs. The 1,364,064 most common tokens were provided as the vocabulary during training. The model can also infer embeddings for out-of-vocabulary (OOV) words, since the neural network input is on the character level.
  • Lithuanian font family AISTIKA

    Original TrueType font designed and hinted in Lithuania. The font complies with the ISO/IEC 10646 (Unicode) standard and has the full set of regular and accented Lithuanian characters (e.g., į̃, ū̃, r̃, ė́, etc.). All the specific Lithuanian accented letters are encoded in the Private Use Area and are also available through pre-built composition sequences. The font also contains the main signs of Lithuanian heraldry as well as transcription signs and transliteration marks for Arabic, Indian, and other languages. The font family is available in roman, bold, italic, and bold italic variants. UAB "Fotonija" grants its font family Aistika under a dual licence: the permissive CLARIN-LT LICENCE (PUB) or the SIL Open Font License. This statement has been issued by the general director of UAB "Fotonija", Arūnas Samuilis, and is valid from March 25, 2022 onwards.
  • Icelandic Gigaword Corpus JSONL Converter

    Icelandic Gigaword Corpus JSONL Converter is a tool for converting the unannotated version of the Icelandic Gigaword Corpus (IGC; http://hdl.handle.net/20.500.12537/253) to JSONL format. The converter takes the original XML files from the IGC and converts them to JSONL, adding information on each subcorpus' quality and domain, which is obtained from an attached file created by the Árni Magnússon Institute for Icelandic Studies. For further information on the output format, see the attached README.
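    The general shape of such a conversion can be sketched as follows: read the text from an XML document, attach the subcorpus metadata, and emit one JSON object per line. This is not the converter's code. The simplified XML layout (plain `<p>` elements without namespaces), the field names `text`, `subcorpus`, `quality` and `domain`, and the metadata values are all assumptions; the actual output format is described in the tool's README.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_jsonl_line(xml_string, subcorpus, subcorpus_info):
    """Convert one XML document to a single JSONL line.

    subcorpus_info is a hypothetical lookup table standing in for the
    attached quality/domain file.
    """
    root = ET.fromstring(xml_string)
    # Collect paragraph text and normalize internal whitespace
    paragraphs = [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]
    info = subcorpus_info.get(subcorpus, {})
    record = {
        "text": " ".join(paragraphs),
        "subcorpus": subcorpus,
        "quality": info.get("quality"),
        "domain": info.get("domain"),
    }
    # ensure_ascii=False keeps Icelandic characters readable in the output
    return json.dumps(record, ensure_ascii=False)


# Made-up input and metadata for illustration only
sample_xml = "<doc><p>Fyrsta  málsgrein.</p><p>Önnur málsgrein.</p></doc>"
sample_info = {"news": {"quality": "A", "domain": "news"}}
line = xml_to_jsonl_line(sample_xml, "news", sample_info)
```

    Writing one such line per document to a file yields JSONL, which can then be read back line by line with `json.loads`.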
  • AnySoftKeyboard with custom autocompletion 22.10

    This is a fork of the open source Android keyboard AnySoftKeyboard. This version contains a new autocompleter module based on finite-state transducers (FST) as implemented in the Apache Lucene library. The autocompleter uses a bigram list from the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) to offer next-word suggestions from the very first use, and not only after the user has used the keyboard for a certain amount of time, as in the original keyboard. This version still learns from the user, however: the original list is enhanced with usage data and frequently used combinations are boosted, while the size of the list stays the same.
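    The idea of seeding next-word suggestions from a bigram list and then boosting pairs the user actually types can be sketched as below. Plain Python dictionaries stand in for the Lucene FST used by the keyboard, so this shows the ranking idea only, not the actual implementation; the class name, the `user_boost` parameter and the counts are hypothetical.

```python
from collections import defaultdict


class BigramSuggester:
    """Next-word suggestions from bigram counts, boosted by usage."""

    def __init__(self, bigram_counts, user_boost=5):
        # Map each first word to a dict of {following word: score}
        self.counts = defaultdict(dict)
        for (first, second), count in bigram_counts.items():
            self.counts[first][second] = count
        self.user_boost = user_boost  # assumed weight for one user selection

    def suggest(self, previous_word, k=3):
        """Return up to k followers of previous_word, best score first."""
        followers = self.counts.get(previous_word, {})
        ranked = sorted(followers.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]

    def learn(self, previous_word, chosen_word):
        """Boost a pair the user typed; unseen pairs are added."""
        followers = self.counts[previous_word]
        followers[chosen_word] = followers.get(chosen_word, 0) + self.user_boost


# Made-up counts, not taken from the corpus
suggester = BigramSuggester({
    ("góðan", "dag"): 10,
    ("góðan", "daginn"): 8,
    ("góða", "nótt"): 4,
})
before = suggester.suggest("góðan")
suggester.learn("góðan", "daginn")
after = suggester.suggest("góðan")
```

    Suggestions are available immediately from the seeded counts, and a single selection of "daginn" is enough here to move it ahead of "dag". Unlike the keyboard described above, this sketch lets the list grow when unseen pairs are learned.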
  • IceEval - Icelandic Natural Language Processing Benchmark 22.09

    IceEval is a benchmark for evaluating and comparing the quality of pre-trained language models. The models are evaluated on a selection of four NLP tasks for Icelandic: part-of-speech tagging (using the MIM-GOLD corpus), named entity recognition (using the MIM-GOLD-NER corpus), dependency parsing (using the IcePaHC-UD corpus) and automatic text summarization (using the IceSum corpus). IceEval includes scripts for downloading the datasets, splitting them into training, validation and test sets, and training and evaluating models for each task. The benchmark uses the Transformers, DiaParser and TransformerSum libraries for fine-tuning and evaluation.