Active filters:

  • Language: Icelandic
  • Availability: Share-alike
8 record(s) found

Search results

  • PICCL: Philosophical Integrator of Computational and Corpus Libraries

    PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and Natural Language Processing. It combines Tesseract optical character recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract is open-source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often the word occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. Errors occur when a machine scans books or other texts from paper and turns the scans, i.e. images, into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
    Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
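    The frequency-list idea described for TICCL can be illustrated with a small sketch. This is not TICCL's code: the variant table below is a hypothetical, hand-written mapping, whereas TICCL induces variants from the corpus itself. The sketch only shows how the frequency of a normalized form is obtained by summing the frequencies of the forms actually observed.

```python
from collections import Counter

# Hypothetical mapping from an observed variant to its normalized form.
# TICCL derives such variants from the corpus; here it is hard-coded.
VARIANT_TO_NORMALIZED = {"regeermg": "regeering"}


def word_frequencies(tokens):
    """Count how often each (lower-cased) word type occurs."""
    return Counter(token.lower() for token in tokens)


def normalized_frequencies(frequencies, variant_map):
    """Sum the frequencies of all observed forms under their normalized form."""
    totals = Counter()
    for form, count in frequencies.items():
        totals[variant_map.get(form, form)] += count
    return totals


# Toy corpus containing the OCR error mentioned in the description.
corpus = "de regeering en de regeermg en de Regeering".split()
raw = word_frequencies(corpus)
normalized = normalized_frequencies(raw, VARIANT_TO_NORMALIZED)
```

    In this toy corpus the misread form is counted separately in `raw` and folded into `regeering` in `normalized`.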
  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences (22.01)

    The Icelandic Error Corpus (IEC) was used to fine-tune the Icelandic language model IceBERT for sentence classification. The objective was to train grammatical error detection models that classify whether a sentence contains a particular error type. The model can mark a sentence as containing one or more of the following issue types: coherence, grammar, orthography, other, style and vocabulary. The overall F1 score is a modest 64%.
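    Because a sentence can carry several issue types at once, the output is multilabel rather than a single class. The sketch below shows one common way to turn per-label scores into a set of labels. The label names come from the description above; the 0.5 threshold and the example scores are assumptions for illustration and are not taken from the released model.

```python
# Label names as listed in the record description.
LABELS = ["coherence", "grammar", "orthography", "other", "style", "vocabulary"]


def decode_multilabel(scores, threshold=0.5):
    """Return every label whose score reaches the threshold.

    The threshold value is an assumption; the record does not state one.
    """
    if len(scores) != len(LABELS):
        raise ValueError("expected one score per label")
    return [label for label, score in zip(LABELS, scores) if score >= threshold]


# Made-up scores for a single sentence, one per label.
example_scores = [0.10, 0.85, 0.62, 0.05, 0.30, 0.49]
flagged = decode_multilabel(example_scores)
```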
  • Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Dependencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
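    UDPipe writes its analyses in the CoNLL-U format, with ten tab-separated columns per token. The following sketch reads such output into simple records. It is a minimal reader rather than a full CoNLL-U implementation, and the sample sentence with its annotation is hand-written for illustration, not actual model output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (illustrative annotation, not model output).
SAMPLE = "\n".join([
    "# text = Hún las bókina.",
    "\t".join(["1", "Hún", "hún", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "las", "lesa", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "bókina", "bók", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
```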
  • Multilingual text genre classification model X-GENRE

    The X-GENRE classifier is a text classification model for automatic genre identification. It assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the provided README file for details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages supported by the XLM-RoBERTa model (see the Appendix of the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116). The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of an English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) using the following hyperparameters: train batch size 8, learning rate 1e-5, maximum sequence length 512, and 15 epochs. For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek, and Icelandic (zero-shot cross-lingual scenario). Refer to the provided README file for instructions with code examples on how to use the model.
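    The three post-processing rules recommended for X-GENRE (minimum length, dropping "Other", and a confidence cut-off) can be expressed as a small filter. This is a sketch under assumptions: predictions are represented as hypothetical (text, label, confidence) tuples, and words are counted by whitespace splitting, which the record does not specify. It does not call the model itself; the README remains the reference for that.

```python
MIN_WORDS = 75          # "at least 75 words" (rule of thumb from the record)
MIN_CONFIDENCE = 0.8    # keep only confidence strictly higher than 0.8
IGNORED_LABEL = "Other"


def keep_prediction(text, label, confidence):
    """Apply the three recommended post-processing rules to one prediction."""
    # Whitespace tokenization is an assumption made for this sketch.
    long_enough = len(text.split()) >= MIN_WORDS
    return long_enough and label != IGNORED_LABEL and confidence > MIN_CONFIDENCE


def filter_predictions(predictions):
    """Keep only the (text, label, confidence) tuples that pass all rules."""
    return [
        (text, label, confidence)
        for text, label, confidence in predictions
        if keep_prediction(text, label, confidence)
    ]


# Made-up predictions for illustration.
long_text = "orð " * 80
short_text = "orð " * 10
predictions = [
    (long_text, "News", 0.93),    # kept
    (short_text, "News", 0.93),   # too short
    (long_text, "Other", 0.99),   # label "Other" is disregarded
    (long_text, "Legal", 0.80),   # confidence not higher than 0.8
]
kept = filter_predictions(predictions)
```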
  • Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • Skiptir (20.10)

    A simple command-line tool that uses Pyphen (https://pyphen.org) to hyphenate text according to the newest hyphenation patterns from the Icelandic Hyphenation Dictionary (http://hdl.handle.net/20.500.12537/86). Can also be used as a module in Python.
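    Pattern-based hyphenation of the kind Pyphen performs follows Liang's approach: patterns carry digits between letters, the highest digit at each position wins, and odd values permit a break while even values forbid one. The sketch below implements that idea in plain Python. It is not Skiptir's or Pyphen's code, and the two patterns are invented for the example; they are not taken from the Icelandic Hyphenation Dictionary, so the output is not linguistically authoritative.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'a1s' into letters 'as' and digits [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            digits[position] = int(char)
        else:
            position += 1
    return letters, digits


def hyphenate(word, patterns, left_min=2, right_min=2, hyphen="-"):
    """Insert hyphens where the highest matching pattern value is odd."""
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[k]: gap before padded[k]
    for letters, digits in patterns:
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(digits):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for index in range(left_min, len(word) - right_min + 1):
        # The gap before word[index] is scores[index + 1] (leading dot offset).
        if scores[index + 1] % 2 == 1:
            pieces.append(word[last:index])
            last = index
    pieces.append(word[last:])
    return hyphen.join(pieces)


# Invented toy patterns, for illustration only.
PATTERNS = [parse_pattern(p) for p in ["ó1k", "a1s"]]
result = hyphenate("bókasafn", PATTERNS)
```

    The `left_min` and `right_min` parameters are assumptions of this sketch; they keep very short fragments from being split off at either end of the word.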
  • Optimized Long Context Translation Models for English-Icelandic translations (22.09)

    These models are optimized versions of the translation models released in http://hdl.handle.net/20.500.12537/278. Instead of the 24 layers used in the full models, they have been reduced to 7 layers. The computational resources required to run inference are thus significantly lower than for the original models. Performance is comparable to the original models when evaluated on general topics such as news, but for expert knowledge from the training data (e.g. EEA regulations) the original models are more capable. The models translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors, such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and fine-tuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260). The full backtranslation data includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (a university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible, and the Jehovah's Witnesses corpus (JW300). Provided here are model files, a SentencePiece subword-tokenizing model and dictionary files for running the models locally, along with scripts for translating sentences on the command line. See the included README for instructions on running inference.