Result filters

Active filters:

  • Language: Icelandic
  • Availability: Public
28 records found

Search results

  • PICCL: Philosophical Integrator of Computational and Corpus Libraries

    PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract optical character recognition, TICCL and Frog in a single pipeline. Tesseract is open-source software for optical character recognition.

    TICCL (Text-Induced Corpus Clean-up) is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus; this corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, recording for each word type how often it occurs in the corpus; the frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. On this basis, TICCL detects and corrects typographical errors (misprints) and OCR errors in texts. When books or other texts are scanned from paper and the scanned images are turned into digital text files, errors occur: for instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. A toy illustration of this kind of frequency-based variant detection appears after the results list below.

    Frog enriches textual documents with various linguistic annotations.
    Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
  • Multilabel Error Classifier (Icelandic Error Corpus categories) for Sentences (22.01)

    The Icelandic Error Corpus (IEC) was used to fine-tune the Icelandic language model IceBERT for sentence classification. The objective was to train grammatical error detection models that can classify whether a sentence contains a particular error type. The model can mark sentences as including one or more of the following issues: coherence, grammar, orthography, other, style and vocabulary. The overall F1 score is a modest 64%. A hedged sketch of querying such a multilabel classifier appears after the results list below.
  • Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of the Universal Dependencies 2.10 treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation, including performance figures, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 . A hedged sketch of calling the UDPipe 2 web service with one of these models appears after the results list below.
  • Multilingual text genre classification model X-GENRE

    The X-GENRE classifier is a text classification model for automatic genre identification. The model classifies texts into one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the provided README file for details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages supported by the XLM-RoBERTa model (see the Appendix of the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116).

    The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of the English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) using the following hyperparameters: train batch size 8, learning rate 1e-5, max. sequence length 512, number of epochs 15.

    For optimal performance, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on the English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek and Icelandic (zero-shot cross-lingual scenario). Refer to the provided README file for instructions with code examples on how to use the model; a hedged sketch of the post-processing steps also appears after the results list below.
  • Long Context Translation Models for English-Icelandic translations (22.09)

    These models translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and fine-tuned on bilingual EN-IS data and back-translated data (including http://hdl.handle.net/20.500.12537/260).

    The full back-translation data includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and the Jehovah's Witnesses corpus (JW300).

    Provided here are model files, a SentencePiece subword-tokenizing model and dictionary files for running the model locally, along with scripts for translating sentences on the command line. We refer to the included README for instructions on running inference.
  • RÚV-DI Speaker Diarization v5 models (21.05)

    This archive contains files generated from the recipe in kaldi-speaker-diarization/v5/. Its contents should be placed in a similar directory layout, with symbolic links to diarization/, sid/, steps/, etc. It was created when Kaldi's master branch was at git commit 321d3959dabf667ea73cc98881400614308ccbbb.

    v5: These models are trained on the Althingi Parliamentary Speech corpus available on malfong.is. The recipe uses MFCCs, x-vectors, PLDA and AHC, and uses the Icelandic RÚV-DI corpus as two held-out sets for tuning parameters. The RÚV-DI corpus is currently not publicly available. A toy sketch of the final clustering stage (AHC over speaker embeddings) appears after the results list below.
  • Semi-supervised Icelandic-Polish Translation System (22.09)

    This Icelandic-Polish translation model (bi-directional) was trained with fairseq (https://github.com/facebookresearch/fairseq) by means of semi-supervised training, starting from the mBART50 model. The model was trained with a multi-task curriculum: it first learned to denoise sentences, was then trained to translate using aligned parallel texts, and was finally provided with monolingual texts in both Icelandic and Polish from which it iteratively created back-translations. For the PL-IS direction the model achieves a BLEU score of 27.60 on held-out true parallel data and 15.30 on the out-of-domain Flores devset. For the IS-PL direction the model achieves 27.70 on the true parallel data and 13.30 on the Flores devset.
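
Illustrative code sketches

The following Python sketches relate to individual records above. They are illustrative only: model names, paths and thresholds that do not appear in the records are assumptions, and none of the sketches reproduces the actual implementation of the tools they describe.

For the PICCL/TICCL record: a toy illustration of frequency-based OCR-variant detection. TICCL itself works differently (it uses anagram hashing over character confusions); this sketch only shows the general idea of building a word frequency list and mapping rare forms to frequent forms within a small edit distance.

    # Toy illustration only; this is NOT the TICCL algorithm.
    from collections import Counter
    from difflib import SequenceMatcher

    def frequency_list(tokens):
        """Count how often each word type occurs in the corpus."""
        return Counter(tokens)

    def suggest_correction(word, freqs, min_ratio=0.8, min_freq=5):
        """Map a rare, possibly garbled word to a close, frequent form."""
        if freqs[word] >= min_freq:
            return word  # frequent enough, assume it is correct
        best, best_ratio = word, 0.0
        for candidate, freq in freqs.items():
            if freq < min_freq or candidate == word:
                continue
            ratio = SequenceMatcher(None, word, candidate).ratio()
            if ratio >= min_ratio and ratio > best_ratio:
                best, best_ratio = candidate, ratio
        return best

    corpus = ("de regeering besloot " * 6 + "de regeermg besloot").split()
    freqs = frequency_list(corpus)
    print(suggest_correction("regeermg", freqs))  # -> 'regeering'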
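For the multilabel error-classifier record: a hedged sketch of querying a fine-tuned multilabel sentence classifier with Hugging Face transformers. The model path, the label order and the 0.5 decision threshold are assumptions; the resource's own documentation is authoritative.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_PATH = "path/or/hub-id/of/the/fine-tuned-model"  # hypothetical placeholder
    LABELS = ["coherence", "grammar", "orthography", "other", "style", "vocabulary"]

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

    def error_labels(sentence, threshold=0.5):
        """Return the error categories whose sigmoid score exceeds the threshold."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits.squeeze(0)
        scores = torch.sigmoid(logits)  # multilabel: one independent sigmoid per label
        return [label for label, score in zip(LABELS, scores) if score >= threshold]

    print(error_labels("Dæmi um setningu með hugsanlegri villu."))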
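For the UDPipe 2 record: a hedged sketch of calling the public UDPipe REST service with one of the UD 2.10 models. The service URL and parameter names follow the UDPipe REST API; the Icelandic model name below is a guess and should be checked against the model list linked in the record. To run the models locally instead, follow the UDPipe 2 instructions.

    import requests

    UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
    MODEL = "icelandic-modern-ud-2.10-220711"  # assumed name, verify in the model list

    def parse(text, model=MODEL):
        """Tokenize, tag, lemmatize and parse text; returns CoNLL-U output."""
        response = requests.post(
            UDPIPE_URL,
            data={"data": text, "model": model,
                  "tokenizer": "", "tagger": "", "parser": ""},
        )
        response.raise_for_status()
        return response.json()["result"]

    print(parse("Þetta er stutt íslensk setning."))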
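For the X-GENRE record: a hedged sketch of applying a genre classifier together with the post-processing steps described in the record (at least roughly 75 words, discard "Other", keep only predictions with confidence above 0.8). The model identifier is a placeholder; take the real one from the resource's README.

    from transformers import pipeline

    MODEL_PATH = "path/or/hub-id/of/the/X-GENRE-classifier"  # hypothetical placeholder
    classifier = pipeline("text-classification", model=MODEL_PATH)

    def genre(document, min_words=75, min_confidence=0.8):
        """Return the predicted genre label, or None if the prediction is unreliable."""
        if len(document.split()) < min_words:
            return None  # too short for a reliable genre prediction
        prediction = classifier(document, truncation=True)[0]
        if prediction["label"] == "Other" or prediction["score"] < min_confidence:
            return None
        return prediction["label"]

    print(genre("A sufficiently long document to classify ..."))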
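For the RÚV-DI diarization record: a toy sketch of the final clustering stage, agglomerative hierarchical clustering (AHC) over per-segment speaker embeddings, using scipy. The actual recipe is a Kaldi pipeline; cosine distance is used here only as a simple stand-in for PLDA scoring.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def cluster_segments(xvectors, distance_threshold=0.5):
        """Group per-segment embeddings into speakers with agglomerative clustering."""
        distances = pdist(xvectors, metric="cosine")   # stand-in for PLDA scores
        tree = linkage(distances, method="average")    # agglomerative clustering (AHC)
        return fcluster(tree, t=distance_threshold, criterion="distance")

    # Toy example: the first two segments point in a similar direction, the third does not.
    xvectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
    print(cluster_segments(xvectors))  # e.g. [1 1 2]: first two segments share a speaker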