Search results

703 record(s) found

  • Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

    This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC fine-tuning recipe (for details see the official NVIDIA NeMo ASR documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The starting point was the Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0, which was fine-tuned on the Protoverb closed dataset. The model was fine-tuned for 20 epochs, which improved performance by 9.8% relative WER on the Protoverb test dataset and by 3.3% relative WER on the Slobench dataset.
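To make the reported figures concrete: a *relative* WER improvement measures what fraction of the baseline error rate was removed, not an absolute percentage-point drop. The sketch below illustrates this; the 20% baseline WER is a hypothetical value chosen for illustration, since the record does not state absolute WER numbers.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate removed by fine-tuning."""
    return (baseline_wer - new_wer) / baseline_wer

# A 9.8% relative reduction from a hypothetical 20.0% baseline leaves
# an absolute WER of 20.0 * (1 - 0.098) = 18.04%.
baseline = 20.0
improved = baseline * (1 - 0.098)
print(round(relative_wer_reduction(baseline, improved), 3))  # 0.098
```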
  • Visible Vowels (2017-05-29)

    This program enables the user to visualize f0 contours, to plot vowels in the F1/F2 space at multiple points in the vowel interval (e.g. at 20%, 50% and 80% of its duration), and to visualize vowel durations. (The tool is implemented in R. We used the following packages: phonR, gplots, plotrix, lattice, readxl, WriteXLS, DT, psych and pracma. We thank the developers of these packages.)
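Sampling a formant track at fixed fractions of the vowel interval, as described above, amounts to linear interpolation over the measured points. This is only an illustrative sketch in Python (the tool itself is written in R), and the F2 values are hypothetical:

```python
def sample_at_fractions(track, fractions=(0.2, 0.5, 0.8)):
    """Linearly interpolate a formant track (equally spaced measurements
    spanning the vowel interval) at the given fractions of its duration."""
    n = len(track) - 1
    values = []
    for f in fractions:
        pos = f * n          # fractional index into the track
        i = int(pos)
        if i >= n:
            values.append(track[-1])
        else:
            t = pos - i      # interpolation weight between samples i and i+1
            values.append(track[i] * (1 - t) + track[i + 1] * t)
    return values

# Hypothetical F2 track (Hz) measured at 5 equally spaced points:
print([round(v) for v in sample_at_fractions([1700, 1750, 1800, 1850, 1900])])
# [1740, 1800, 1860]
```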
  • TTS Document Reader (22.10)

    This project contains the Skjalalestur web application, where users can upload a text document and receive an .mp3 file of the document read aloud by a TTS (text-to-speech) system. To set up the application you need access to a TTS service; if the software is used unmodified, that service must expose the same API as the TTS service developed within the Icelandic Language Technology Programme. The application is written in Ruby on Rails.
  • WordnetLoom 1.68.2

    WordnetLoom is a wordnet editor application built for the construction of plWordNet, the largest Polish wordnet. WordnetLoom provides two means of interaction: a form-based interface, implemented initially, and a visual, graph-based one introduced more recently. The visual, graph-based presentation of the wordnet structure enables browsing and direct editing of lexico-semantic relations and synsets. WordnetLoom works in a distributed environment, i.e. several linguists can work simultaneously on the same central database from different sites.
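A wordnet of the kind WordnetLoom edits is essentially a graph of synsets (sets of synonymous lemmas) connected by lexico-semantic relations. The minimal structure below is only illustrative (it is not WordnetLoom's actual data model), with toy Polish entries:

```python
from collections import defaultdict

class MiniWordnet:
    """Toy synset graph: synsets hold lemmas; relations link synset ids."""

    def __init__(self):
        self.synsets = {}                  # synset id -> set of lemmas
        self.relations = defaultdict(set)  # (source id, relation) -> target ids

    def add_synset(self, sid, lemmas):
        self.synsets[sid] = set(lemmas)

    def relate(self, src, relation, dst):
        self.relations[(src, relation)].add(dst)

    def hypernyms(self, sid):
        return self.relations[(sid, "hypernym")]

wn = MiniWordnet()
wn.add_synset("pies-1", {"pies"})        # Polish: 'dog'
wn.add_synset("zwierzę-1", {"zwierzę"})  # Polish: 'animal'
wn.relate("pies-1", "hypernym", "zwierzę-1")
print(wn.hypernyms("pies-1"))  # {'zwierzę-1'}
```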
  • Yfirlestur 1.0.0 (22.06)

    Yfirlestur.is is a public website where you can enter or submit your Icelandic text and have it checked for spelling and grammar errors. The tool also gives hints on words and structures that might not be appropriate, depending on the intended audience for the text. The core spelling and grammar checking functionality of Yfirlestur.is is provided by the GreynirCorrect engine, by the same authors. This software is licensed under the MIT License. More information at https://github.com/mideind/Yfirlestur.
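The idea of flagging a misspelled token and suggesting a correction can be sketched in a few lines. This toy example only illustrates the concept with edit-distance matching against a tiny word list; GreynirCorrect itself performs full context-aware spelling and grammar correction for Icelandic and works very differently:

```python
import difflib

# Toy Icelandic word list; a real checker would use a full lexicon.
WORDS = {"þetta", "er", "góður", "texti"}

def check(tokens):
    """Flag unknown tokens and suggest the closest known word, if any."""
    issues = []
    for tok in tokens:
        if tok not in WORDS:
            close = difflib.get_close_matches(tok, WORDS, n=1)
            issues.append((tok, close[0] if close else None))
    return issues

print(check(["þetta", "er", "góðr", "texti"]))  # [('góðr', 'góður')]
```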
  • Icelandic GPT-SW3 for spell and grammar checking

    Icelandic GPT-SW3 for spell and grammar checking is a GPT-SW3 model fine-tuned on Icelandic, and particularly on the spell and grammar checking task. The 6.7B-parameter GPT-SW3 model (https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b) was pre-trained on Icelandic texts and fine-tuned on Icelandic error corpora. Texts for pre-training included texts from the Icelandic Gigaword Corpus (http://hdl.handle.net/20.500.12537/253) and MÍM (http://hdl.handle.net/20.500.12537/195). For fine-tuning, the following Icelandic error corpora were used: the Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/105), the Icelandic L2 Error Corpus (http://hdl.handle.net/20.500.12537/280), the Icelandic Dyslexia Error Corpus (http://hdl.handle.net/20.500.12537/281), and the Icelandic Child Language Error Corpus (http://hdl.handle.net/20.500.12537/133). The model is fine-tuned on three different tasks:
    - Task 1: The model evaluates one text with regard to e.g. grammar and spelling, and returns all errors in the input text as a list, with their positions in the text and their corrections.
    - Task 2: The model evaluates two texts and chooses which one is better with regard to e.g. grammar and spelling.
    - Task 3: The model evaluates one text with regard to e.g. grammar and spelling, and returns a corrected version of the text.
    For task 1, the model delivers a 0.28 F0.5 score on the Grammatical Error Correction Test Set (http://hdl.handle.net/20.500.12537/320), and for task 2 it delivers a 63.95% accuracy score on the same test set. For task 3, the model scores 0.925559 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.02 in TER (translation error rate).
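The F0.5 metric used for task 1 is the general F-beta score with beta = 0.5, which weights precision twice as heavily as recall, the usual choice in grammatical error correction. The precision/recall pair below is hypothetical, chosen only to show how the formula yields a value near the reported 0.28:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta=0.5 favors precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical precision/recall; the record reports only the final F0.5.
print(round(f_beta(0.30, 0.22), 3))  # 0.28
```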
  • Talrómur Utils

    This is a collection of utilities for text-to-speech (TTS) development using the Talrómur corpus. The collection includes:
    - alignments for all the voices in Talrómur, created with the Montreal Forced Aligner
    - train, evaluation and test splits for all the voices in Talrómur
    - two baseline TTS models and vocoder models
    Together, the package contains everything needed to develop and run TTS voices built with Talrómur.
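Fixed train/evaluation/test splits, as shipped here, keep experiments comparable across users. One common way to produce such splits reproducibly (illustrative only, not necessarily how the Talrómur splits were made) is to hash each utterance id, so every run assigns the same bucket:

```python
import hashlib

def split_of(utt_id: str, train: float = 0.8, dev: float = 0.1) -> str:
    """Deterministically map an utterance id to train/dev/test by hashing it."""
    bucket = int(hashlib.md5(utt_id.encode()).hexdigest(), 16) % 1000 / 1000
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"

# Hypothetical utterance ids; the real corpus uses its own naming scheme.
ids = [f"talromur_f_{i:04d}" for i in range(10)]
print({u: split_of(u) for u in ids})
```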
  • Alexia: Lexicon Acquisition Tool for Icelandic (Orðtökutól) 1.0

    The purpose of the lexicon acquisition tool is to facilitate the development and expansion of online dictionaries and glossaries, particularly the Database of Modern Icelandic Inflection (DMII/BÍN) and ISLEX. The tool is designed around the Icelandic Gigaword Corpus (IGC) and the information contained within its TEI-formatted documents; that is, it performs best when using the part-of-speech tags, lemmas and word forms defined in the IGC. The tool can, however, take any corpus as input that uses either the same TEI format as the IGC or a plain text format, depending on the user's preference. The output files, examples of which are included, are the following:
    - Frequency per word form, with no extra information added. Useful for picking candidates for the online dictionaries and glossaries.
    - Frequency per lemma, with no extra information added. Useful for picking candidates for the online dictionaries and glossaries.
    - Frequency per word form, including all possible lemmas for the given word form. Shows whether the word form can belong to more than one word class, and whether the automatic lemmatization is working correctly.
    - Frequency per lemma, including all word forms attested for the given lemma. Useful for examining whether a certain word form appears much more or less frequently than the others, and thus whether it is used only as part of a certain expression.
    - Frequency per lemma, including information on the types of text in which the lemma appears (the frequency for each individual text type can also be examined in descending order). Facilitates the creation of a specialized glossary (e.g. a glossary of sport-related words).
    Also included is a list of approximately 60 thousand stop words, manually collected from the IGC. These include foreign words, typos, misspelled words, lemmatization errors and acronyms.
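The frequency outputs listed above can be sketched with counters over tagged tokens. The sketch assumes tokens are available as (word form, lemma) pairs with toy Icelandic data; the real tool extracts this information from the IGC's TEI documents:

```python
from collections import Counter, defaultdict

# Toy tagged tokens: (word form, lemma). 'hestur' = horse, 'fara' = to go.
tokens = [("hestur", "hestur"), ("hesti", "hestur"),
          ("hestar", "hestur"), ("fer", "fara")]

form_freq = Counter(form for form, _ in tokens)    # frequency per word form
lemma_freq = Counter(lemma for _, lemma in tokens) # frequency per lemma

forms_per_lemma = defaultdict(Counter)             # lemma -> its word forms
for form, lemma in tokens:
    forms_per_lemma[lemma][form] += 1

print(lemma_freq["hestur"])              # 3
print(dict(forms_per_lemma["hestur"]))   # {'hestur': 1, 'hesti': 1, 'hestar': 1}
```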
  • Slavic Forest, Norwegian Wood (scripts)

    Tools and scripts used to create the cross-lingual parsing models submitted to the VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971). For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following to this directory:
    - treebanks (Universal Dependencies v1.4): SS-ud-train.conllu, TT-ud-predPoS-dev.conllu
    - parallel data (OpenSubtitles from OPUS): OpenSubtitles2016.SS-TT.SS, OpenSubtitles2016.SS-TT.TT (if the files are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink, move, or copy them)
    - target tagging model: TT.tagger.udpipe
    All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017. You also need:
    - Bash, Perl 5 and Python 3
    - word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15 Sep 2014
    - UDPipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3 Jan 2017
    - Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21 Dec 2016
    The most basic setup is the sl-hr one (train_sl-hr.sh):
    - normalization of deprels
    - 1:1 word alignment of the parallel data with the Monolingual Greedy Aligner
    - simple word-by-word translation of the source treebank
    - pre-training of target word embeddings
    - simplification of morphological features (only Case is kept)
    - and finally, training and evaluating the parser
    Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see the paper for details). In addition, cs-sk adds more morphological features, selecting those that are very often shared in the parallel data. The whole pipeline takes tens of hours to run and uses several GB of RAM, so make sure to use a powerful machine.
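The "simple word-by-word translation of the source treebank" step mentioned above can be sketched as a lexicon lookup: each source token is replaced by its most frequent aligned target word, and tokens without an entry are left unchanged. The two-entry Slovene-Croatian lexicon here is a toy stand-in, not the output of the real alignment step:

```python
# Toy sl->hr lexicon; the pipeline derives its lexicon from 1:1 word
# alignments of the OpenSubtitles parallel data.
lexicon = {"pes": "pas", "teče": "trči"}

def translate_tokens(tokens):
    """Word-by-word translation: look up each token, keep it if unknown."""
    return [lexicon.get(tok, tok) for tok in tokens]

print(translate_tokens(["pes", "teče", "hitro"]))  # ['pas', 'trči', 'hitro']
```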