703 record(s) found

Search results

  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of non-standard Serbian 1.0

    This model for morphosyntactic annotation of non-standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200), the ReLDI-NormTagNER-sr corpus (http://hdl.handle.net/11356/1240), the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1241), the hr500k training corpus (http://hdl.handle.net/11356/1210) and the RAPUT corpus (https://www.aclweb.org/anthology/L16-1513/), using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). To handle missing diacritics, these corpora were additionally augmented by repeating parts of the corpora with diacritics removed. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.91.
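The diacritic augmentation described above can be sketched in a few lines: decompose each string to Unicode NFD, drop the combining marks, and append the undiacritized copy to the training data whenever it differs from the original. This is an illustrative sketch, not the actual CLASSLA-StanfordNLP preprocessing code; note that a letter such as Serbian "đ" does not decompose into a base letter plus a combining mark, so it is left unchanged here.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining diacritical marks, e.g. 'č' -> 'c', 'š' -> 's'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def augment_with_undiacritized(sentences):
    """Append a diacritic-free copy of each sentence that actually
    contains diacritics, leaving the originals in place."""
    augmented = list(sentences)
    for sentence in sentences:
        bare = strip_diacritics(sentence)
        if bare != sentence:
            augmented.append(bare)
    return augmented
```

A tagger trained on such an augmented corpus sees both "mačka" and "macka" with the same labels, so it degrades gracefully on undiacritized user-generated text.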
  • Samrómur-Children Demonstration Scripts 22.01

    "Samrómur-Children Demonstration Scripts 22.01" is a set of three code recipes intended to show how to integrate the corpus "Samrómur Children's Icelandic Speech Data 21.09" and the "Icelandic Language Models with Pronunciations 22.01" to create automatic speech recognition systems using the Kaldi toolkit.
  • Poliqarp2

    Poliqarp2 is a linguistic search engine capable of searching through large corpora annotated on multiple levels. It is not an upgraded version of Poliqarp but completely new software developed from scratch.
  • CorpoGrabber

    CorpoGrabber: A Toolchain for Automatic Acquisition and Extraction of Website Content. Jan Kocoń, Wrocław University of Technology. CorpoGrabber is a pipeline of tools that retrieves the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents and requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language-independent. The whole process can be run in parallel on a single machine and includes the following tasks: downloading the HTML subpages of each input page URL [1]; extracting plain text from each subpage by removing boilerplate content such as navigation links, headers, footers and advertisements [2]; deduplicating the plain text [2]; removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3]; and tagging the documents with the Wrocław CRF Tagger (WCRFT) [4]. The last two steps are available only for Polish. The result is a corpus: a set of tagged documents for each website.
    References:
    [1] https://www.httrack.com/html/faq.html
    [2] J. Pomikálek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. thesis, Masaryk University, Faculty of Informatics, Brno.
    [3] A. Radziszewski, T. Śniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, Barcelona, Spain.
    [4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
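The deduplication step in such a pipeline can be illustrated with a minimal sketch (this is not the actual CorpoGrabber code, which per [2] uses more sophisticated near-duplicate detection): hash a whitespace- and case-normalized form of each document and keep only the first occurrence of each hash.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, so trivially different
    copies of the same document hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(documents):
    """Keep the first occurrence of each distinct normalized document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Exact-hash deduplication only catches verbatim repeats; catching near-duplicates (boilerplate-heavy mirror pages, slight edits) requires shingling or similar techniques, as discussed in [2].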
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English (1.0)

    These are the models in http://hdl.handle.net/20.500.12537/125 trained with 40% layer drop. They are suitable for inference using every other layer, which speeds up inference at the cost of some translation quality. For usage, we refer to the prior submission, and to the documentation on LayerDrop at https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/layerdrop/README.md.
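The "every other layer" inference scheme amounts to structured pruning of the trained layer stack. A minimal illustration of the selection (not fairseq code; the real pruning operates on transformer modules and their weights):

```python
def prune_every_other(layers):
    """Keep the layers at even indices, halving network depth.

    Because the model was trained with 40% LayerDrop, the remaining
    layers have learned to function with neighbours missing, so the
    pruned stack still translates, only with reduced quality.
    """
    return [layer for index, layer in enumerate(layers) if index % 2 == 0]
```

For a 12-layer encoder this leaves layers 0, 2, 4, 6, 8 and 10, roughly halving the per-sentence inference cost.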
  • The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.2

    This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~90.42. The difference from the previous version of the model is that this model was trained on the improved SUK 1.1 version of the training corpus.
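The LAS figure quoted above counts a token as correct only when both its predicted head and its dependency label match the gold annotation. A minimal sketch of the metric, assuming each parse is given as a per-token list of (head index, dependency label) pairs:

```python
def labeled_attachment_score(gold, predicted):
    """LAS: fraction of tokens whose (head index, dependency label)
    pair exactly matches the gold-standard annotation."""
    if len(gold) != len(predicted):
        raise ValueError("parses must cover the same tokens")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Dropping the label comparison and matching heads alone gives the related UAS (unlabeled attachment score), which is always at least as high as LAS.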
  • Multi-speaker GlowTTS model for Talrómur 2 (prerelease) (22.10)

    This release includes a partially trained multi-speaker model using the GlowTTS architecture in the Coqui TTS library [1]. The model is trained on all of the speakers in the Talrómur 2 [2] corpus. The release includes the model, the training log, the model configuration file and the recipe used to train the model. The model included here is the best checkpoint available at the time of publication. At run time it is possible to choose any of the Talrómur 2 voices to produce a similar-sounding synthesized voice. [1] https://github.com/cadia-lvl/coqui-ai-TTS/releases/tag/M9 [2] http://hdl.handle.net/20.500.12537/167