Search results

703 record(s) found

  • Lithuanian speech-to-text Transcriber

    Speech-to-text automatic transcriber for Lithuanian is a containerized application composed of 17 containers. It covers four domains: administrative, legal, medical, and general spoken language. To install Transcriber, we recommend installing Docker and Docker Compose. A demo service is provided at https://semantika.lt/Analysis/Transcriber, and IT solutions can be found at https://semantika.lt/Help/Info/Solutions. The transcription result is a set of files containing the same information in different formats: plain text, a WebVTT file (for subtitling purposes), and a file with data for synchronizing the transcription with the audio recording. The latter file is intended for convenient editing of an audio-synchronized transcription; a transcription editor for this purpose is available at http://semantikadocker.vdu.lt/files/transcription-editor-multi.zip.
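    The WebVTT output mentioned above can be post-processed with standard tooling. A minimal sketch of extracting cues from such a file (simplified cue format, illustrative sample text; not the project's own editor):

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:01.000 --> 00:00:04.000"
CUE_TIME = re.compile(
    r"(\d{2}:)?\d{2}:\d{2}\.\d{3} --> (\d{2}:)?\d{2}:\d{2}\.\d{3}"
)

def parse_webvtt(text):
    """Parse WebVTT content into a list of (timing, caption) pairs."""
    cues = []
    lines = iter(text.splitlines())
    for line in lines:
        if CUE_TIME.fullmatch(line.strip()):
            timing = line.strip()
            caption = []
            for nxt in lines:          # collect caption lines until blank
                if not nxt.strip():
                    break
                caption.append(nxt.strip())
            cues.append((timing, " ".join(caption)))
    return cues

sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
Laba diena.

00:00:04.500 --> 00:00:07.000
Kaip sekasi?
"""
print(parse_webvtt(sample))
```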
  • OCR Post-Processing Transformer Model 23.04

    During project L11 - Error models for OCR of the Language Technology Programme 2019-2023, various OCR post-processing models were trained; this is the best-performing one. On texts from the 19th century to the early 20th century, it reduces the word error rate from 6.49% to 3.08% and the character error rate from 1.39% to 0.73%. On modern texts, it reduces the word error rate from 5.52% to 3.60% and the character error rate from 1.17% to 1.0%. More information, such as how to use the model for inference, can be found in the accompanying README.
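    The word and character error rates cited above are standard Levenshtein-based metrics. A minimal sketch of how they are computed (not the project's own evaluation script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))  # one substitution in three words
```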
  • Debiasing Algorithm through Model Adaptation

    Debiasing Algorithm through Model Adaptation (DAMA) is based on guarding stereotypical gender signals and model editing. DAMA is applied to specific modules prone to conveying gender bias, as identified by causal tracing. The method effectively reduces gender bias in LLaMA models in three diagnostic tests: generation, coreference (WinoBias), and stereotypical sentence likelihood (StereoSet). It does not change the model's architecture, parameter count, or inference cost, and the model's performance in language modeling and on a diverse set of downstream tasks remains almost unaffected. This package contains both the source code and English, English-to-Czech, and English-to-German datasets.
  • Slovenian commonsense reasoning model SloMET-ATOMIC 2020

    SloMET-ATOMIC 2020 is a Slovene commonsense reasoning model that predicts commonsense descriptions in natural language for a given input sentence. The model is an adaptation of the Slovene GPT-2 model (https://huggingface.co/cjvt/gpt-sl-base) that has been fine-tuned on the SloATOMIC 2020 corpus (http://hdl.handle.net/11356/1724), consisting of 1.33M everyday inference knowledge tuples about entities and events. The released model is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers).
  • Speech Corpora Toolkit (22.06)

    The Speech Corpora Toolkit is a collection of tools for processing audio recordings and transcripts into a standardized form, preparing them for segmentation and alignment. The output for each source is standardized.
  • Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

    The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool. The released model is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers).
  • Spellchecking app for Android (22.10)

    This is an Android application that provides spell and grammar checking for Icelandic. The app is available on the Google Play Store under the name "Réttritun". The source code is written in Kotlin and could serve as a base for other Android app projects that need an Icelandic spell-checking service. The app uses the spell-checker service as implemented by Miðeind ehf. in the Language Technology Programme. See also: http://hdl.handle.net/20.500.12537/266 and http://hdl.handle.net/20.500.12537/270
  • RÚV-DI Speaker Diarization (20.09)

    This is a set of speaker diarization recipes that depend on the Kaldi speech toolkit. There are two types of recipes: the first is used for decoding unseen audio, and the second is for training diarization models on the Rúv-di data. This tool also lists the DER on the Rúv-di dataset for most of the recipes. All DERs reported here use no unscored collars and include overlapping speech.
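    Scoring with no unscored collars and with overlapping speech included can be illustrated by a simple frame-level DER computation. A toy sketch (not the scoring used by the recipes), where reference and hypothesis are per-frame sets of active speakers and hypothesis labels are assumed already mapped to reference labels:

```python
def frame_der(ref, hyp):
    """Frame-level diarization error rate with no collar.

    ref, hyp: lists (one entry per frame) of sets of active speaker labels.
    Overlapping speech is scored: each frame contributes
    max(|ref|, |hyp|) minus the size of the label overlap,
    normalized by the total amount of reference speech.
    """
    errors = 0
    total_speech = 0
    for r, h in zip(ref, hyp):
        total_speech += len(r)
        errors += max(len(r), len(h)) - len(r & h)
    return errors / total_speech if total_speech else 0.0

# 5 frames: one speaker confusion (frame 2) and one false alarm (frame 5)
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"A", "B"}, {"B"}, {"B"}]
print(frame_der(ref, hyp))  # 2 errors over 5 frames of speech -> 0.4
```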
  • Face-domain-specific automatic speech recognition models

    This entry contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit (https://github.com/kaldi-asr/kaldi): the acoustic model, the language model, and other relevant files, together with the scripts and configuration files needed to use these models. The acoustic model was trained with the Kaldi ASR tools and the Artur speech corpus (http://hdl.handle.net/11356/1776; http://hdl.handle.net/11356/1772). The language model was trained on domain-specific text data containing face descriptions, obtained by translating the Face2Text English dataset (https://github.com/mtanti/face2text-dataset) into Slovenian. These models, combined with other necessary files such as the HCLG.fst and the decoding scripts, enable the implementation of face-domain-specific ASR applications. Two speech corpora ("test" and "obrazi") and two Kaldi ASR models ("graph_splosni" and "graph_obrazi") can be selected for speech recognition tests by setting the variables "graph" and "test_sets" in the "local/test_recognition.sh" script, which also extracts the acoustic speech features and runs the recognition tests. Test results can be obtained with the "results.sh" script. The KALDI_ROOT environment variable must be set in "path.sh" to point to the Kaldi ASR toolkit installation folder.
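    The steps above can be sketched as a configuration fragment (variable names and script paths are taken from the description; the KALDI_ROOT value is an assumed installation location):

```shell
# path.sh: point the scripts at the local Kaldi installation
export KALDI_ROOT=/opt/kaldi      # assumed install location

# local/test_recognition.sh: select the decoding graph and test set
graph=graph_obrazi                # or: graph_splosni
test_sets=obrazi                  # or: test

# extract features and run the recognition tests, then collect results
./local/test_recognition.sh
./results.sh
```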