CLARIN Tool Portal

698 record(s) found

Search results

Malach Center User Interface 1.0

2 resources

Source code of the first full, running version of the Malach Center User Interface. It does not contain data or metadata of the digital objects and resources.

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian

2 resources

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). The estimated F1 of the lemma annotations is ~99.0.

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Croatian

3 resources

The model for UD dependency parsing of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The estimated LAS of the parser is ~85.9.
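
As a rough illustration of what the LAS figure measures: the labeled attachment score is the percentage of tokens whose predicted head and dependency relation both match the gold annotation. The sketch below assumes gold tokenization (so gold and predicted tokens align one to one); the function name and the example arcs are hypothetical and are not part of the CLASSLA tooling or its evaluation scripts.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency relation label).
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Return the percentage of tokens with both head and label correct.

    Hypothetical helper for illustration; assumes the two sequences
    describe the same tokens in the same order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Invented four-token example: the third token has the right head
# but the wrong relation label, so it does not count as correct.
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "punct")]
score = labeled_attachment_score(gold_arcs, pred_arcs)
```

Under this definition, a reported LAS of about 85.9 would mean that roughly 86 of every 100 tokens receive both the correct head and the correct relation.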

EdUKate translation software 1

2 resources

This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (html, docx, odt, pptx, odp,...). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to help mitigate language barriers between non-Czech-speaking children in the Czech Republic and education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

MorphoDiTa-based tagger for Polish language

4 resources

A MorphoDiTa-based tagger for the Polish language: a tool for morphosyntactic unification of Polish according to the NKJP tagset.

GlowTTS models for Talrómur1 (22.10)

6 resources

This release contains GlowTTS models for four different voices from the Talrómur 1 corpus [1]. The models were trained with the Coqui TTS library after it had been adapted for Icelandic. For each voice, the release includes the model, the model configuration file, the training log, and the recipe used. [1] http://hdl.handle.net/20.500.12537/104

Semi-supervised Icelandic-Polish Translation System (22.09)

8 resources

This bidirectional Icelandic-Polish translation model was trained with fairseq (https://github.com/facebookresearch/fairseq) using semi-supervised translation, starting from the mBART50 model. Training followed a multi-task curriculum: the model first learned to denoise sentences, was then trained to translate using aligned parallel texts, and was finally given monolingual texts in both Icelandic and Polish, from which it iteratively created back-translations. For the PL-IS direction, the model achieves a BLEU score of 27.60 on held-out true parallel data and 15.30 on the out-of-domain Flores devset. For the IS-PL direction, it achieves 27.70 on the true parallel data and 13.30 on the Flores devset.

CorpoGrabber

1 resource

CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content
Jan Kocoń, Wroclaw University of Technology

CorpoGrabber is a pipeline of tools for retrieving the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents. It requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent.

The whole process can be run in parallel on a single machine and includes the following tasks: downloading the HTML subpages of each input page URL [1]; extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, and advertisements from HTML pages) [2]; deduplicating the plain text [2]; removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3]; and tagging documents using the Wrocław CRF Tagger (WCRFT) [4]. The last two steps are available only for Polish. The result is a corpus consisting of a set of tagged documents for each website.

References
[1] https://www.httrack.com/html/faq.html
[2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics. Brno.
[3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain.
[4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
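
To make the deduplication step of such a pipeline concrete, here is a minimal sketch that drops documents whose whitespace- and case-normalized text has already been seen. It uses exact-match hashing purely for illustration; this is a simplification swapped in here, not the method of [2] and not CorpoGrabber's actual implementation, and the function names are hypothetical.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text before hashing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalized text.

    Illustrative exact-match deduplication only; near-duplicate detection
    as used in real web-corpus tools is out of scope for this sketch.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Invented sample texts: the second differs from the first only in
# letter case and whitespace, so it is treated as a duplicate.
pages = ["Ala ma kota.", "ala  ma kota.\n", "Inny tekst."]
unique_pages = deduplicate(pages)
```

Hashing normalized text keeps memory use proportional to the number of distinct documents, which matters when many pages from one site share identical content.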

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

3 resources

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the CLARIN.SI-embed.bg word embeddings (http://hdl.handle.net/11356/1796). The estimated LAS of the parser is ~91.18. This version differs from the previous version of the parser in that it was trained using the new version of the Bulgarian word embeddings.

EdUKate Czech-Ukrainian translation model 2024

2 resources

This package includes a Czech-to-Ukrainian translation model adapted for the educational domain. The model is exported into the TensorFlow Serving format (using Tensor2tensor version 1.6.6), so it can be used in the Charles Translator service (https://translator.cuni.cz) and in the web portal Škola s nadhledem. The model was developed within the EdUKate project, which aims to help mitigate language barriers between non-Czech-speaking children in the Czech Republic and education in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.


