Search results

703 record(s) found

  • Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

    Tokenizer, POS tagger, lemmatizer and parser models for 147 treebanks in 78 languages of the Universal Dependencies 2.15 release, trained solely on UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation, including performance figures, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
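    A minimal sketch of querying these models through the UDPipe 2 REST service hosted at LINDAT; the model identifier below is an assumption -- list the currently deployed models via the service before use:

      # A hedged sketch: POST raw text to the UDPipe 2 REST service.
      # The model name is an assumption; query the service's model list first.
      import requests

      UDPIPE_API = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
      response = requests.post(UDPIPE_API, data={
          "model": "english-ewt-ud-2.15-241121",  # assumed identifier
          "tokenizer": "",  # an empty value enables the component with defaults
          "tagger": "",
          "parser": "",
          "data": "UDPipe 2 turns raw text into CoNLL-U.",
      })
      response.raise_for_status()
      print(response.json()["result"])  # CoNLL-U output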
  • LVBERT - Latvian BERT

    LVBERT is the first publicly available monolingual BERT language model pre-trained for Latvian. For training, we used the original TensorFlow implementation of BERT with the whole-word masking and next-sentence-prediction objectives. We used the BERT-Base configuration with 12 layers, 768 hidden units, 12 heads, a sequence length of 128, a mini-batch size of 128, and a 32,000-token vocabulary.
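    As a usage sketch, a BERT checkpoint like this can be loaded with the Hugging Face transformers library; the hub identifier below is an assumption, so adjust it to wherever the checkpoint is published, or point from_pretrained() at a local copy:

      # A hedged sketch of loading LVBERT with transformers; the hub id is
      # an assumption, not confirmed by this record.
      from transformers import AutoModel, AutoTokenizer

      model_id = "AiLab-IMCS-UL/lvbert"  # assumed hub id
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModel.from_pretrained(model_id)

      inputs = tokenizer("Latvijas Universitāte", return_tensors="pt")
      outputs = model(**inputs)
      print(outputs.last_hidden_state.shape)  # (1, seq_len, 768 hidden units)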
  • GlowTTS models for Talrómur1 (22.10)

    This release contains GlowTTS models for four different voices from the Talrómur 1 [1] corpus. The models were trained using the Coqui TTS library after it was adapted for Icelandic. Included for each voice are the model, the model configuration, the training log, and the recipe used. [1] http://hdl.handle.net/20.500.12537/104
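    A minimal synthesis sketch with the Coqui TTS library the models were trained with; the checkpoint and config file names are placeholders for the files shipped with each voice in this release:

      # A hedged sketch: synthesize speech from one released GlowTTS voice.
      # File names are placeholders; without a separate vocoder, Coqui falls
      # back to Griffin-Lim waveform reconstruction.
      from TTS.utils.synthesizer import Synthesizer

      synthesizer = Synthesizer(
          tts_checkpoint="voice1/model.pth",     # placeholder path
          tts_config_path="voice1/config.json",  # placeholder path
      )
      wav = synthesizer.tts("Halló, þetta er prófun.")  # "Hello, this is a test."
      synthesizer.save_wav(wav, "output.wav")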
  • Slavic Forest, Norwegian Wood (models)

    Trained models for UDPipe used to produce our final submission to the VarDial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the SV model on a concatenation of DA and NO data. The scripts and commands used to create the models are part of a separate submission (http://hdl.handle.net/11234/1-1970). The models were trained with UDPipe version 3e65d69 from 3rd Jan 2017, obtained from https://github.com/ufal/udpipe -- their functionality with newer or older versions of UDPipe is not guaranteed.

    We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLL-U format. The models only use the form, UPOS, and universal features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission.

    SK -- tag and parse with the model:
      udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
    A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included. It is applied in the same way:
      udpipe --tag --parse sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu

    HR -- prune the features to keep only Case, then parse with the model:
      python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe

    NO -- put the UPOS annotation aside, tag features with the model, merge with the set-aside UPOS annotation, and parse with the model (this workaround is needed because UDPipe cannot be told to keep UPOS and only change the features):
      cut -f1-4 no-ud-predPoS-test.conllu > tmp
      udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe
  • Database of the Western South Slavic Verb HyperVerb -- Derivation

    The Western South Slavic verb database (WeSoSlaV) contains the 3,000 most frequent Slovenian and the 5,300 most frequent BCS verbs, all coded for a number of properties related to verb derivation. The database is a table in which each verb has a row of its own, and the coded properties are organized in columns. Verbs in the database are coded for the following properties: root information; whether or not the verb has prefixes and, if so, the identity of the included prefix(es); whether or not the verb has suffixes and, if so, the identity of the included suffix(es); etc. All coded properties are explained in the accompanying PDF file.
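    Since the database is a single table with one row per verb and one column per coded property, it can be explored with any table tool; a hedged pandas sketch, in which the file name and column names are hypothetical (the real column inventory is documented in the accompanying PDF):

      # Hypothetical file and column names -- consult the accompanying PDF
      # for the actual layout.
      import pandas as pd

      df = pd.read_csv("wesoslav.tsv", sep="\t")  # assumed TSV export
      print(df["has_prefix"].value_counts())      # hypothetical column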
  • The CLASSLA-Stanza model for lemmatisation of standard Slovenian 2.0

    This model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.11. The difference from the previous version is that this model was trained on the SUK training corpus and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
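    A minimal usage sketch with the classla package; note that classla.download() fetches the current default Slovenian models, which may or may not be this exact 2.0 release:

      # Lemmatise standard Slovenian with CLASSLA-Stanza.
      import classla

      classla.download("sl")  # fetches the current default models
      nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma")
      doc = nlp("France Prešeren je bil slovenski pesnik.")
      for sent in doc.sentences:
          for word in sent.words:
              print(word.text, word.lemma)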
  • NELexicon2

    NELexicon2 is an extended version of a gazetteer of proper names and contains more than 2.3 million unique strings. NELexicon was enriched with the following resources: diminutive forms of first names; foreign-language forms of Polish first names; names extracted from the infoboxes of the Polish Wikipedia; inflected forms of names from the Polish Wikipedia infoboxes, extracted from Wikipedia's internal links; a list of names recognized by Liner2 with the 56-category model, with a frequency of occurrence equal to or greater than 5 (since these names were recognized automatically, the list may contain incorrectly recognized names); and inflected forms of names extracted from the Polish Wiktionary.
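    A hedged loading sketch; the file name and the one-entry-per-line, tab-separated "name<TAB>category" layout are assumptions, so check the distribution's own documentation for the actual format:

      # Assumed format: one "name<TAB>category" entry per line.
      names_by_category = {}
      with open("nelexicon2.txt", encoding="utf-8") as f:
          for line in f:
              name, _, category = line.rstrip("\n").partition("\t")
              names_by_category.setdefault(category, set()).add(name)
      print(len(names_by_category))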
  • CUBBITT Translation Models (en-pl) (v1.0)

    CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models are compatible with Tensor2Tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl 12.3, pl->en 20.0 (evaluated using multeval: https://github.com/jhclark/multeval).
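    A hedged sketch of calling these models through the Lindat translation service; the endpoint path and the "en-pl" model identifier follow the service's public REST pattern but are assumptions, so verify them on the service page linked above:

      # Assumed endpoint and model identifier; see the service page.
      import requests

      url = "https://lindat.mff.cuni.cz/services/translation/api/v2/models/en-pl"
      resp = requests.post(url, data={"input_text": "The models run behind a REST API."})
      resp.raise_for_status()
      print(resp.text)  # translated text (format depends on the Accept header)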
  • The CLASSLA-Stanza model for UD dependency parsing of spoken Slovenian 2.2

    This model for UD dependency parsing of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~81.91.
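    A minimal parsing sketch with the classla package; classla.download() fetches the current default Slovenian models, which may or may not include this exact 2.2 spoken-language parser, and loading a specific downloaded model may require extra configuration:

      # UD dependency parsing of Slovenian with CLASSLA-Stanza.
      import classla

      classla.download("sl")  # fetches the current default models
      nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse")
      doc = nlp("Modeli so prosto dostopni.")
      for sent in doc.sentences:
          for word in sent.words:
              print(word.id, word.text, word.deprel, word.head)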