Search results

703 record(s) found

  • Binary Error Classifier for Icelandic Sentences (22.09)

    The model is a byT5-base Transformer fine-tuned for error detection in natural language. It is tuned for sentence classification on parallel synthetic error data and on real error data from the Icelandic Error Corpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained byT5 model was first trained on the synthetic data and then fine-tuned on the real error data from the error corpora. The objective was to train a grammatical error detection model that classifies whether or not a sentence contains an error. The overall F1 score is 72.8% (precision: 76.3, recall: 71.7).
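
    A minimal sketch of loading such a fine-tuned byT5 checkpoint for sentence classification with Hugging Face transformers. The model directory is a placeholder for the files in this record, and the classification head and label mapping are assumptions; if the checkpoint was trained text-to-text instead, AutoModelForSeq2SeqLM would be the entry point to use.

      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      model_dir = "path/to/byt5-error-classifier"  # hypothetical local path
      tokenizer = AutoTokenizer.from_pretrained(model_dir)  # byT5 works on raw bytes
      model = AutoModelForSequenceClassification.from_pretrained(model_dir)

      inputs = tokenizer("Mér hlakkar til að sjá þig.", return_tensors="pt")
      with torch.no_grad():
          logits = model(**inputs).logits
      # Assumed label mapping: 0 = no error, 1 = contains an error.
      print(logits.argmax(dim=-1).item())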
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.1

    This model for morphosyntactic annotation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training corpus (to be published) and using the Macedonian CLARIN.SI word embeddings (http://hdl.handle.net/11356/1359). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~97.6. The difference from the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and have been adapted to the new code base.
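
    A minimal sketch of tagging Macedonian text, assuming the newer classla package (the successor of classla-stanfordnlp), which distributes comparable standard Macedonian models:

      import classla

      classla.download("mk")  # fetch the standard Macedonian models once
      nlp = classla.Pipeline("mk", processors="tokenize,pos")
      doc = nlp("Ова е реченица на македонски јазик.")
      for word in doc.sentences[0].words:
          print(word.text, word.upos, word.xpos, word.feats)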
  • Translation Models (en-de) (v1.0)

    En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->de: 25.9, de->en: 33.4 (evaluated using multeval: https://github.com/jhclark/multeval).
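
    A minimal sketch of querying the translation service from Python. The endpoint shape (src/tgt query parameters, form-encoded input_text) is an assumption; consult the service page for the current API documentation.

      import requests

      url = "https://lindat.mff.cuni.cz/services/translation/api/v2/languages/"
      resp = requests.post(
          url,
          params={"src": "en", "tgt": "de"},
          data={"input_text": "The weather is nice today."},
      )
      resp.raise_for_status()
      print(resp.text)  # the German translation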
  • Error Classifier (Icelandic Error Corpus categories) for Tokens (22.05)

    The Icelandic Error Corpus (http://hdl.handle.net/20.500.12537/73) was used to fine-tune the Icelandic language model IceBERT-xlmr-ic3 for token classification. The objective was to train grammatical error detection models that can classify whether a token range contains a particular error type. The model can mark tokens as belonging to one of the following issue categories: coherence, grammar, orthography, other, style and vocabulary. The overall F1 score is 71; for the individual categories it is: coherence: 0; grammar: 63; orthography: 86; other: 0; vocabulary: 15.2.
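
    A minimal sketch of running the token classifier with the transformers pipeline API. The model path is a placeholder for the files in this record, and the label names are assumed to follow the six categories listed above.

      from transformers import pipeline

      clf = pipeline(
          "token-classification",
          model="path/to/icebert-error-token-classifier",  # hypothetical path
          aggregation_strategy="simple",  # merge subword pieces into token spans
      )
      for span in clf("Þetta er setning með hugsanlegri villu."):
          print(span["entity_group"], span["word"], round(span["score"], 3))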
  • CSTlemma version 8.1.2

    CSTlemma is a lemmatizer that treats prefixes, infixes and suffixes alike. The CST lemmatizer can be (and already has been) trained for dozens of languages, including ones that require lemmatization rules that add or remove prefixes and/or infixes to obtain the lemma of a word. In Dutch, for example, the word "afgemaakt" has the lemma "afmaken", so the "ge" has to be removed, one "a" has to be dropped and the "t" ending must be replaced by "en". New in version 8 of CSTlemma is the option to output the rule by which a given word is transformed into its lemma, or just a unique identifier for that rule (in practice, a pointer into the data structure that holds the rule set). Rules for CSTlemma must be created with the affixtrain program (https://github.com/kuhumcst/affixtrain), but ready-made rules are available online; for example, the https://github.com/kuhumcst/texton-linguistic-resources repository contains rules for about 30 languages. To build CSTlemma you need not only the source code in https://github.com/kuhumcst/cstlemma, but also some source files from https://github.com/kuhumcst/letterfunc and https://github.com/kuhumcst/parsesgml. The easiest way to proceed is to copy https://github.com/kuhumcst/cstlemma/blob/master/doc/makecstlemma.bash to a (Linux, possibly macOS) folder and run that script, which fetches all needed repositories and builds cstlemma.
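
    A minimal sketch of the build route described above, driven from Python. The final lemmatizer invocation is left commented out because its flag names are assumptions; consult cstlemma's help output and a flex-rules file for your language.

      import subprocess
      import urllib.request

      script = ("https://raw.githubusercontent.com/kuhumcst/cstlemma/"
                "master/doc/makecstlemma.bash")
      urllib.request.urlretrieve(script, "makecstlemma.bash")
      subprocess.run(["bash", "makecstlemma.bash"], check=True)  # clones and builds cstlemma

      # Hypothetical usage with ready-made rules from texton-linguistic-resources
      # (flag names are assumptions, not verified against cstlemma's CLI):
      # subprocess.run(["./cstlemma", "-f", "flexrules.nl", "-i", "words.txt"])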
  • MCSQ Translation Models (en-de) (v1.0)

    En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained on the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip) and are mainly intended for in-domain translation of social surveys. The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on the MCSQ test set (BLEU): en->de: 67.5 (trained on genuine in-domain MCSQ data only), de->en: 75.0 (trained with additional in-domain backtranslated MCSQ data) (evaluated using multeval: https://github.com/jhclark/multeval).
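
    A minimal sketch of calling such an exported model through TensorFlow Serving's REST API once it is served locally. The model name, port and input encoding are assumptions: Tensor2tensor 1.6.6 exports typically expect serialized tf.Example protos with an int64 "inputs" feature of subword ids, which the JSON API accepts base64-encoded under a "b64" key.

      import base64
      import requests
      import tensorflow as tf  # used only to build the tf.Example proto

      def make_example(subword_ids):
          feat = tf.train.Feature(int64_list=tf.train.Int64List(value=subword_ids))
          ex = tf.train.Example(features=tf.train.Features(feature={"inputs": feat}))
          return ex.SerializeToString()

      # The ids must come from the subword vocabulary the model was trained
      # with (not shown here); this list is a placeholder.
      instance = {"b64": base64.b64encode(make_example([101, 202, 1])).decode()}
      resp = requests.post("http://localhost:8501/v1/models/en-de:predict",
                           json={"instances": [instance]})
      print(resp.json())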
  • The CLASSLA-StanfordNLP model for JOS dependency parsing of standard Slovenian 1.0

    The model for JOS dependency parsing of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the JOS-parsed portion of the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The estimated LAS of the parser is ~93.5.
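
    A minimal sketch of dependency parsing Slovenian text, again assuming the newer classla package as the successor of classla-stanfordnlp; whether its default Slovenian parser exposes the JOS dependency scheme is a detail to verify in the classla documentation.

      import classla

      classla.download("sl")  # fetch the standard Slovenian models once
      nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse")
      doc = nlp("Janez je prebral zanimivo knjigo.")
      for word in doc.sentences[0].words:
          print(word.id, word.text, word.lemma, word.head, word.deprel)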