CLARIN Tool Portal

Slavic Forest, Norwegian Wood (models)

5 resources

Trained models for UDPipe used to produce our final submission to the Vardial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the SV model on a concatenation of DA and NO data. The scripts and commands used to create the models are part of separate submission (http://hdl.handle.net/11234/1-1970). The models were trained with UDPipe version 3e65d69 from 3rd Jan 2017, obtained from https://github.com/ufal/udpipe -- their functionality with newer or older versions of UDPipe is not guaranteed. We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLLU format. The models only use the form, UPOS, and Universal Features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission. SK -- tag and parse with the model: udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included. It is applied in the same way (udpipe --tag --parse sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu). HR -- prune the Features to keep only Case and parse with the model: python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe NO -- put the UPOS annotation aside, tag Features with the model, merge with the left-aside UPOS annotation, and parse with the model (this hassle is because UDPipe cannot be told to keep UPOS and only change Features): cut -f1-4 no-ud-predPoS-test.conllu > tmp udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe

Use "Slavic Forest, Norwegian Wood (models)"

Terminal-based CoNLL-file viewer, v2

4 resources

A simple way of browsing CoNLL format files in your terminal. Fast and text-based. To open a CoNLL file, simply run: ./view_conll sample.conll The output is piped through less, so you can use less commands to navigate the file; by default the less searches for sentence beginnings, so you can use "n" to go to next sentence and "N" to go to previous sentence. Close by "q". Trees with a high number of non-projective edges may be difficult to read, as I have not found a good way of displaying them intelligibly. If you are on Windows and don't have less (but have Python), run like this: python view_conll.py sample.conll For complete instructions, see the README file. You need Python 2 to run the viewer.

Use "Terminal-based CoNLL-file viewer, v2"

UDify Pretrained Model

3 resources

Pretrained model weights for the UDify model, and extracted BERT weights in pytorch-transformers format. Note that these weights slightly differ from those used in the paper.

Use "UDify Pretrained Model"

Biaffine-based UD Parser 22.10

2 resources

ENGLISH: This Universal Dependencies parser for Icelandic was trained with Diaparser [1] on IcePaHC [2] and UD_Icelandic-Modern [3], the latter one having been revised before training, as some duplicate sentences had to be removed. The parser utilizes information from an ELECTRA language model [4]. Its UAS (unlabeled attachment score) is 89.52 and its LAS (labeled attachment score) is 86.23.

Use "Biaffine-based UD Parser 22.10"

COMBO-based UD Parser 22.10

1 resources

ENGLISH: This Universal Dependencies parser for Icelandic was trained with COMBO on IcePaHC and UD_Icelandic-Modern, the latter one having been revised before training, as some duplicate sentences had to be removed. It utilizes information from an ELECTRA language model (https://huggingface.co/jonfd/electra-base-igc-is). Its UAS (unlabeled attachment score) is 89.13 and its LAS (labeled attachment score) is 85.97.

Use "COMBO-based UD Parser 22.10"

Slavic Forest, Norwegian Wood (scripts)

11 resources

Tools and scripts used to create the cross-lingual parsing models submitted to VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971). For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following into this directory: - treebanks (Universal Dependencies v1.4): SS-ud-train.conllu TT-ud-predPoS-dev.conllu - parallel data (OpenSubtitles from Opus): OpenSubtitles2016.SS-TT.SS OpenSubtitles2016.SS-TT.TT !!! If they are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink them (or move, or copy) !!! - target tagging model TT.tagger.udpipe All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017 You also need to have: - Bash - Perl 5 - Python 3 - word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014 - udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017 - Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016 The most basic setup is the sl-hr one (train_sl-hr.sh): - normalization of deprels - 1:1 word-alignment of parallel data with Monolingual Greedy Aligner - simple word-by-word translation of source treebank - pre-training of target word embeddings - simplification of morpho feats (use only Case) - and finally, training and evaluating the parser Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see paper for details). Moreover, cs-sk also adds more morpho features, selecting those that seem to be very often shared in parallel data. The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.

Use "Slavic Forest, Norwegian Wood (scripts)"

COMBO-based UD Parser for Icelandic 22.12

6 resources

ENGLISH: This Universal Dependencies parser for Icelandic was trained with COMBO [1]. This version of it was trained on v2.11 of UD_Icelandic-IcePaHC [2] and UD_Icelandic-Modern [3]. (Note that texts in UD_Icelandic-Modern [3] labeled RUV_TGS_2017 and RUV_ESP_2017 were not included here as these were originally parsed with COMBO-based UD Parser 22.10 [4] and the output subsequently corrected.) The parser utilizes information from an ELECTRA language model [4]. Its UAS (unlabeled attachment score) is 88.80 (89.00 on a pre-tokenized text file) and its LAS (labeled attachment score) is 85.52 (85.71 if pre-tokenized). ICELANDIC: Þessi UD-þáttari var þjálfaður með COMBO [1]. Hann var þjálfaður á útgáfu 2.11 af UD_Icelandic-IcePaHC [2] og UD_Icelandic-Modern [3]. (Ath. að textar í UD_Icelandic-Modern [3] merktir RUV_TGS_2017 og RUV_ESP_2017 voru ekki notaðir við þjálfunina þar sem þeir voru upphaflega þáttaðir með COMBO-based UD Parser 22.10 [4] og úttakið leiðrétt að því loknu.) Þáttarinn nýtir sér upplýsingar úr ELECTRA-mállíkani [5]. Hann skorar 88.80 (89.00 á fortókuðu skjali) á UAS (unlabeled attachment score) og 85.52 (85.71 á fortókuðu skjali) á LAS (labeled attachment score). [1] COMBO: https://gitlab.clarin-pl.eu/syntactic-tools/combo/ [2] UD_Icelandic-IcePaHC: https://github.com/UniversalDependencies/UD_Icelandic-IcePaHC/ [3] UD_Icelandic-Modern: https://github.com/UniversalDependencies/UD_Icelandic-Modern/ [4] COMBO-based UD Parser 22.10: http://hdl.handle.net/20.500.12537/272 [5] electra-base-igc-is: https://huggingface.co/jonfd/electra-base-igc-is

Use "COMBO-based UD Parser for Icelandic 22.12"

Dependency tree extraction tool STARK 1.0

2 resources

STARK is a python-based command-line tool for extraction of dependency trees from parsed corpora, aimed at corpus-driven linguistic investigations of syntactic phenomena of various kinds. It supports the CONLL-U format (https://universaldependencies.org/format.html) as input and returns a list of all relevant dependency trees, frequencies, and other associated information in the form of a tab-separated .tsv file. For installation, execution and the description of various user-defined parameter settings, see the official project page at: https://gitea.cjvt.si/lkrsnik/STARK. This entry corresponds to commit 421f12cac6 in the Git repository.

Use "Dependency tree extraction tool STARK 1.0"

Dependency tree extraction tool STARK 2.0

2 resources

STARK is a python-based command-line tool for extraction of dependency trees from parsed corpora, aimed at corpus-driven linguistic investigations of syntactic and lexical phenomena of various kinds. It takes a treebank in the CONLL-U format as input and returns a list of all relevant dependency trees with frequency information and other useful statistics, such as the strength of association between the nodes of a tree, or its significance in comparison to another treebank. For installation, execution and the description of various user-defined parameter settings, see the official project page at: https://github.com/clarinsi/STARK In comparison with v1, this version introduces several new features and improvements, such as the option to set parameters in the command line, compare treebanks or visualise results online.

Use "Dependency tree extraction tool STARK 2.0"

UDConverter for GreynirCorpus 22.06

2 resources

UDConverter for GreynirCorpus is a tool for converting the constituency treebanks of GreynirCorpus (http://hdl.handle.net/20.500.12537/119) to dependency treebanks following the Universal Dependencies framework. The tool has been adapted to the GreynirCorpus annotation scheme, from the original UDConverter (http://hdl.handle.net/20.500.12537/169).

Use "UDConverter for GreynirCorpus 22.06"

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

Slavic Forest, Norwegian Wood (models)

Terminal-based CoNLL-file viewer, v2

UDify Pretrained Model

Biaffine-based UD Parser 22.10

COMBO-based UD Parser 22.10

Slavic Forest, Norwegian Wood (scripts)

COMBO-based UD Parser for Icelandic 22.12

Dependency tree extraction tool STARK 1.0

Dependency tree extraction tool STARK 2.0

UDConverter for GreynirCorpus 22.06