Active filters:

  • Availability: Public
  • Tool task: Word embeddings
7 records found

Search results

  • Word embeddings CLARIN.SI-embed.mk 2.0

    CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText, trained on 933,231,582 tokens of running text for 986,670 lowercased surface forms. The difference from the previous version of the embeddings is that this version was trained on the original dataset expanded with the MaCoCu-mk web crawl corpus (http://hdl.handle.net/11356/1512).
  • Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

    The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool. Compared with the previous version (1.0), this version was trained for a further 61 epochs (v1.0: 37 epochs, v2.0: 98 epochs), for a total of 200,000 iterations/updates. The released model is a PyTorch neural network model, intended for use with the transformers library https://github.com/huggingface/transformers (sloberta.2.0.transformers.tar.gz) or the fairseq library https://github.com/pytorch/fairseq (sloberta.2.0.fairseq.tar.gz).
  • CroSloEngual BERT

    A trilingual BERT (Bidirectional Encoder Representations from Transformers) model trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library).
  • TimeAssign

    TimeAssign is a program that recognizes temporal expressions and assigns TimeML labels to words in Polish text, using a Bi-LSTM-based neural network and wordform embeddings.
  • CroSloEngual BERT 1.1

    A trilingual BERT (Bidirectional Encoder Representations from Transformers) model trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library). Changes in version 1.1: the vocab.txt file was fixed, as the previous version had an error that caused very poor results during fine-tuning and/or evaluation.
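Static word embeddings such as the fastText skip-gram vectors in CLARIN.SI-embed.mk map each lowercased surface form to one fixed dense vector, and semantically related words end up with similar vectors. A minimal sketch of how such vectors are typically compared, using tiny hypothetical 4-dimensional toy vectors rather than the actual CLARIN.SI-embed.mk data:

```python
import math

# Toy 4-dimensional vectors standing in for fastText skip-gram embeddings.
# The real CLARIN.SI-embed.mk vectors are much higher-dimensional and are
# looked up by lowercased surface form; these values are made up.
embeddings = {
    "куче": [0.9, 0.1, 0.3, 0.0],   # "dog" (hypothetical values)
    "мачка": [0.8, 0.2, 0.4, 0.1],  # "cat" (hypothetical values)
    "влада": [0.0, 0.9, 0.1, 0.8],  # "government" (hypothetical values)
}

def cosine(u, v):
    """Cosine similarity, the standard measure for comparing word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words should score higher than unrelated ones.
print(cosine(embeddings["куче"], embeddings["мачка"]) >
      cosine(embeddings["куче"], embeddings["влада"]))  # prints True
```

Contextual models such as SloBERTa and CroSloEngual BERT differ from this in that the vector for a word depends on its sentence context, so there is no single fixed lookup table; embeddings are produced by running the whole model over the text.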