703 record(s) found

Search results

  • MT: Moses-SMT (1.0)

    Moses phrase-based statistical machine translation (Moses PBSMT) is a system for developing and running statistical machine translation models. It is distributed here as four packages:
    1. Code from a GitHub repository to train and run models
    2. A pretrained is-en system (Docker)
    3. A pretrained en-is system (Docker)
    4. A frontend to pre- and post-process text for translation (Docker)
    The models here are not (exactly) the same as those used for human evaluation: they have additionally been trained on open dictionaries to extend their vocabularies.
  • Periphraser

    Periphraser is a tool for storing and presenting a knowledge base of conventionalized periphrastic nominal expressions (i.e. phrases headed by a noun) together with their textually attested realizations. For instance, in the Polish demo the database entry for the phrase "Robert Lewandowski" includes the phrase "the Polish international", while "pediatrics" is featured as "medical care for children". The database can be queried over a REST API and exported to XML or CSV format. For Polish, the tool also provides more advanced mechanisms: automatic semantic and syntactic normalization, automatic error detection (based in part on NKJP frequencies and the number of results returned by a web search), and a simple interface for commenting on and flagging entries that are possibly wrong or need improvement.
  • Long term archive operating system source code

    This submission contains the operating system of the long-term archive built at the Polish-Japanese Academy of Information Technology for the Clarin-PL project. The basic elements of the archive are data nodes equipped with mass storage. The nodes are controlled by embedded low-power computers and are independently powered up only when their storage is about to be accessed. This not only limits overall energy consumption but also lowers environmental demands (no air conditioning is needed). The nodes are grouped in trays; the basic and recommended configuration allows 30 nodes per tray, but this limit can be extended up to 253. Each tray contains several networks designed for data transport, device state control, and power supply. Communication with clients is conducted through buffers, which are the only parts visible from externally connected networks; stored files are therefore completely isolated and cannot be accessed directly. Multiple trays located at a single physical site form a complete archive, and the storage space can be split into virtual archives that are separated at the logical level.

    The operating system of the data network stores 3 to 7 copies of each file in different nodes, and additional copies of a resource may be stored automatically in remote archives; the trays are treated as local parts of a wider, dispersed data network. The archive software not only enables secure read and write operations but also automatically maintains the stored data: it periodically regenerates the physical state of saved files, and in case of device failure clients are transparently redirected to local or remote redundant copies. A mechanism of "software bots" is implemented: the archive can be supplied with external programs that process files stored inside the data network. This enables data analysis, indexing, metadata creation, statistical computation, and the discovery of associations in unstructured Big Data sets. Only the output of a software bot can be accessed externally, which makes such operations very secure. Client programs communicate with the archive using a set of simple protocols based on key-value pair strings, making it convenient to build web interfaces for archive access and administration. By automating resource supervision, reducing storage requirements, and precisely controlling energy consumption, the proposed solution significantly lowers the cost of long-term data storage.
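    The entry notes that client programs communicate with the archive using simple protocols based on key-value pair strings. The exact wire format is not specified here; the sketch below assumes a hypothetical key=value-per-line layout purely to illustrate how convenient such a protocol is to parse and build.

```python
# Hypothetical sketch of a key-value protocol message of the kind
# described above. The "key=value per line" layout is an assumption,
# not the archive's documented format.

def parse_message(raw: str) -> dict:
    """Parse a key=value-per-line protocol string into a dict."""
    fields = {}
    for line in raw.strip().splitlines():
        if not line or "=" not in line:
            continue  # skip blank and malformed lines
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields

def build_message(fields: dict) -> str:
    """Serialize a dict back into the same key=value format."""
    return "\n".join(f"{k}={v}" for k, v in fields.items())
```

    A request such as `"op=read\nfile=corpus.xml\ncopies=3"` round-trips cleanly through these two helpers, which is what makes thin web front-ends for access and administration easy to build on top.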
  • LitLat BERT

    A trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model trained on Lithuanian, Latvian, and English data: a state-of-the-art tool that represents words/tokens as contextually dependent embeddings and is used for various NLP classification tasks by fine-tuning the model end-to-end. LitLat BERT is distributed as neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). The corpora used for training contain 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion Lithuanian, and 0.53 billion Latvian. LitLat BERT is based on the XLM-RoBERTa model and comes in two versions, one for use with the transformers library (https://github.com/huggingface/transformers) and one for use with the fairseq library (https://github.com/pytorch/fairseq). More information is in the readme.txt.
  • NeMo Conformer CTC BPE E2E Automated Speech Recognition service RSDO-DS2-ASR-E2E-API 1.1

    An automated speech recognition service for NeMo Conformer CTC BPE E2E models. For details on building such models, see the official NVIDIA NeMo documentation (https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html) and the NVIDIA NeMo GitHub repository (https://github.com/NVIDIA/NeMo). A model for automated speech recognition of Slovene speech can be downloaded from http://hdl.handle.net/11356/1740. The service accepts audio files in WAV format (16 kHz, 16-bit PCM, mono); the maximum accepted audio duration is 300 s. Note that transcribing one 300 s audio file on CPU will use all available cores, consume up to 16 GB of RAM, and may take ~180 s (on a system with 24 vCPUs). See the service README.md for further details.
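    The input constraints above (WAV, 16 kHz, 16-bit PCM, mono, at most 300 s) can be checked locally before uploading. A minimal sketch using only the Python standard library; the `check_wav` helper is hypothetical and the service itself is not called:

```python
# Pre-flight check for the service's stated input constraints:
# WAV container, 16 kHz sample rate, 16-bit PCM, mono, <= 300 s.
import io
import wave

MAX_SECONDS = 300

def check_wav(data: bytes) -> list:
    """Return a list of constraint violations (empty list = acceptable)."""
    problems = []
    with wave.open(io.BytesIO(data)) as w:
        if w.getnchannels() != 1:
            problems.append("audio must be mono")
        if w.getframerate() != 16000:
            problems.append("sample rate must be 16 kHz")
        if w.getsampwidth() != 2:  # sample width in bytes
            problems.append("samples must be 16-bit PCM")
        if w.getnframes() / w.getframerate() > MAX_SECONDS:
            problems.append("duration exceeds 300 s")
    return problems
```

    Longer recordings would need to be split into chunks of at most 300 s before submission.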
  • Dependency tree extraction tool STARK 3.0

    STARK is a highly customizable tool for extracting different types of syntactic structures (trees) from parsed corpora (treebanks), aimed at corpus-driven linguistic investigations of syntactic and lexical phenomena of various kinds. It takes a treebank in the CoNLL-U format as input and returns a list of all relevant dependency trees with frequency information and other useful statistics, such as the strength of association between the nodes of a tree, or its significance in comparison to another treebank. For installation, execution, and the description of various user-defined parameter settings, see the official project page at https://github.com/clarinsi/STARK. An online demo version of the tool is available at https://orodja.cjvt.si/stark/. In comparison to v2, this version introduces several new features and improvements, such as the ability to extract very long trees, ignore irrelevant relations, process multi-root treebanks, and handle special operators when querying.
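    For readers unfamiliar with the input format: CoNLL-U encodes one token per line with ten tab-separated columns, sentences separated by blank lines, and the 7th and 8th columns holding the head index and dependency relation. The toy parser below only illustrates the format; it is not part of STARK and ignores multiword-token ranges and metadata beyond `#` comment lines.

```python
# Minimal illustration of the CoNLL-U format STARK consumes.
sample = """\
1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\tVBZ\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\tNNS\t_\t2\tobj\t_\t_
"""

def parse_conllu(text: str) -> list:
    """Return (id, form, head, deprel) tuples for one sentence."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        cols = line.split("\t")
        # cols: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return rows

tree = parse_conllu(sample)
# HEAD 0 marks the sentence root; here "reads" governs "She" and "books".
```

    A dependency (sub)tree of the kind STARK extracts is exactly such a set of tokens connected by their head links.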
  • Webrice extension (22.01)

    The Webrice plugin is a browser add-on that lets users listen to web pages instead of reading them: this Chrome extension converts selected Icelandic text to speech.
  • ABLTagger (PoS) - 3.0.0

    A Part-of-Speech (PoS) tagger for Icelandic. This submission contains pretrained models for ABLTagger v3.0.0 in two versions, small and large, which work with the revised tagset and achieve accuracies of ~96.7% and ~97.8%, respectively, on MIM-Gold (cross-validation, excluding "x" and "e" tags). For installation, usage, and other instructions, see https://github.com/icelandic-lt/POS. You should also check there whether a newer version is out (see README.md - versions). Contents on CLARIN: model files.
  • Icelandic TTS for Android (24.04.)

    The Símarómur application provides an Icelandic TTS voice for the Android TTS service, e.g. for use with a screen reader. The application provides access to one on-device voice. The app is developed with the needs of the blind and visually impaired in mind, i.e. the voice is lightweight and very fast. Furthermore, Símarómur includes a user dictionary that allows users to define their own pronunciation of words and abbreviations.
  • Tokenizer for Icelandic text (3.4.1) (2022-05-31)

    Tokenizer is a compact pure-Python (2.7 and 3) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation mark, number/amount, date, e-mail address, URL/URI, etc. It also segments the token stream into sentences, taking into account corner cases such as abbreviations and dates in the middle of sentences. More information at: https://github.com/mideind/Tokenizer
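    To illustrate the abbreviation corner case mentioned above: a naive splitter would end a sentence at every period, including the one in "t.d." ("e.g."). The toy segmenter below, with a hard-coded abbreviation list, is only a sketch of the idea and is not the Tokenizer library's actual algorithm.

```python
# Toy sentence segmenter: a period ends a sentence unless the word
# carrying it is a known abbreviation. NOT the Tokenizer library's
# algorithm, just an illustration of the corner case.

ABBREVIATIONS = {"t.d.", "þ.e.", "o.fl."}  # Icelandic "e.g.", "i.e.", "etc."

def split_sentences(text: str) -> list:
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith(".") and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing words without a final period
        sentences.append(" ".join(current))
    return sentences
```

    The real library handles far more cases (dates mid-sentence, amounts, URLs) via proper tokenization rather than whitespace splitting.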