Active filters:

  • Resource type: Unspecified
  • Organisation: Vytautas Magnus University
8 record(s) found

Search results

  • Lithuanian speech-to-text Transcriber

    The Lithuanian automatic speech-to-text Transcriber is a containerized application implemented as 17 containers. It covers four domains: administrative, legal, medical and general spoken language. To install the Transcriber, we recommend first installing Docker and Docker Compose. A demo service is available at https://semantika.lt/Analysis/Transcriber, and IT solutions can be found at https://semantika.lt/Help/Info/Solutions. The transcription result is a set of files containing the same information in different formats: plain text, a WebVTT file (for subtitling) and a file with data that synchronizes the transcription with the audio recording. The latter file is intended for convenient editing of the audio-synchronized transcription; a transcription editor for this purpose is available at http://semantikadocker.vdu.lt/files/transcription-editor-multi.zip.
  • The Database of Lithuanian multiword expressions

    The Database of Lithuanian multiword expressions (MWEs) has been freely accessible for online search since 2019 at https://resursai.pastovu.vdu.lt/paieska/paprastoji. It contains two-word and three-word MWEs extracted from the DELFI.lt corpus, which represents news texts on various topics (https://klc.vdu.lt/pastovuSearch.html). Initially, 12,000 MWEs (mostly collocations, a few idioms) were included in the database. In 2022, the database was updated by adding new collocations from the same corpus and by filtering arbitrary collocations: out of approx. 19,000 collocations, approx. 9,000 are marked as arbitrary collocations, i.e., as having lexical collocability restrictions. The database provides rich information about the usage of collocations: lemma, word forms, frequencies (in the DELFI.lt corpus), morphological information, syntactic relations, grammatical variants, text genres, and usage examples. Cases of usage variation are also illustrated, for example, word order changes or insertions between collocation constituents.
  • Lithuanian keyboard for macOS users

    This keyboard driver gives easy access to Lithuanian letters via the conventional keyboard layout, also known as „Lithuanian letters instead of numbers“. An essential new feature of this layout is its extensive use of the "dead key" technique to type single letters of the following kinds: Lithuanian accented (ą̃, ū́, m̃, ė́ etc.), Latvian, Estonian, Polish, French, German, Scandinavian, Ancient Greek, and Russian.
  • DIGIRES COVID-19 ML Dataset v.1

    DIGIRES COVID-19 ML dataset v.1 is a tab-separated (.tsv) file prepared for training machine learning algorithms. The training dataset was compiled from various publicly available Lithuanian online media sources. It contains 351 records and has the following attributes:

      "Title": the title of a news article
      "Text": the text of the article
      "Label": a label that marks the article as 1 (unreliable) or 0 (reliable)

    "Unreliable" marks articles that professional fact checkers identified as fake news; "reliable" marks trustworthy articles.

      Class        Labels   Word tokens
      Reliable     175      67902
      Unreliable   176      118747
      Total        351      186649
  • Pedagogic Corpus of Lithuanian

    The Pedagogic Corpus of Lithuanian is a monolingual specialized corpus prepared for learning and teaching Lithuanian in a foreign language classroom. It includes authentic Lithuanian texts, selected using criteria such as learner-relevant communicative function and genre. Both spoken and written language are represented. The size of the corpus is 669,000 tokens: 111,000 tokens from texts and spoken language for levels A1-A2 and 558,000 tokens from texts and spoken language for levels B1-B2 (according to the Common European Framework of Reference for Languages). The spoken component constitutes approx. 7.5% of the corpus. The written part of the corpus (620,000 tokens) includes levelled texts from coursebooks and unlevelled texts from other sources. These texts can be classified into 29 text types (dialogs, narratives, information, etc.) and 4 groups according to their communicative aims: informational texts, educational texts, advertising and fiction.

    There are two types of search in the corpus: simple and advanced (see „Search Tips“). Simple Search finds instances of a search item (word form, lemma, two words) in the whole corpus or in a particular part of it (spoken or written texts). After selecting the written subcorpus, you can further select the text type (coursebooks or non-coursebook texts) and/or the genre of the written texts. Advanced Search offers all the features of Simple Search plus some additional options. Since the Pedagogic Corpus is morphologically annotated, Advanced Search lets you search by grammatical features (e.g. part of speech, case, number, verb form, etc.).

    At https://kalbu.vdu.lt/mokymosi-priemones/mokomasis-tekstynas/ you can find truncated wordlists: lists of lemmas and word forms (for the whole corpus, for the spoken and written components, and for each level) and lists for particular parts of speech in the whole corpus. The lists can be downloaded as .xlsx files.

    REFERENCE
    Kovalevskaitė, Jolanta and Rimkutė, Erika. "Pedagogic Corpus of Lithuanian: A New Resource for Learning and Teaching Lithuanian as a Foreign Language." Sustainable Multilingualism, vol. 17, no. 1, 2020, pp. 197-230. https://doi.org/10.2478/sm-2020-0019
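
The DIGIRES COVID-19 ML Dataset v.1 record above describes a tab-separated file with "Title", "Text" and "Label" attributes. As a rough illustration of how such a file might be read and checked against the published class counts, here is a minimal Python sketch. It assumes that the file starts with a header row naming exactly those three columns and that labels are stored as the strings 0 and 1; neither detail is confirmed by the record. The sample rows are made up, and the whitespace-based token count is a stand-in that need not match the tokenizer behind the published figures.

```python
import csv
import io
from collections import Counter

# Label values as described in the record: 1 = unreliable, 0 = reliable.
LABEL_NAMES = {"0": "reliable", "1": "unreliable"}


def read_records(stream):
    """Read tab-separated rows with the columns Title, Text and Label.

    Assumes the first row is a header naming those three columns.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def summarize(records):
    """Count records and whitespace-separated tokens per class."""
    counts = Counter()
    tokens = Counter()
    for row in records:
        name = LABEL_NAMES[row["Label"].strip()]
        counts[name] += 1
        # Whitespace splitting is an assumption; the published token
        # counts may come from a different tokenizer.
        tokens[name] += len(row["Text"].split())
    return counts, tokens


# Tiny made-up sample in the assumed layout (not real dataset content).
SAMPLE = (
    "Title\tText\tLabel\n"
    "Pavyzdys A\tpirmas bandomasis tekstas\t0\n"
    "Pavyzdys B\tantras tekstas\t1\n"
    "Pavyzdys C\tdar vienas trumpas bandomasis tekstas\t1\n"
)

counts, tokens = summarize(read_records(io.StringIO(SAMPLE)))
print(dict(counts))
print(dict(tokens))
```

To run this on the real file, the in-memory sample could be replaced with a file opened in UTF-8 with newline="" (the file name is up to the user, as the record does not state one). If the assumptions hold, the record counts should come out as 175 reliable and 176 unreliable; the token totals may differ from the published ones depending on tokenization.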