Result filters

Metadata provider

Language

Resource type

Availability

Loading...
698 record(s) found

Search results

  • TTS Text Processing (22.10)

    ENGLISH: This project provides a TTS textprocessing pipeline for Icelandic. The pipeline includes modules for html parsing, text cleaning, text normalization for TTS, spell and grammar correction, phrasing, and grapheme-to-phoneme (g2p) conversion. Before a text can be fed into a TTS system it has to be converted into the format that was used when training that system. The format can be grapheme-based (i.e. alphabetic characters of the language in question are used as input) or phoneme-based (i.e. a phonetic alphabet like IPA or SAMPA are used as input). The TTS Textprocessing Pipeline for Icelandic offers both possibilities. ÍSLENSKA: Þessi hugbúnaðarpakki inniheldur textavinnslupípu fyrir íslenska talgervla. Pípan samanstendur af vinnslu html-skjala fyrir hljóðbækur, hreinsun texta, textanormun, stafsetningarleiðréttingu, innsetningu á þögnum og sjálfvirkri hljóðritun. Áður en hægt er að senda texta á talgervil þarf að forvinna hann, t.d. skrifa út tölustafi og skammstafanir, merkja inn þagnir og koma textanum að lokum á sama form og þjálfunargögn þess talgervils sem á að lesa textann. Yfirleitt eru talgervlar þjálfaðir á hljóðrituðum textum, þar sem textarnir eru hljóðritaðir skv. hljóðritunarstafrófum eins og IPA eða SAMPA, en einnig geta þeir verið þjálfaðir beint á textum skrifuðum með hefðbundnum bókstöfum. Textavinnslupípan býður upp á báða möguleika og einnig að vinna textann einungis að hluta.
  • The Trankit model for linguistic processing of written and spoken Slovenian 1.2

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. In contrast to the previous versions of this model (1.0, 1.1), the model 1.2 was trained on a new SST train-dev-test split introduced in UD v2.15.
  • ELMo embeddings models for seven languages

    ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. Corpora sizes used in training range from over 270 M tokens in Latvian to almost 2 B tokens in Croatian. About 1 million most common tokens were provided as vocabulary during the training for each language model. The model can also infer OOV words, since the neural network input is on the character level. Each model is in its own .tar.gz archive, consisting of two files: pytorch weights (.hdf5) and options (.json). Both are needed for model inference, using allennlp (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) python library.
  • VU University Diachronic News text Corpus

    The diachronic corpus has been brought in line with current standards and formats as used in the STEVIN Nederlandstalig Referentiecorpus (SoNaR, under development), which has been adapted to the more general FoLiA format (documented by Van Gompel, 2012). These standards and formats have been extended with new layers of annotation. As a result the corpus adheres to the current day CLARIN infrastructure.
  • OpenSONAR: a 500 MW reference corpus of Contemporary Written Dutch

    SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects which were awarded funding in the first call of proposals within the STEVIN programme. SONAR contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types including both texts from conventional media and texts from the new media. All texts except for texts from the social media (Twitter, Chat, SMS) have been tokenized, tagged for part of speech and lemmatized, while in the same set the Named Entities have been labelled. All annotations were produced automatically, no manual verification took place. The texts are enriched with several annotations (Part of Speech and lemma information) and are available as FoLiA xml files (folia.xml). The system relies on BlackLab server as back-end and WhiteLab as user-interface. OpenSONAR is an online application for exploration of and searching in the SoNaR corpus.
    van de Camp, M, Reynaert,MandOostdijk, N. 2017.WhiteLab 2.0: AWeb Interface for Corpus Exploitation. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 231–243. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.19. License: CC-BY 4.0
    de Does, J, Niestadt, J and Depuydt, K. 2017. Creating Research Environments with BlackLab. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 245–257. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.20. License: CC-BY 4.0
    Oostdijk, N., Reynaert, M., Hoste, V., Schuurman, I. (2013) The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch in: Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme (eds. P. Spyns, J. Odijk), Springer Verlag.
  • CLARIN Concept Registry

    The CCR is a concept registry according to the W3C SKOS recommendation. It was chosen by CLARIN to serve as a semantic registry to overcome semantic interoperability issues with CMDI metadata and different annotation tag sets used for linguistic annotation. The CCR is part of the CMDI metadata infrastructure. The W3C SKOS recommendation, and the OpenSKOS implementation thereof, provides the means for ‘data-sharing, bridging several different fields of knowledge, technology and practice’. According to this model, each concept is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the concept. In addition, concept specifications in the CCR contain linguistic descriptions, such as definitions and examples, and can be associated with a variety of labels. .
  • CMDI to RDF conversion

    There is growing amount of on-line information available in RDF format as Linked Open Data (LOD) and a strong community very actively promotes its use. The publication of information as LOD is also considered an important signal that the publisher is actively searching for information sharing with a world full of new potential users. Added advantages of LOD, when well used, are the explicit semantics and high interoperability. But the problematic modelling by non-expert users offsets these advantages, which is a reason why modelling systems as CMDI are used. The CMDI2RDF project aims to bring the LOD advantages to the CMDI world and make the huge store of CMDI information available to new groups of users and at the same time offer CLARIN a powerful tool to experiment with new metadata discovery possibilities. The CMD2RDFservice was created to allow connection with the growing LOD world, and facilitate experiments within CLARIN merging CMDI with other, RDF based, information sources. One of the promises of LOD is the ease to link data sets together and answer queries based on this ‘cloud’ of LOD datasets. Thus in the enrichment and use cases part of the project we looked at other datasets to link to the CLARIN joint metadata domain. We used the WALS N3 RDF dump for one of the use cases. Although it is in the end relatively easy to go from a specific typological feature to the CMD records via a shared URI, it still showcased a weakness of the Linked Data approach. One has to carefully inspect the property paths involved. And in this case the path was broken as there was no clear way to go from the WALS feature data to the WALS language info except for extracting the WALS language code from the feature URI pattern and insert it the language URI pattern. This showcases that although the big LOD cloud shows potential for knowledge discovery by crossing dataset boundaries, design decisions in the individual datasets can still hamper algorithms and manual inspection is needed. The CMD2RDF service was developed at the TLA/MPI for Psycholinguistics and DANS and later moved to Meertens Institute where the expertise remains.
  • Assamese POS Tagger

    Assamese POS tagger is a CRF++ based POS Tagger. CRF++ is a customizable open source Conditional Random Fields for tagging/labeling continuos text. CRF++ is implemented for generic purpose and can be applied to any natural language provided the tagset. CRF++ tool is designed in C++ language. ------- 1. These Assamese NLP resources including the Tools and Applications are developed during Research and Development Projects as well as Masters and Ph.D. thesis works. 2. These are mainly developed or generated at Gauhati University Department of Computer Science and Department of Information Technology. 3. These resources are used by students and researchers for further studies, researches, as well as for design and development of tools and applications. 4. Computational Linguistics in Assamese is not rich, and Natural Language Processing works have mainly started during last two decades, and most of the resources are first generation resources, and with ample scope for upgrading, enriching, and purifying. 5. These are very good and essential resources for all the researchers in Assamese NLP, as the language requires more and more NLP works to make Assamese a rich media for the digital world. 6. Anyone interested, or in need of such resources may express their interest for the required resources, and the way of availability will be advised/informed accordingly. 7. These are purely research materials and could only be used for further research only. 8. Researchers may visit the NLP Lab of Department of Information Technology, Gauhati University, Guwahati, India or contact us. 9. Researchers interested in collaborative works, and also students for project works, are welcome. 10. Contact person is Professor Shikhar Kr. Sarma, Department of Information Technology, Gauhati University, Guwahati 781014, Assam, India. Email- sks@gauhati.ac.in