CLARIN Tool Portal

Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing

1 resources

A demo of a speech recognizer for POIs (Points of Interest). The demo recognizes accommodation addresses and eateries in several large cities (including Amsterdam, Antwerpen, Gent and Rotterdam).

This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service that provides POI (Point of Interest) information. It is carried out in collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

RemBench - a Digital Workbench for Rembrandt Research

1 resources

RemBench lets users search and browse works of art, artists, primary sources and library sources related to Rembrandt, using faceted search by location, author/artist name, author/artist type and date range, and/or exact and fuzzy keyword search. It offers both a web application and a RESTful web service. RemBench combines the content of four different databases behind one search interface: RKDartists and RKDimages, two databases maintained by the Netherlands Institute for Art History (RKD); RemDoc, a collection of original documents related to Rembrandt van Rijn from the period between 1475 and circa 1750; and RUQuest, a library system that provides access to full-text articles, as well as the complete collection of (e-)books and journals from the Radboud University Library Catalogue. RemBench does not influence the content of these databases.
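The combination of facets with exact or fuzzy keyword matching can be illustrated with a minimal sketch. It does not use the RemBench web service: the `Record` fields, the `search` function, the 0.8 similarity threshold and the sample records are all assumptions made up for illustration, and the fuzzy matching shown here (standard-library `difflib`) is not necessarily the method RemBench uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical record shape; the real RemBench schema may differ.
    title: str
    creator: str
    location: str
    year: int


def fuzzy_match(keyword, text, threshold=0.8):
    """Return True if any word in text is similar enough to keyword."""
    keyword = keyword.lower()
    return any(
        SequenceMatcher(None, keyword, word).ratio() >= threshold
        for word in text.lower().split()
    )


def search(records, location=None, year_range=None, keyword=None, fuzzy=False):
    """Apply each supplied facet in turn; omitted facets do not filter."""
    results = []
    for rec in records:
        if location is not None and rec.location != location:
            continue
        if year_range is not None and not (year_range[0] <= rec.year <= year_range[1]):
            continue
        if keyword is not None:
            if fuzzy:
                if not fuzzy_match(keyword, rec.title):
                    continue
            elif keyword.lower() not in rec.title.lower().split():
                continue
        results.append(rec)
    return results


# Invented sample data, not taken from any of the four databases.
records = [
    Record("Sample etching", "Artist A", "Amsterdam", 1640),
    Record("Sample drawing", "Artist B", "Leiden", 1630),
]

by_place = search(records, location="Amsterdam")
by_typo = search(records, keyword="etchng", fuzzy=True)  # misspelled on purpose
```

With exact matching the misspelled keyword finds nothing, while the fuzzy variant still retrieves the etching record; a date-range facet such as `year_range=(1625, 1635)` narrows the same list independently.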

Verberne, S, van Leeuwen, R, Gerritsen, G and Boves, L. 2017. RemBench: A Digital Workbench for Rembrandt Research. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 337–350. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.28. License: CC-BY 4.0

Namescape Barcode Browser

1 resources

Searching and visualizing named entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enable quantitative and repeatable research where previously only guesswork and anecdotal evidence were available. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009 will.
- Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus does not allow distribution.
- Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus does not allow distribution.
- Corpus SoNaR Books: 105 Dutch books; NE tagged.
- Corpus Gutenberg Dutch: Consists of 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has shown that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale does not exist yet. NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

user documentation

1 resources

Search application for the collection of all 13th-century texts that served as source material for the Early Middle Dutch Dictionary. The source files of the corpus are available upon request.

DuELME: Search interface to the Dutch Electronic Lexicon of Multiword Expressions

1 resources

The DuELME search interface provides access to the DuELME electronic lexicon, which contains more than 5,000 Dutch multiword expressions (MWEs). MWEs with the same syntactic pattern are grouped in the same equivalence class. The search interface enables users to search for MWEs on the basis of a range of syntactic and semantic criteria, including expression, pattern id, written form, type, conjugation, polarity, parameters and form. Extensive documentation on the structure of the database is available. DuELME (Dutch Electronic Lexicon of Multiword Expressions) is one of the results of the project Identification and Representation of Multiword Expressions (IRME). The lexical descriptions are designed to be highly theory- and implementation-neutral. The DuELME LMF lexicon is suitable both for theoretical research on multiword expressions and for use in NLP systems. The DuELME-LMF project was carried out within the CLARIN-NL programme.

Grégoire, N. (2009), Untangling Multiword Expressions. A study on the representation and variation of Dutch multiword expressions, PhD thesis, University of Utrecht.

CMDI Registry/editor

1 resources

The CMDI Registry/Editor allows metadata modelers and editors to share and reuse CMDI metadata schemas or to build new ones (partly) based on previous work. CMDI is the metadata framework adopted by CLARIN. Metadata for language resources and tools exists in a multitude of formats. Often these descriptions contain specialized information for a specific research community (e.g. TEI headers for text, IMDI for multimedia collections). To overcome this dispersion, CLARIN has initiated the Component MetaData Infrastructure (CMDI). It provides a framework to describe and reuse metadata components, which are pieces of the complete metadata schema. Such components can be grouped into a ready-made description format (a “profile”). Both components and profiles are stored and shared with other users in the Component Registry to promote reuse. A metadata profile (equivalent to a metadata schema) can be used to instantiate metadata descriptions that describe language resources. The CMDI approach combines architectural freedom in modeling the metadata with powerful exploration and search possibilities over a broad range of language resources. The CMDI Registry development was supported by both the Dutch and German national CLARIN projects.
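The relationship between components, profiles and metadata descriptions can be sketched as a toy data model. This is only an illustration of the reuse idea under stated assumptions: it does not reproduce the actual CMDI specification format or the Component Registry API, and the class names, component names (`Creator`, `Language`, `AudioFormat`) and field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A reusable block of metadata fields (names here are made up)."""
    name: str
    elements: tuple


@dataclass
class Profile:
    """A ready-made description format assembled from components."""
    name: str
    components: list

    def field_names(self):
        return [f"{c.name}.{e}" for c in self.components for e in c.elements]

    def instantiate(self, values):
        """Build one metadata description; reject fields the profile lacks."""
        allowed = self.field_names()
        unknown = sorted(set(values) - set(allowed))
        if unknown:
            raise ValueError(f"fields not in profile {self.name!r}: {unknown}")
        return {name: values.get(name) for name in allowed}


# A toy registry: components are stored once and shared between profiles.
registry = {
    "Creator": Component("Creator", ("name", "affiliation")),
    "Language": Component("Language", ("iso_code",)),
    "AudioFormat": Component("AudioFormat", ("sample_rate",)),
}

text_profile = Profile("TextCorpus", [registry["Creator"], registry["Language"]])
speech_profile = Profile(
    "SpeechCorpus",
    [registry["Creator"], registry["Language"], registry["AudioFormat"]],
)

# Instantiate a description from the text profile (invented values).
record = text_profile.instantiate(
    {"Creator.name": "A. Author", "Language.iso_code": "nld"}
)
```

Both profiles reference the same `Creator` and `Language` objects, which mirrors the point made above: a component is defined once and reused, while each profile fixes which fields a description may contain.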

Windhouwer, M, Indarto, E and Broeder, D. 2017. CMD2RDF: Building a Bridge from CLARIN to Linked Open Data. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 95–103. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.8. License: CC-BY 4.0

TQE: Transcription Quality Evaluation

1 resources

The Transcription Quality Evaluation (TQE) tool automatically evaluates the quality of phonetic transcriptions. Users upload pairs of files, each consisting of an audio file and a transcription file, which are processed as follows: the audio signal and the phonetic transcription are aligned, segment boundaries are derived for each phone, and for each segment-phone combination it is determined how well the two fit together. That is, for each phone a TQE measure (a confidence measure) is computed: a number ranging from 0 to 100% that indicates how good the fit is, i.e. the quality of the phone transcription. The higher the number, the better the fit. The output of the TQE tool consists of a TQE measure and the segment boundaries for each phone in the corpus. The TQE tool thus makes it possible to find (sequences of) segments for which the match between the phone symbols and the audio signal is not optimal; in other words, it can be used to check the quality of phonetic transcriptions. This is useful for validating (manual) phonetic transcriptions, but also for comparing and selecting (‘competing’) transcriptions, e.g. to study pronunciation variation. The TQE tool can therefore be applied in any research, in the various (sub)fields of the humanities and of language and speech technology (L&ST), that involves audio and phonetic transcriptions.
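Finding (sequences of) segments with a poor fit can be sketched as a simple post-processing step over the per-phone output. The actual TQE output format is not described here, so the `PhoneSegment` layout, the `low_quality_runs` helper, the cut-off of 50 and the sample values below are assumptions for illustration only.

```python
from typing import NamedTuple


class PhoneSegment(NamedTuple):
    # Hypothetical layout; the real TQE output format may differ.
    phone: str
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    tqe: float    # TQE measure, 0-100 (higher means a better fit)


def low_quality_runs(segments, threshold=50.0):
    """Group consecutive segments whose TQE measure is below threshold."""
    runs, current = [], []
    for seg in segments:
        if seg.tqe < threshold:
            current.append(seg)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


# Invented sample values, not real TQE output.
segments = [
    PhoneSegment("h", 0.00, 0.08, 91.0),
    PhoneSegment("E", 0.08, 0.15, 42.0),
    PhoneSegment("l", 0.15, 0.21, 37.5),
    PhoneSegment("o", 0.21, 0.35, 88.0),
    PhoneSegment("w", 0.35, 0.40, 12.0),
]

for run in low_quality_runs(segments):
    phones = " ".join(seg.phone for seg in run)
    print(f"{run[0].start:.2f}-{run[-1].end:.2f}s: {phones}")
```

Grouping adjacent low-scoring phones into runs, rather than listing them one by one, points a transcriber at the stretch of audio that needs checking; the threshold would have to be tuned to the data at hand.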

ePistolarium: A Web-based Humanities’ Collaboratory on Correspondences

1 resources

Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic (CKCC) investigates the circulation of knowledge in the 17th-century Dutch Republic. A multidisciplinary project team of historians, literature researchers, linguists and computer scientists works together in this project and has created a web-based Humanities’ Collaboratory on Correspondences. The project is carried out thanks to an NWO Medium investment subsidy, with CLARIN subsidies to make the resources available within the CLARIN domain.

A consortium of Dutch universities and cultural heritage institutions is building a web-based collaboratory (an online space for asynchronous collaboration) around a corpus of 20,000 letters of scholars who lived in the 17th-century Dutch Republic, in order to answer the research question: how did knowledge circulate in the 17th century? To this end, this large body of correspondence must be analyzed systematically. Based on this (extendable) corpus, we will implement a content processing workflow that consists of iterative cycles of conceptual analysis, enrichment with several layers of annotation, and visualization.

In the first stage of the project, with advice from CLARIN-EU, a demonstrator was developed that implements keyword extraction techniques. The second stage consists of evaluating existing, more complex tools and techniques that can tackle one or more aspects of the targeted grammatical, content-related, and network complexity analysis, annotation, and visualization. This stage shall identify a set of tools that can be readily used in CKCC, as well as tools that need to be adapted or extended to the needs of CKCC; in short, by the end of this stage the resources, requirements and risks shall be clear (deadline: December 2010). In the third stage the collaboratory is developed further according to the description in the CKCC project goals, centering on the technique of concept extraction.

These three stages constitute the Work Package Analysis Tools, the core of the CKCC project, which was supported by CLARIN-NL. Other Work Packages provide the data and software tools needed to create a complete system: the digital corpus of letters (WP6), the editing collaboratory that will contain the letters (WP1), and the archiving environment for data and software (WP2).

Ravenek, W, van den Heuvel, C and Gerritsen, G. 2017. The ePistolarium: Origins and Techniques. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 317–323. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.26. License: CC-BY 4.0

PILNAR: Pilgrimage Narratives - a corpus for studying the profile of the modern pilgrim

1 resources

A corpus of pilgrimage narratives with Dutch texts written after ca. 2000 that present the thoughts and impressions of pilgrims to Santiago de Compostela. The PILNAR corpus is a source for research in a variety of (sub)disciplines: culture studies, ritual and religious studies, but also media and e-culture studies (cf. the use of blogs and other social media for the self-presentation of experiences). Only for authorized users.

The PILNAR corpus contains six subcorpora:
- Volumes of De Jacobsstaf 1986-: 84 PDF files;
- Volumes of De Pelgrim of the Flemish Society of Santiago de Compostella nos. 1-4 (16 MB, 10 MB, 16 MB) (the two societies collaborate closely);
- Volumes of Ultreia, a newsletter; 3 issues available now: January, February, April 2011;
- Pilgrimage accounts and blogs by pilgrims available via the Societies (Netherlands: circa 140 files; Flemish: circa 138 files);
- A corpus of pilgrimage narratives compiled on the occasion of the exhibition in Museum Catharijneconvent held in collaboration with the Society: www.pelgrimsverhalen.nl; already on the site now: about 180 fields (as of July 2011);
- Accounts and narratives that come in after a specially targeted notice via the site and periodical of the Society (De Jacobsstaf), with perhaps a Flemish companion piece (De Pelgrim).

BNM-I: Linked Data on Middle Dutch Sources Kept Worldwide

1 resources

Web application for consultation, using faceted search, and collaborative editing of the curated e-BNM collection of textual, codicological and historical information about thousands of Middle Dutch manuscripts kept worldwide. The Bibliotheca Neerlandica Manuscripta and Impressa collects and makes available information on medieval manuscripts produced in the Netherlands, regardless of where they are kept. Documentation activities concentrate on the Middle Dutch texts and their authors that have been transmitted in these manuscripts, on the individuals and institutions that were involved in the manuscript production (scribes, illuminators, monasteries), and on the former and present manuscript owners. Since 1991 two-thirds of this ‘paper’ information, checked and supplemented with information from recent publications, has been converted into electronic data and incorporated in a database (BNM-I), which can be searched online. In 2013 this database was converted in the e-BNM+ project into a flexible data structure that turned BNM-I into a key open access resource to which many other resources can easily be linked.

The new BNM-I:
- will be freely accessible for every user, anywhere in the world;
- can easily implement new contributions or corrections by scientists;
- can easily be linked to related databases (in the near future, cross-searching several databases in one interface will be possible);
- will be prepared for the inclusion of new data, such as research data on Middle Dutch texts that were printed before 1541 and the books in which they are preserved, and articles on Middle Dutch texts and their authors (associated with the current thesaurised information).
