703 record(s) found

Search results

  • CMDI Registry/editor

    The CMDI Registry/Editor allows metadata modelers and editors to share and reuse CMDI metadata schemas or build new ones based (partly) on previous work. CMDI is the metadata framework adopted by CLARIN. Metadata for language resources and tools exists in a multitude of formats. Often these descriptions contain specialized information for a specific research community (e.g. TEI headers for text, IMDI for multimedia collections). To overcome this dispersion, CLARIN has initiated the Component MetaData Infrastructure (CMDI). It provides a framework to describe and reuse metadata components, which are pieces of a complete metadata schema. Such components can be grouped into a ready-made description format (a “profile”). Both are stored and shared with other users in the Component Registry to promote reuse. Such metadata profiles (each equivalent to a metadata schema) can be used to instantiate metadata descriptions that describe language resources. The CMDI approach combines architectural freedom when modeling the metadata with powerful exploration and search possibilities over a broad range of language resources. The CMDI Registry development was supported by both the Dutch and German national CLARIN projects.
    Windhouwer, M, Indarto, E and Broeder, D. 2017. CMD2RDF: Building a Bridge from CLARIN to Linked Open Data. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 95–103. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.8. License: CC-BY 4.0
  • COAVA: Cognition, Acquisition and Variation Tool

    In COAVA two sets of databases are made available in a standardized way: one with historical dialect data (the databases WBD and WLD with lexical data of the Brabantish and Limburgian dialects between 1880 and 1980) and one with first language acquisition data (four databases from the CHILDES project). The databases contain linguistic information (dialect form, standardised form (“Dutchified”), lexical meaning), geographical information (locality, dialect area, province) and information on the source (inquiry forms or monotopic dictionaries and the date of documentation). The visualisation of the first two sets of information will lead to lexical maps. The most typical way for the user to get to the data will be the browsable concept taxonomy. In other words, the databases are approachable via search tools but also via a thematic taxonomy. This taxonomy was developed for the dialect databases and covers the general vocabulary. COAVA (COgnition, Acquisition and VAriation Tool) brings together two strange bedfellows: first language acquisition and historical dialectology. In historical linguistics there is a common assumption that language change in the past is due to non-target-like transmission of linguistic features between generations, i.e. between parents and children. Despite this assumption, the two disciplines remain isolated from each other due to, among other things, different methods of data collection and different types of resources with empirical data. The aim of the COAVA project was to demonstrate that this common assumption in historical linguistics can be examined in detail with the help of Digital Humanities. This interdisciplinary research aims at the development of a tool for easily exploring the linguistic characteristics of concepts.
    Leonie Cornips, Jos Swanenberg, Wilbert Heeringa, Folkert de Vriend (2016). The relationship between first language acquisition and dialect variation: Linking resources from distinct disciplines in a CLARIN-NL project. Lingua, Vol. 178, 07.2016, p. 32-45. doi:10.1016/j.lingua.2015.11.007
    Cornips, L., Swanenberg, J., Vriend, F. de, Heeringa, W. (2012), Is what we have acquired early, less vulnerable to variation? A comparison between data from dialectlexicography and data from first language acquisition. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2012/10/Abstract-SIDG-2-JS.pdf
    Cornips, L., Kemps-Snijders, M., Snijders, M., Swanenberg, J. and Vriend, F. de (2011). Bridging the Gap between First Language Acquisition and Historical Dialectology with the Help of Digital Humanities. SDH Copenhagen. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2011/11/Paper-SDH.pdf
  • COBWWWEB: Connections Between Women and Writings Within European Borders

    The WomenWriters database includes biographical data on more than 4,000 authors and over 22,000 references to reception data found in sources such as the periodical press, early literary history and private correspondence. A significant part of the dataset was collected in the NWO digitizing project The International Reception of Women’s Writing (2004-2007), focusing on authors received in the Netherlands. A second NWO internationalising project called New approaches to European Women’s Writing (2007-2010) and the subsequent COST Action Women Writers in History (2009-2013) brought together a large international community of scholars and used the Dutch data collection as an example for other colleagues. COBWWWEB connects the various national projects on this subject into a growing international data network. A virtual research environment on top of this network makes all material from participating data providers accessible for European and interdisciplinary research.
  • Automatic Transcription of Oral History Interviews

    This web service and web application use automatic speech recognition to transcribe recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.
  • Stylene, a robust, modular system for stylometry and readability research

    Stylene is a robust, modular system for stylometry and readability research, built on existing techniques for automatic text analysis and machine learning, together with a web service that allows researchers in the humanities and social sciences to analyze texts with this system. In this way, the project makes recent advances in research on the computational modeling of style and readability available to researchers. Background: Stylene consists of an educational demonstration interface and tools for stylometry (authorship attribution and profiling) and readability research for Dutch. The system comprises a popularization interface for learning to understand stylometric analysis, and web-based interfaces to software for readability and stylometry research, aimed at researchers from the humanities and social sciences who do not want to develop or install such software themselves. Stylene has been created in the context of CLARIN Flanders.
    Daelemans, W, De Clercq, O and Hoste, V. 2017. Stylene: an Environment for Stylometry and Readability Research for Dutch. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 195–209. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.16. License: CC-BY 4.0
  • TQE: Transcription Quality Evaluation

    The Transcription Quality Evaluation (TQE) tool is an instrument that automatically evaluates the quality of phonetic transcriptions. The application makes it possible to upload pairs of files consisting of an audio file and a transcription file and process them as follows: the audio signal and the phonetic transcription are aligned, segment boundaries are derived for each phone, and for each segment-phone combination it is determined how well they fit together, i.e. for each phone a TQE measure (a confidence measure) is determined, a number ranging from 0-100%, indicating how good the fit is, i.e. the quality of the phone transcription. The higher the number, the better the fit. The output of the TQE tool consists of a TQE measure and the segment boundaries for each phone in the corpus. The TQE tool thus makes it possible to find (sequences of) segments for which the match of the phone symbols with the audio signal is not optimal, in other words, the TQE tool can be used to check the quality of phonetic transcriptions. This can be useful for validating (manual) phonetic transcriptions, but also to compare and select (‘competing’) transcriptions, e.g. to study pronunciation variation. The TQE tool can thus be usefully applied in all research – in various (sub-) fields of humanities and language and speech technology (L&ST) – in which audio and phonetic transcriptions are involved.
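    The kind of check described above (finding sequences of segments whose phone symbols fit the audio poorly) can be sketched in a few lines. This is only an illustration under stated assumptions: the `PhoneSegment` record, the `low_confidence_runs` function and the threshold of 50 are hypothetical and are not part of the TQE tool or its output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhoneSegment:
    """Hypothetical record for one phone: boundaries in seconds, TQE in percent."""
    phone: str
    start: float
    end: float
    tqe: float  # confidence measure, 0-100; higher means a better fit


def low_confidence_runs(segments: List[PhoneSegment],
                        threshold: float = 50.0) -> List[List[PhoneSegment]]:
    """Return maximal runs of consecutive segments whose TQE is below threshold."""
    runs: List[List[PhoneSegment]] = []
    current: List[PhoneSegment] = []
    for seg in segments:
        if seg.tqe < threshold:
            current.append(seg)      # extend the current low-confidence run
        elif current:
            runs.append(current)     # a well-fitting phone closes the run
            current = []
    if current:                      # run that reaches the end of the input
        runs.append(current)
    return runs


# Invented example values, for illustration only.
example = [
    PhoneSegment("h", 0.00, 0.10, 90.0),
    PhoneSegment("a", 0.10, 0.20, 30.0),
    PhoneSegment("l", 0.20, 0.30, 40.0),
    PhoneSegment("o", 0.30, 0.40, 85.0),
    PhoneSegment("x", 0.40, 0.50, 10.0),
]
flagged = low_confidence_runs(example)
```

    Each returned run carries the segment boundaries, so a reviewer could jump straight to the corresponding stretch of audio; the threshold would have to be chosen per corpus.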
  • ISOcat

    This service is no longer operational! The ISO TC37 Data Category Registry (DCR) was created in 2008 as one of the first ISO standards delivered in the form of a database (ISOcat). The Max Planck Institute for Psycholinguistics (MPI) provided development, hosting, and support services and acted as the Registration Authority (RA) until December 2014. For users from the European CLARIN research infrastructure, the Meertens Institute develops and hosts a new registry for CLARIN-relevant concepts based on the corresponding ISOcat data categories, such as those used for the Component MetaData Infrastructure (CMDI). This can be found here: http://portal.clarin.nl/node/4216. ISO 12620 provides a framework for defining data categories compliant with the ISO/IEC 11179 family of standards. According to this model, each data category is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the data category. In addition, data category specifications in the DCR contain linguistic descriptions, such as data category definitions, statements of associated value domains, and examples. Data category specifications can be associated with a variety of data element names and with language-specific versions of definitions, names, value domains and other attributes. For now the entries of the Data Category Registry are still available in static form, i.e. they can no longer be changed. All Data Category Persistent IDentifiers, e.g. http://www.isocat.org/datcat/DC-4146, remain resolvable. The public part of the registry can be browsed via the Guest workspace: http://www.isocat.org/rest/user/guest/workspace. The new location for this data category registry is http://www.datcatinfo.net/.
  • Fast and easy development of pronunciation lexicons for names

    The AUTONOMATA transcription tool set consists of a transcription tool and learning tools, with which one can enrich word lists with precise pronunciation information. The tool uses a general grapheme-to-phoneme converter (the g2p-converter).
    This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service providing POI (Point-of-Interest) information. In collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.
  • Namescape Named Entity Recognition

    Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data. Datasets in NameScape (total of 1,129 books): Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009. Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution. Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus do not allow distribution. Corpus SoNaR Books: 105 Dutch books; NE tagged. Corpus Gutenberg Dutch: Consists of 530 NE tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents. Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of e.g. what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet.
    NameScape aims to fill this need by providing a substantial amount of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.
    de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0
    Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47
  • DuELME: Search interface to the Dutch Electronic Lexicon of Multiword Expressions

    The DuELME search interface provides access to the DuELME electronic lexicon, which contains more than 5,000 Dutch multiword expressions (MWEs). MWEs with the same syntactic pattern are grouped in the same equivalence class. The search interface enables users to search for MWEs on the basis of a range of syntactic and semantic criteria, among them expression, pattern id, written form, type, conjugation, polarity, parameters and form. Extensive documentation on the structure of the database is available. DuELME (Dutch Electronic Lexicon of Multiword Expressions) is one of the results of the project Identification and Representation of Multiword Expressions (IRME). The lexical descriptions are intended to be highly theory- and implementation-neutral. The DuELME LMF lexicon is suitable for theoretical research on multiword expressions as well as for use in NLP systems. The DuELME-LMF project has been carried out within the CLARIN-NL programme.
    Grégoire, N. (2009), Untangling Multiword Expressions. A study on the representation and variation of Dutch multiword expressions, PhD thesis, University of Utrecht.
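    The grouping of MWEs by syntactic pattern mentioned in the description can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `group_by_pattern` function, the pattern ids and the sample entries are invented and do not reflect the actual DuELME database structure or content.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_pattern(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (expression, pattern_id) pairs into equivalence classes by pattern id."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for expression, pattern_id in entries:
        classes[pattern_id].append(expression)
    return dict(classes)


# Invented sample: the pattern ids are hypothetical placeholders.
sample = [
    ("de plaat poetsen", "pattern-1"),
    ("de pijp uitgaan", "pattern-1"),
    ("in de war", "pattern-2"),
]
equivalence_classes = group_by_pattern(sample)
```

    Searching by pattern id then amounts to a dictionary lookup, e.g. `equivalence_classes["pattern-1"]`.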