CLARIN Tool Portal

Corpus of Contemporary Dutch

1 resource

The Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands, CLARIN) is a collection of more than 800,000 texts from newspapers, journals, TV news broadcasts and legal materials (1814-2013). The corpus was created by combining the older 5, 27 and 38 Million Words corpora and the PAROLE Corpus, supplemented by newspaper texts from NRC and De Standaard (until 2013). In addition, it contains corpus material from Suriname and the Dutch Antilles.

WebCelex

1 resource

WebCelex is a web-based interface to the CELEX lexical databases of English, Dutch and German. CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. For each language, the database contains detailed information on: orthography (variations in spelling, hyphenation), phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), morphology (derivational and compositional structure, inflectional paradigms), syntax (word class, word class-specific subcategorizations, argument structures) and word frequency (summed word and lemma counts, based on recent and representative text corpora).

Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

1 resource

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part-of-speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.

Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf

Frog (plain text input)

Frog (folia+xml input)

PICCL: Philosophical Integrator of Computational and Corpus Libraries

1 resource

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language and Natural Language Processing. It combines Tesseract optical character recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract is open-source software for optical character recognition. TICCL (Text-Induced Corpus Clean-up) is designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and optical character recognition (OCR) errors in texts. Errors occur when books or other texts are scanned from paper and the scans, i.e. images, are converted by machine into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect such errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
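The idea of folding variant spellings into a normalized form and summing their frequencies can be illustrated with a small Python sketch. This is not TICCL's own code or algorithm: it substitutes plain Levenshtein distance for whatever matching TICCL performs internally, and the function names, the distance limit and the frequency threshold are all assumptions made for the illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row of costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalise_counts(tokens, max_distance=2, min_canonical_freq=3):
    """Map rare word forms onto a frequent near neighbour and sum the counts."""
    raw = Counter(tokens)
    # Assumption of this sketch: frequent forms are trusted spellings.
    canonical = [w for w, n in raw.items() if n >= min_canonical_freq]
    mapping = {}
    for word, freq in raw.items():
        if freq >= min_canonical_freq:
            mapping[word] = word
            continue
        # Keep candidates within the distance limit; prefer the closest,
        # then the most frequent one.
        candidates = [(edit_distance(word, c), -raw[c], c) for c in canonical]
        candidates = [t for t in candidates if t[0] <= max_distance]
        mapping[word] = min(candidates)[2] if candidates else word
    normalised = Counter()
    for word, freq in raw.items():
        normalised[mapping[word]] += freq
    return mapping, normalised


# Toy corpus containing the OCR confusion of 'in' with 'm'.
tokens = ["regeering"] * 5 + ["regeermg"] + ["koning"] * 4 + ["konmg"]
mapping, counts = normalise_counts(tokens)
print(mapping["regeermg"], counts["regeering"])
```

Comparing every rare form against every frequent form is quadratic in the vocabulary size, so the sketch only suits small word lists.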

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

PICCL

Ucto Tokeniser

1 resource

Ucto tokenizes text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent: by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and it is easily extensible to other languages. It recognizes dates, times, units, currencies and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 output with NFC normalization and optionally accepts other encodings as input. Conversion to all lowercase or all uppercase is optional. Ucto supports FoLiA XML.
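To make the basic steps concrete, the following Python sketch separates words from punctuation, splits sentences, applies NFC normalization and optionally lowercases the text. It is a toy illustration using only the standard library, not Ucto's rule engine or API: the regular expression and the function name are assumptions, and it does not handle abbreviations, dates, currencies or quote spans, which Ucto's language-specific rules do.

```python
import re
import unicodedata

# A token is either a run of word characters (optionally joined by an
# apostrophe or hyphen) or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenise(text, lowercase=False):
    """Return a list of sentences, each a list of tokens."""
    # NFC composes base letters and combining marks into single characters.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        # Naive rule: every '.', '!' or '?' closes the current sentence.
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


for sentence in tokenise("Dit is een zin. En nog een!", lowercase=True):
    print(" ".join(sentence))
```

Because every full stop ends a sentence here, an abbreviation such as "e.g." would be split incorrectly, which is the kind of case that rule files are meant to cover.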

Ucto

Automatic Transcription of Oral History Interviews

1 resource

This web service and web application use automatic speech recognition to transcribe recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.

BNM-I: Linked Data on Middle Dutch Sources Kept Worldwide

1 resource

Web application for consultation, using faceted search, and collaborative editing of the curated e-BNM collection of textual, codicological and historical information about thousands of Middle Dutch manuscripts kept worldwide. The Bibliotheca Neerlandica Manuscripta and Impressa collects and makes available information on medieval manuscripts produced in the Netherlands, regardless of where they are kept. Documentation activities concentrate on the Middle Dutch texts and their authors that have been transmitted in these manuscripts, on the individuals and institutions that have been involved in the manuscript production (scribes, illuminators, monasteries) and on the former and present manuscript owners. Since 1991 two-thirds of this 'paper' information, checked and supplemented with information from recent publications, has been converted into electronic data and incorporated in a database (BNM-I), which can be searched online. In 2013 the e-BNM+ project converted this database into a flexible data structure that turned BNM-I into a key open-access resource to which many other resources can easily be linked. The new BNM-I:
- will be freely accessible for every user, anywhere in the world;
- can easily incorporate new contributions or corrections by scientists;
- can easily be linked to related databases, so that in the near future cross-searching several databases in one interface will be possible;
- will be prepared for the inclusion of new data, such as research data on Middle Dutch texts that were printed before 1541 and the books in which they are preserved, and articles on Middle Dutch texts and their authors (associated with the current thesaurised information).

VK: Verrijkt Koninkrijk (Enriched Kingdom)

2 resources

Dr Loe de Jong's Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog remains the most appealing history of German-occupied Dutch society (1940-1945). Published between 1969 and 1991, the 14 volumes, consisting of 30 parts and 18,000 pages, combine the qualities of an authoritative work for a general audience and an indispensable point of reference for scholars. In VK this corpus is enriched with:
- tokenization, sentence splitting, part-of-speech tagging and lemmatization, done with the FROG software from Tilburg University;
- named entity recognition, done using UvA's NE tagger, which was specially trained for Dutch within the Stevin DuoMan project;
- polarity tagging (positive/negative connotation of words), done using UvA's FietsTas software, which was developed for Dutch within the Stevin DuoMan project;
- named entity reconciliation by linking to Wikipedia, done using software developed by Edgar Meij (UvA).

REST web interface, HTTP GET

De Boer, V., J. van Doornik, L. Buitinck, K. Ribbens, and T. Veken. Enriched Access to a Large War Historical Text using the Back of the Book Index. Extended abstract presented at the Workshop on Semantic Web and Information Extraction (SWAIE 2012), Galway, Ireland, 9 October 2012.

L. Buitinck and M. Marx. Two-Stage Named-Entity Recognition Using Averaged Perceptrons. In Proceedings of NLDB, Groningen, Netherlands, 2012. http://link.springer.com/chapter/10.1007%2F978-3-642-31178-9_17

Alpino: a dependency parser for Dutch (CLST web service and application)

1 resource

This is a web service and web application for the Alpino dependency parser for Dutch, developed in the context of the PIONIER Project Algorithms for Linguistic Processing.

git clone --depth 1 git://urd.let.rug.nl/Alpino.git

Arthurian Fiction

1 resource

This research tool provides information on medieval Arthurian narratives and the manuscripts in which they are transmitted throughout Europe. The tool discloses a database consisting of linked records on over two hundred texts, more than a thousand manuscripts and two hundred persons. The database is a work in progress: a considerable number of records have yet to be completed, while fresh discoveries of narratives and manuscripts invite new entries. The compilers of the database hope that this tool will contribute to further research into Arthurian fiction as a pan-European phenomenon. The Arthurian Fiction web application enables searching for manuscripts, narratives and persons in Arthurian Fiction Data, the metadata database of Arthurian narratives and manuscripts. Each of these object types can be searched for using facets specific to the object type. These include:
- for manuscripts: institute, date, origin, physical form, extant leaves, leaf sizes, illustration type, scripts, scribe, patron and several more;
- for narratives: date, origin, languages, cycle, manuscript, author, patron, verse type, meter, length, intertextuality properties and many more;
- for persons: name, gender, subtype, background, manuscript, and narratives.
The user can, if desired, select a subset of the facets to work with. In addition, keyword search is possible for all fields, query results can be sorted by a variety of keys, and queries can be saved. There is also a web service with an API for the Arthurian Fiction narratives and manuscripts database. This web service makes use of SOLR queries via HTTP POST requests.
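As a rough illustration of what a faceted SOLR query sent over HTTP POST might look like, the Python sketch below builds such a request without sending it. The endpoint URL and the field names (type, origin, scribe, date) are hypothetical, since the description does not document the service's address or schema; only the parameter names q, rows, wt, facet, facet.field and sort are standard Solr query parameters.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: the real service URL is not given in the description.
SOLR_SELECT_URL = "https://example.org/solr/arthurian/select"


def build_solr_request(query, facet_fields=(), rows=10, sort=None):
    """Build (but do not send) a POST request with standard Solr parameters."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per requested facet.
        params.extend(("facet.field", field) for field in facet_fields)
    if sort:
        params.append(("sort", sort))
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        SOLR_SELECT_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Field names below are placeholders, not the database's actual schema.
request = build_solr_request(
    "type:manuscript", facet_fields=["origin", "scribe"], sort="date asc"
)
print(request.get_method(), request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it; that step is left out here because the endpoint is a placeholder.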

This movie is in Dutch with English subtitles.

Besamusca, A.A.M. and Quinlan, J. (2012). The Fringes of Arthurian Fiction. Arthurian literature, 29, 191-241.

Boot, P. (2012), Manuscripten koning Arthur op tafel, E-Data & Research 7(1), 2012.

Dalen-Oskam, K. van and Besamusca, B. (2011), Arthurian Fiction in Medieval Europe: Narratives and Manuscripts, presentation held at the CLARIN-NL Kick-off meeting Call 2, Utrecht, February 9, 2011.

Dalen-Oskam, K. van (2011), ArthurianFiction, presentation held at the Call 3 information session, Utrecht, August 25, 2011.
