Active filters:

  • Metadata provider: CSD Tools
  • Organisation: Tilburg University
4 record(s) found

Search results

  • Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

    Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part-of-speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.
    Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf
    Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf
    Frog (plain text input)
    Frog (folia+xml input)
  • TiCClops: Text-Induced Corpus Clean-up online processing system

    TICCL (Text-Induced Corpus Clean-up) is a system that detects and corrects typographical errors (misprints) and optical character recognition (OCR) errors in texts. It searches a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus.
    When books or other texts are scanned from paper and the scans, i.e. images, are turned into digital text files, errors occur. For instance, the letter combination 'in' can be read as 'm', so the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form.
    TICCL was first developed as a prototype at the request of the Koninklijke Bibliotheek (KB) in The Hague and was reworked into a production tool according to KB specifications, mainly during the second half of 2008 (currently at production version 2.0). It is a fully functional environment for processing possibly very large corpora in order to largely remove the undesirable lexical variation in them. It has provisions for various input and output formats, is flexible and robust, and has very high recall and acceptable precision. As a spelling variation detection system it is, to the developer’s knowledge, unique in making principled use of the input text as a possible source of canonical target forms. As such it is far less domain-sensitive than other approaches: the domain is largely covered by the input text collection.
    TICCL comes in two variants: one with a classic CLAM web application interface, and one with the PhilosTEI interface.
    Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
    Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, pp 1-15. URL http://dx.doi.org/10.1007/s10032-010-0133-5
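
    The frequency bookkeeping described for TICCL can be illustrated with a toy sketch. This is not TICCL's actual implementation: the function names, the single 'm' to 'in' confusion rule and the "more frequent form wins" heuristic are all assumptions made for illustration. It only shows the idea of taking correction targets from the input text itself and summing the frequencies of variant forms into the normalized form.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each word type occurs (lower-cased, whitespace-split)."""
    return Counter(text.lower().split())


def confusion_variants(word, wrong="m", right="in"):
    """Yield each form obtained by replacing one occurrence of `wrong` with `right`."""
    start = word.find(wrong)
    while start != -1:
        yield word[:start] + right + word[start + len(wrong):]
        start = word.find(wrong, start + 1)


def normalise(freqs, wrong="m", right="in"):
    """Fold suspected variants into forms attested in the corpus, summing counts."""
    normalised = Counter()
    for word, count in freqs.items():
        # Hypothetical heuristic: accept a candidate only if it occurs in the
        # corpus itself and is more frequent than the suspected variant.
        candidates = [c for c in confusion_variants(word, wrong, right)
                      if freqs.get(c, 0) > count]
        target = max(candidates, key=freqs.get) if candidates else word
        normalised[target] += count
    return normalised


corpus = "de regeering en de regeermg en de regeering"
freqs = word_frequencies(corpus)
normalised = normalise(freqs)
```

    In this toy corpus the single occurrence of 'regeermg' is folded into 'regeering', so the normalized frequency of 'regeering' is the sum of both observed forms.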
  • Usage

    The system here allows you to convert images of your book pages into editable text, presented in a particular XML (eXtensible Markup Language) format called Text Encoding Initiative XML, or TEI XML. This format was developed specifically for marking up or annotating the text you want to work on, i.e. for adding all manner of further information to the actual text, e.g. to build a critical edition of it, which is most likely exactly what you want to do with your author's work.
    Betti, A, Reynaert, M and van den Berg, H. 2017. @PhilosTEI: Building Corpora for Philosophers. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 379–392. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.32. License: CC-BY 4.0
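
    To give an impression of the target format, the following sketch builds a minimal TEI XML document with Python's standard library. It does not reproduce the output of the system described here; the helper names (`q`, `build_tei`) and the placeholder header texts are assumptions for illustration. Only the TEI namespace and the basic `teiHeader`/`text`/`body` layout come from the TEI standard.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def q(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, paragraphs):
    """Build a minimal TEI tree: a header plus one <p> per paragraph."""
    root = ET.Element(q("TEI"))
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    # Placeholder header texts (hypothetical, for illustration only).
    pub = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub, q("p")).text = "Unpublished working copy"
    src = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(src, q("p")).text = "Text recognised from page images"
    body = ET.SubElement(ET.SubElement(root, q("text")), q("body"))
    for para in paragraphs:
        ET.SubElement(body, q("p")).text = para
    return root


tei = build_tei("Sample page", ["First recognised paragraph.", "Second paragraph."])
xml_string = ET.tostring(tei, encoding="unicode")
```

    Annotations for a critical edition would then be added as further TEI elements inside the `body`.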
  • PILNAR: Pilgrimage Narratives - a corpus for studying the profile of the modern pilgrim

    A corpus of pilgrimage narratives with Dutch texts written after ca. 2000 that present the thoughts and impressions of pilgrims to Santiago de Compostela. The PILNAR corpus is a source for research in a variety of (sub)disciplines: culture studies, ritual and religious studies, but also media and e-culture studies (cf. the use of blogs and other social media for the self-presentation of experiences). Only for authorized users. The PILNAR corpus contains six subcorpora:
    - Volumes of De Jacobsstaf, 1986 onwards: 84 PDF files;
    - Volumes of De Pelgrim of the Flemish Society of Santiago de Compostella, nos. 1-4 (16 MB, 10 MB, 16 MB) (both societies collaborate closely);
    - Volumes of Ultreia, a newsletter; 3 issues available now: January, February, April 2011;
    - Pilgrimage accounts and blogs by pilgrims available via the Societies (Netherlands: circa 140 files; Flemish: circa 138 files);
    - A corpus of pilgrimage narratives compiled on the occasion of the exhibition in Museum Catharijneconvent held in collaboration with the Society: www.pelgrimsverhalen.nl; already on the site now: about 180 fields (as of July 2011);
    - Accounts and narratives that come in after a specially targeted notice via the site and periodical by the Society (De Jacobsstaf), with perhaps a Flemish companion piece (De Pelgrim).