Result filters

Metadata provider

Language

Resource type

  • Web service
  • Local application

Availability

Active filters:

  • Resource type: Web service
  • Resource type: Local application
Loading...
6 record(s) found

Search results

  • PICCL: Philosophical Integrator of Computational and Corpus Libraries

    PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers Open Source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system that is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. These frequencies of the normalized word forms are the sum of the frequencies of the actual word forms found in the corpus. TICCL is a system that is intended to detect and correct typographical errors (misprints) and OCR errors (optical character recognition) in texts. When books or other texts are scanned from paper by a machine, that then turns these scans, i.e. images, into digital text files, errors occur. For instance, the letter combination `in' can be read as `m', and so the word `regeering' is incorrectly reproduced as `regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
    Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
    PICCL
  • Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

    Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part of speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.
    Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016.Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf
    Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf
    Frog (plain text input)
    Frog (folia+xml input)
  • Ucto Tokeniser

    Ucto tokenizes text files: it separates words from punctuation, and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps such as changing case that you can all use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent. By supplying language-specific tokenisation rules in an external configuration file a tokeniser can be created for a specific language. Ucto comes with tokenization rules for English, Dutch, French, Italian, and Swedish; it is easily extendible to other languages. It recognizes dates, times, units, currencies, abbreviations. It recognizes paired quote spans, sentences, and paragraphs. It produces UTF8 encoding and NFC output normalization, optionally accepts other encodings as input. Optional conversion to all lowercase or uppercase. Ucto supports FoLiA XML.
    Ucto
  • Automatic Transcription of Oral History Interviews

    This webservice and web application uses automatic speech recognition to provide the transcriptions of recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.
  • OpenConvert

    The OpenConvert tools convert to TEI or FOLiA from a number of input formats (alto, text, word, HTML, ePub). The tools are available as a Java command line tool, a web service and a web application.The OpenConvert Tools were created by IVDNT in the OpenConvert project. The OpenConvert tools convert to TEI or FOLiA from a number of input formats (alto, text, word, HTML, ePub). The tools are available as a Java command line tool, a web service and a web application. Furthermore, as a proof of concept, the website currently provides two annotation tools: a simple Tokenizer for TEI files and a modern Dutch part of speech tagger.
    The tool service can be called as a REST webservice which returns responses in XML, allowing it to be part of a webservice tool chain.
    Input TEI, plain text, HTML
    ALTO XML input
    ePub input
    directory containing files of a valid input type
    zip file (with extension .zip) containing files of a valid input type
    Free for academic use. Non-applicable for commercial parties
    CLARIN based login required. The Clarin federation accepts login from many europian institutions. please seehttp://www.clarin.eu/content/service-provider-federation for more details
    input file name (File upload)
    Format of input file
    Format of output file
    to specify the tagger or tokeniser
    input file mimetype is application/tei+xml
    input file mimetype is text/html
    input file mimetype is text/alto+xml
    input file mimetype is application/msword
    input file mimetype is application/epub+zip
    input file mimetype is text/plain
    output file mimetype is application/tei+xml
    output file mimetype is text/folia+xml
    Basic tagger-lemmatizer for modern Dutch
    a TEI tokenizer