CLARIN Tool Portal

Active filters:

Resource type: Web service
Resource type: Local application

6 record(s) found

Search results

Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

1 resources

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part of speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.

Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf

Frog (plain text input)

Frog (folia+xml input)
PICCL: Philosophical Integrator of Computational and Corpus Libraries

1 resources

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers open-source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often the word occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and optical character recognition (OCR) errors in texts. Errors occur when a machine scans books or other texts from paper and then turns these scans, i.e. images, into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
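The frequency bookkeeping described above can be illustrated with a minimal Python sketch. It is not TICCL's implementation: the function names, the whitespace-based word splitting and the hand-written variant table are assumptions made for illustration, and TICCL itself finds the variants automatically.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each word type occurs (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def normalized_frequencies(freqs, variant_to_normal):
    """Sum the counts of all observed variants under their normalized form."""
    totals = Counter()
    for word, count in freqs.items():
        # Words without a known normalized form are kept as they are.
        totals[variant_to_normal.get(word, word)] += count
    return totals


# Toy corpus containing the OCR error from the description ('in' read as 'm').
corpus = "de regeering en de regeermg en de regeering"
freqs = word_frequencies(corpus)

# Hypothetical variant table; a real system would derive this from the corpus.
variants = {"regeermg": "regeering"}
normalized = normalized_frequencies(freqs, variants)

print(freqs["regeering"], freqs["regeermg"])
print(normalized["regeering"])
```

Here the count for the normalized form 'regeering' combines the two correct occurrences with the one misread occurrence.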

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

PICCL

Ucto Tokeniser

1 resources

Ucto tokenizes text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent. By supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenization rules for English, Dutch, French, Italian, and Swedish; it is easily extendible to other languages. It recognizes dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded output with NFC normalization and optionally accepts other encodings as input. Output can optionally be converted to all lowercase or all uppercase. Ucto supports FoLiA XML.
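To make the basic task concrete, here is a deliberately naive Python sketch of tokenisation and sentence splitting with NFC normalization. It does not use Ucto or its rule files; the regular expression, the set of sentence-final marks and the function names are assumptions chosen for illustration only.

```python
import re
import unicodedata

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumed sentence-final punctuation; real rules are language specific.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Normalize to NFC, then separate words from punctuation."""
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


def split_sentences(tokens):
    """Close a sentence after every sentence-final punctuation token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


tokens = tokenize("Dit is een zin. Is dit er nog een?")
sentences = split_sentences(tokens)
print(tokens)
print(len(sentences))
```

A splitter this simple breaks on every full stop, so it would cut abbreviations and dates into pieces. Handling such cases is the kind of work that language-specific tokenisation rules are meant to do.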

Ucto
Automatic Transcription of Oral History Interviews

1 resources

This webservice and web application uses automatic speech recognition to provide the transcriptions of recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.
Alpino: a dependency parser for Dutch (CLST web service and application)

1 resources

This is a web service and web application for the Alpino dependency parser for Dutch, which was developed in the context of the PIONIER Project Algorithms for Linguistic Processing.

git clone --depth 1 git://urd.let.rug.nl/Alpino.git
OpenConvert

1 resources

The OpenConvert tools convert to TEI or FoLiA from a number of input formats (ALTO, text, Word, HTML, ePub). The tools are available as a Java command line tool, a web service and a web application. The OpenConvert tools were created by IVDNT in the OpenConvert project. Furthermore, as a proof of concept, the website currently provides two annotation tools: a simple tokenizer for TEI files and a modern Dutch part-of-speech tagger.

The tool service can be called as a REST webservice which returns responses in XML, allowing it to be part of a webservice tool chain.

Input TEI, plain text, HTML

ALTO XML input

ePub input

directory containing files of a valid input type

zip file (with extension .zip) containing files of a valid input type

Free for academic use. Not applicable for commercial parties.

CLARIN-based login required. The CLARIN federation accepts logins from many European institutions. Please see http://www.clarin.eu/content/service-provider-federation for more details.

input file name (File upload)

Format of input file

Format of output file

to specify the tagger or tokeniser

input file mimetype is application/tei+xml

input file mimetype is text/html

input file mimetype is text/alto+xml

input file mimetype is application/msword

input file mimetype is application/epub+zip

input file mimetype is text/plain

output file mimetype is application/tei+xml

output file mimetype is text/folia+xml

Basic tagger-lemmatizer for modern Dutch

a TEI tokenizer
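When preparing a request for the service, the input and output MIME types listed above could be kept in a small lookup table. The Python sketch below is only an illustration: the MIME type strings are copied from this listing, but the short format keys and the `check_conversion` helper are hypothetical and are not part of the OpenConvert interface.

```python
# MIME types as listed for the service; the dictionary keys are hypothetical.
INPUT_MIMETYPES = {
    "tei": "application/tei+xml",
    "html": "text/html",
    "alto": "text/alto+xml",
    "word": "application/msword",
    "epub": "application/epub+zip",
    "text": "text/plain",
}

OUTPUT_MIMETYPES = {
    "tei": "application/tei+xml",
    "folia": "text/folia+xml",
}


def check_conversion(input_format, output_format):
    """Return the (input, output) MIME types, or raise for an unlisted format."""
    if input_format not in INPUT_MIMETYPES:
        raise ValueError(f"unlisted input format: {input_format}")
    if output_format not in OUTPUT_MIMETYPES:
        raise ValueError(f"unlisted output format: {output_format}")
    return INPUT_MIMETYPES[input_format], OUTPUT_MIMETYPES[output_format]


print(check_conversion("alto", "folia"))
```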
