CLARIN Tool Portal

FLAT: FoLiA-Linguistic-Annotation-Tool

1 resources

FLAT is a web-based linguistic annotation environment built around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich them with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.

Features:

- Web-based, multi-user environment
- Server-side document storage, divided into 'namespaces'; by default each user has their own namespace. Active documents are held in memory server-side. Read and write permissions to access namespaces are fully configurable.
- Concurrency (multiple users may edit the same document simultaneously)
- Full version control for documents (using git, in foliadocserve), allowing limitless undo operations
- Full annotator information, with timestamps, is stored in the FoLiA XML and can be displayed by the interface. The git log also contains verbose information on annotations.
- Annotators can indicate their confidence for each annotation
- Highly configurable interface; interface options may be disabled on a per-configuration basis. Multiple configurations can be deployed on a single installation.
- Displays and retains document structure (divisions, paragraphs, sentences, lists, etc.)
- Support for corrections, of text or linguistic annotations, and alternative annotations. Corrections are never mere substitutes; originals are always preserved. Spelling corrections for run-ons, splits, insertions and deletions are supported.
- Supports FoLiA Set Definitions to display label sets. Sets are not predefined in FoLiA and anybody can create their own.
- Supports token annotation and span annotation
- Supports complex span annotations such as dependency relations, syntactic units (constituents), predicates and semantic roles, sentiments, statements/attribution, observations
- Simple metadata editor for editing/adding arbitrary metadata to a document. Selected metadata fields can be shown in the document index as well.
- User permission model featuring groups, group namespaces, and assignable permissions
- File/document management functions (copying, moving, deleting)
- Allows converter extensions to convert from other formats to FoLiA on upload
- In-document search (CQL or FQL); advanced searches can be predefined by administrators
- Morphosyntactic tree visualisation (constituency parses and morphemes)
- Higher-order annotation: associate features, comments, descriptions with any linguistic annotation

PICCL: Philosophical Integrator of Computational and Corpus Libraries

1 resources

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers Open Source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. Errors occur when books or other texts are scanned from paper by a machine that then turns these scans, i.e. images, into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
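The word frequency list and the 'in'/'m' confusion described above can be illustrated with a minimal Python sketch. This is not TICCL's actual algorithm or interface: the function names `word_frequencies` and `suggest_corrections`, the one-entry confusion table and the sample text are all assumptions made for illustration only.

```python
from collections import Counter
import re

# Hypothetical confusion table: OCR output -> what was probably printed.
OCR_CONFUSIONS = {"m": "in"}

def word_frequencies(text):
    """Count how often each lowercased word type occurs in the text."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def suggest_corrections(word, frequencies):
    """Return known words reachable by undoing one OCR confusion in `word`."""
    candidates = set()
    for wrong, right in OCR_CONFUSIONS.items():
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate != word and candidate in frequencies:
                candidates.add(candidate)
            start = word.find(wrong, start + 1)
    # Most frequent candidate first, ties broken alphabetically.
    return sorted(candidates, key=lambda w: (-frequencies[w], w))

# Invented sample corpus containing one OCR error.
sample = "De regeering besluit. De regeermg besluit. De regeering spreekt."
freqs = word_frequencies(sample)
suggestions = suggest_corrections("regeermg", freqs)
```

Here the misread form 'regeermg' is mapped back to 'regeering' because that form is attested in the frequency list. A real system would need a far larger confusion model and a way to rank competing candidates.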

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

PICCL


Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

1 resources

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part of speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.

Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf

Frog (plain text input)

Frog (folia+xml input)

Ucto Tokeniser

1 resources

Ucto tokenizes text files: it separates words from punctuation, and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent. By supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenization rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognizes dates, times, units, currencies, and abbreviations. It recognizes paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalized output and optionally accepts other encodings as input. Optional conversion to all lowercase or uppercase is available. Ucto supports FoLiA XML.
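The basic steps named above (NFC normalization, separating words from punctuation, splitting sentences, optional lowercasing) can be sketched in a few lines of Python. This is not Ucto's implementation or API: the `tokenize` function, its regular expression and its sentence-ending rule are simplified assumptions, and the sketch ignores abbreviations, dates, quotes and language-specific rules that Ucto handles.

```python
import re
import unicodedata

# A token is either a run of word characters or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
# Assumed, simplified set of sentence-ending tokens.
SENTENCE_END = {".", "!", "?"}

def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of token strings."""
    # Normalize to NFC so that composed and decomposed input is treated alike.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    sentences, current = [], []
    for token in TOKEN_PATTERN.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences

result = tokenize("Hello, world! Ucto splits sentences.")
```

With this naive rule an abbreviation such as "e.g." would wrongly end a sentence, which is one reason a rule-based tokeniser with language-specific configuration is useful in practice.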

Ucto

Ucto Tokeniser Engine

1 resources

The Ucto tokenisation engine is a language-independent engine that, given an external configuration file with tokenisation rules for a specific language, yields a tokenizer for that language. The resulting tokenizer processes text files: it separates words from punctuation, and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenization rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognizes dates, times, units, currencies, and abbreviations. It recognizes paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalized output and optionally accepts other encodings as input. Optional conversion to all lowercase or uppercase is available. Ucto supports FoLiA XML.

Automatic Transcription of Oral History Interviews

1 resources

This webservice and web application uses automatic speech recognition to provide the transcriptions of recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.

Alpino: a dependency parser for Dutch (CLST web service and application)

1 resources

This is a web service and web application for the Alpino dependency parser for Dutch, developed in the context of the PIONIER Project Algorithms for Linguistic Processing.

git clone --depth 1 git://urd.let.rug.nl/Alpino.git

Taalportaal, the linguistics of Dutch, Frisian and Afrikaans online.

Taalportaal (or Language Portal) is an interactive knowledge base about Dutch, Frisian and Afrikaans. It provides access to a comprehensive and authoritative scientific grammar for these three languages.

van der Wouden, T, Bouma, G, van de Camp, M, van Koppen, M, Landsbergen, F and Odijk, J. 2017. Enriching a Scientific Grammar with Links to Linguistic Resources: The Taalportaal. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 299–310. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.24. License: CC-BY 4.0

