CLARIN Tool Portal

Metadata Editor, Browser and Organiser for IMDI and CMDI

1 resources

Arbil (Archive Builder) is a metadata editor, browser and organiser for metadata in IMDI and CMDI format. It is a Java desktop application that runs on most operating systems. Arbil can be used to create new metadata from scratch for resources on your local machine, or to download and modify metadata that are already in an archive. Arbil is a generic CMDI editor and therefore supports all CMDI profiles. It has a built-in file type verification tool that is configured to check files against the list of accepted file types for The Language Archive; this can, however, be overruled for other archives.

Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

1 resources

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part of speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.

Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf

Frog (plain text input)

Frog (folia+xml input)

Nederlab, online laboratory for humanities research on Dutch text collections

1 resources

The Nederlab project aims to bring together all digitized texts relevant to Dutch national heritage and the history of Dutch language and culture (c. 800 - present) in one user-friendly and tool-enriched open access web interface, allowing scholars to simultaneously search and analyze data from texts spanning the full recorded history of the Netherlands, its language and culture. The project builds on various initiatives: for corpora Nederlab collaborates with the scientific libraries and institutions, for infrastructure with CLARIN (and CLARIAH), and for tools with eHumanities programmes such as Catch, IMPACT and CLARIN (TICCL, Frog). Nederlab will offer a large number of search options with which researchers can find the occurrence of a particular term in a particular corpus or subcorpus. It will also offer visualization of search results through line graphs, bar graphs, circle graphs, or scatter graphs. Furthermore, this online lab will offer a large set of tools, such as tokenization tools, tools for spelling normalization, PoS-tagging tools, lemmatization tools, a computational historical lexicon and indices. It will also offer (semi-)automatic syntactic parsing, tools for text mining, data mining and sentiment mining, Named Entity Recognition tools, coreference resolution tools, plagiarism detection tools, paraphrase detection tools and cartographical tools. The first version of Nederlab was launched in early 2015; it will be expanded until the end of 2017. Nederlab is financed by NWO, KNAW, CLARIAH and CLARIN-NL.

http://www.nederlab.nl/wp/?page_id=12

Automatic Annotation of Multi-modal Language Resources

1 resources

The AAM-LR project provides a web service that helps field researchers to annotate audio- and video-recordings. At the top level the service marks the time intervals at which specific persons in the recording are speaking. In addition, the service provides a global phonetic annotation, using language independent phone models and phonetic features. Speech is separated from speaker noises such as laughing. Note: this service has been withdrawn and the URLs and PID do not resolve anymore!

LASSY Word Relations Search Web Application

1 resources

The LASSY word relations web application makes it possible to search for sentences that contain pairs of words between which there is a grammatical relation. One can search in the Dutch LASSY-SMALL Treebank (1 million tokens), in which the syntactic parse of each sentence has been manually verified, and in (a part of) the LASSY-LARGE Treebank (700 million tokens), in which the syntactic parse of each sentence has been added by the automatic parser Alpino. One can restrict the query to words of a particular part of speech, which is very useful in the case of syntactic ambiguities. One can also leave out the string of the word, so that one can obtain, for example, a list of sentences in which any adverb modifies a given verb, or even any word modifies a given verb. On the page that lists the found sentences, one can view the exact syntactic structure of each sentence with a single click. The application also provides detailed frequency information for all found sentences and word pairs. The LASSY treebanks were created by KU Leuven and the Rijksuniversiteit Groningen with funding from the Dutch Language Union. One can obtain these treebanks through the HLT Agency (TST-Centrale). Use PaQu (http://dev.clarin.nl/node/4182) for many more options and if you want to search for word pairs in your own text corpus.
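
The kind of query described above can be sketched in a few lines of Python. This is only an illustration of word-pair search with optional constraints, not the application's actual implementation: the `Relation` record, the `find_pairs` helper, the relation labels and the sample data are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Relation(NamedTuple):
    """One grammatical relation between two words (hypothetical record layout)."""
    sentence_id: int
    head: str
    head_pos: str
    relation: str
    dependent: str
    dependent_pos: str


def find_pairs(relations, **criteria):
    """Return relations matching every given field; omitted or None fields act as wildcards."""
    return [
        r for r in relations
        if all(getattr(r, field) == value
               for field, value in criteria.items() if value is not None)
    ]


# Made-up sample data, standing in for parsed treebank sentences.
RELATIONS = [
    Relation(1, "loopt", "verb", "mod", "snel", "adv"),
    Relation(1, "loopt", "verb", "su", "hij", "pron"),
    Relation(2, "loopt", "verb", "mod", "vaak", "adv"),
    Relation(3, "zingt", "verb", "mod", "snel", "adv"),
]

# Any adverb that modifies the verb "loopt": the dependent word itself is left out.
hits = find_pairs(RELATIONS, head="loopt", relation="mod", dependent_pos="adv")
sentence_ids = [r.sentence_id for r in hits]
pair_counts = Counter((r.head, r.dependent) for r in hits)
```

Leaving out `dependent_pos` as well would return every word that modifies the given verb, which mirrors the wildcard behaviour the description mentions.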

PICCL: Philosophical Integrator of Computational and Corpus Libraries

1 resources

PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers Open Source software for optical character recognition. TICCL (Text Induced Corpus Clean-up) is a system that is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. The frequencies of the normalized word forms are the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper by a machine that then turns these scans, i.e. images, into digital text files, errors occur. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
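
The idea of a word frequency list combined with variant detection can be illustrated with a small Python sketch. It is not TICCL's actual algorithm: it swaps in a plain Levenshtein edit distance and a simple frequency threshold, and the function names, the thresholds (`max_distance`, `min_ratio`) and the sample tokens are assumptions made for the example.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest_corrections(tokens, max_distance=2, min_ratio=5):
    """Map each rare form to a much more frequent form within max_distance edits."""
    frequencies = Counter(tokens)
    suggestions = {}
    for rare, rare_count in frequencies.items():
        candidates = [
            (count, word) for word, count in frequencies.items()
            if word != rare
            and count >= min_ratio * rare_count
            and edit_distance(rare, word) <= max_distance
        ]
        if candidates:
            suggestions[rare] = max(candidates)[1]  # most frequent candidate
    return frequencies, suggestions


# Made-up corpus: the OCR error 'regeermg' occurs once, the correct form ten times.
tokens = ["regeering"] * 10 + ["regeermg"] + ["de"] * 20
frequencies, suggestions = suggest_corrections(tokens)

# Frequency of a normalized form = sum of the frequencies of its actual forms.
normalized = Counter()
for word, count in frequencies.items():
    normalized[suggestions.get(word, word)] += count
```

Here 'regeermg' is two edits away from 'regeering' (one substitution and one insertion), so it is mapped to the frequent form and its count is added to the normalized total.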

Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf

PICCL

Use "PICCL: Philosophical Integrator of Computational and Corpus Libraries"

FLAT: FoLiA-Linguistic-Annotation-Tool

1 resources

FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.

Features:
- Web-based, multi-user environment.
- Server-side document storage, divided into 'namespaces'; by default each user has their own namespace. Active documents are held in memory server-side. Read and write permissions to access namespaces are fully configurable.
- Concurrency (multiple users may edit the same document simultaneously).
- Full versioning control for documents (using git, in foliadocserve), allowing limitless undo operations.
- Full annotator information, with timestamps, is stored in the FoLiA XML and can be displayed by the interface. The git log also contains verbose information on annotations.
- Annotators can indicate their confidence for each annotation.
- Highly configurable interface; interface options may be disabled on a per-configuration basis. Multiple configurations can be deployed on a single installation.
- Displays and retains document structure (divisions, paragraphs, sentences, lists, etc.).
- Support for corrections, of text or linguistic annotations, and alternative annotations. Corrections are never mere substitutes; originals are always preserved. Spelling corrections for run-ons, splits, insertions and deletions are supported.
- Supports FoLiA Set Definitions to display label sets. Sets are not predefined in FoLiA and anybody can create their own.
- Supports Token Annotation and Span Annotation.
- Supports complex span annotations such as dependency relations, syntactic units (constituents), predicates and semantic roles, sentiments, statements/attribution, and observations.
- Simple metadata editor for editing/adding arbitrary metadata to a document. Selected metadata fields can be shown in the document index as well.
- User permission model featuring groups, group namespaces, and assignable permissions.
- File/document management functions (copying, moving, deleting).
- Allows converter extensions to convert from other formats to FoLiA on upload.
- In-document search (CQL or FQL); advanced searches can be predefined by administrators.
- Morphosyntactic tree visualisation (constituency parses and morphemes).
- Higher-order annotation: associate features, comments, descriptions with any linguistic annotation.

CLARIN Vocabulary Service

1 resources

The CLARIN Vocabulary Service is a running instance of the OpenSKOS exchange and publication platform for SKOS vocabularies. OpenSKOS offers several ways to publish SKOS vocabularies (upload SKOS file, harvest from another OpenSKOS instance with OAI-PMH, construct using the RESTful API) and to use vocabularies (search and autocomplete using the API, harvest using OAI-PMH, inspect in the interactive Editor or consult as Linked Data). This CLARIN OpenSKOS instance is hosted by the Meertens Institute.

Contents

This OpenSKOS instance currently publishes SKOS versions of three vocabularies:
- ISO-639-3 language codes, as published by SIL.
- Closed and simple Data Categories from the ISOcat metadata profile.
- A manually constructed and curated list of Organizations, based on the CLARIN VLO.
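
To give a rough idea of what consuming a SKOS vocabulary involves, the Python sketch below reads preferred labels from a small SKOS RDF/XML fragment and performs a local prefix lookup. It does not call the OpenSKOS API: the sample document, the `example.org` URIs and the helper names `pref_labels` and `autocomplete` are hypothetical, and only the standard SKOS and RDF namespaces are taken as given.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample vocabulary; the concept URIs are placeholders.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/lang/nld">
    <skos:notation>nld</skos:notation>
    <skos:prefLabel xml:lang="en">Dutch</skos:prefLabel>
    <skos:prefLabel xml:lang="nl">Nederlands</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/lang/deu">
    <skos:notation>deu</skos:notation>
    <skos:prefLabel xml:lang="en">German</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""


def pref_labels(skos_xml: str, lang: str = "en") -> dict:
    """Map each concept URI to its preferred label in the requested language."""
    root = ET.fromstring(skos_xml)
    labels = {}
    for concept in root.findall("skos:Concept", NS):
        uri = concept.get(f"{{{NS['rdf']}}}about")
        for label in concept.findall("skos:prefLabel", NS):
            if label.get(XML_LANG) == lang:
                labels[uri] = label.text
    return labels


def autocomplete(labels: dict, prefix: str) -> list:
    """Return the labels that start with the prefix, ignoring case."""
    return sorted(l for l in labels.values() if l.lower().startswith(prefix.lower()))


labels = pref_labels(SAMPLE)
matches = autocomplete(labels, "du")
```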

Brugman, H. 2017. CLAVAS: A CLARIN Vocabulary and Alignment Service. In: Odijk J. & van Hessen A, CLARIN in the Low Countries, ch 5, pp 61-69. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.5

Web-based Annotation Explorer

1 resources

ANNEX (Annotation Explorer) is a web-based tool for exploring and viewing annotated multimedia recordings in an archive. ANNEX can play audio and video files in a web browser along with annotations in a variety of formats: ELAN (EAF), Shoebox/Toolbox text, CHAT (CHILDES annotation format), plain text, CSV, PDF, SubRip, Praat TextGrid, HTML and XML. ANNEX visualises the annotations in synchrony with the media files as long as time-alignment information is available. If no time-alignment information is available, a default segment duration is assumed. ANNEX has a graphical interface that resembles the interface of the ELAN annotation tool to some extent, with a number of different view modes such as subtitle view, timeline view and grid view. ANNEX runs in any modern web browser with the Adobe Flash plugin (> version 10) installed.

ANNEX has been functionally extended with the help of the following CLARIN-NL-funded projects:
- ColTime: Collaboration on Time-Based Resources.
- EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX.
- MultiCon: Multilayer Concordance Functions in ELAN and ANNEX.
- SignLinC: Linking lexical databases and annotated corpora of signed languages.

Over the years, many funders have contributed to the development of ANNEX in several projects, such as the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research and the Max Planck Society.
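
One possible reading of the default segment duration is sketched below in Python: annotations without time-alignment are laid out as consecutive slots of a fixed length. How the tool actually places such segments is not stated in this description, so the function name, the 2000 ms default and the sample lines are assumptions.

```python
def assign_default_segments(annotations, default_duration_ms=2000):
    """Give untimed annotations consecutive time slots of a fixed length.

    Returns (start_ms, end_ms, text) tuples in the original order.
    """
    segments = []
    for index, text in enumerate(annotations):
        start = index * default_duration_ms
        segments.append((start, start + default_duration_ms, text))
    return segments


# Made-up untimed transcript lines.
segments = assign_default_segments(["hello", "how are you", "fine"])
```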

Ucto Tokeniser

1 resources

Ucto tokenizes text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers other basic preprocessing steps, such as changing case, which make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent: by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and it is easily extensible to other languages. It recognizes dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 output with NFC normalization and optionally accepts other encodings as input. Conversion to all lowercase or uppercase is optional. Ucto supports FoLiA XML.
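
The basic steps named above (separating words from punctuation, splitting sentences, NFC normalization and optional lowercasing) can be approximated with a short Python sketch. This is a deliberately naive stand-in, not Ucto's rule engine: it uses a single regular expression, treats every '.', '!' or '?' as a sentence end, and has no handling for abbreviations, dates or quotes. The names `tokenize`, `TOKEN_PATTERN` and `SENTENCE_END` are made up for the example.

```python
import re
import unicodedata

# A token is either a run of word characters or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text: str, lowercase: bool = False):
    """Return a list of sentences, each a list of tokens."""
    text = unicodedata.normalize("NFC", text)  # compose characters first
    if lowercase:
        text = text.lower()
    sentences, current = [], []
    for token in TOKEN_PATTERN.findall(text):
        current.append(token)
        if token in SENTENCE_END:  # naive sentence boundary
            sentences.append(current)
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(current)
    return sentences


sentences = tokenize("Dit is een zin. En nog een!", lowercase=True)
```

A rule-based tokeniser replaces the single pattern with ordered, language-specific rules, which is what lets it keep abbreviations and dates together instead of splitting on every full stop.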

Ucto
