Active filters:

  • Resource type: Local application
  • Project: CLARIN-NL
13 record(s) found

Search results

  • FLAT: FoLiA-Linguistic-Annotation-Tool

    FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich them with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure. Features:
    - Web-based, multi-user environment
    - Server-side document storage, divided into 'namespaces'; by default each user has their own namespace. Active documents are held in memory server-side. Read and write permissions for namespaces are fully configurable.
    - Concurrency: multiple users may edit the same document simultaneously.
    - Full version control for documents (using git), allowing limitless undo operations (in foliadocserve).
    - Full annotator information, with timestamps, is stored in the FoLiA XML and can be displayed by the interface; the git log also contains verbose information on annotations.
    - Annotators can indicate their confidence for each annotation.
    - Highly configurable interface; interface options may be disabled on a per-configuration basis. Multiple configurations can be deployed on a single installation.
    - Displays and retains document structure (divisions, paragraphs, sentences, lists, etc.).
    - Support for corrections, of text or linguistic annotations, and for alternative annotations. Corrections are never mere substitutes; originals are always preserved. Spelling corrections for run-ons, splits, insertions and deletions are supported.
    - Supports FoLiA Set Definitions to display label sets. Sets are not predefined in FoLiA and anybody can create their own.
    - Supports token annotation and span annotation.
    - Supports complex span annotations such as dependency relations, syntactic units (constituents), predicates and semantic roles, sentiments, statements/attribution, and observations.
    - Simple metadata editor for editing/adding arbitrary metadata to a document. Selected metadata fields can be shown in the document index as well.
    - User permission model featuring groups, group namespaces, and assignable permissions.
    - File/document management functions (copying, moving, deleting).
    - Allows converter extensions to convert from other formats to FoLiA on upload.
    - In-document search (CQL or FQL); advanced searches can be predefined by administrators.
    - Morphosyntactic tree visualisation (constituency parses and morphemes).
    - Higher-order annotation: associate features, comments, and descriptions with any linguistic annotation.
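    To give a concrete impression of the FoLiA documents that FLAT operates on, the following is a minimal sketch that reads a FoLiA file with the FoLiA-tools Python library (module folia.main); the file name is hypothetical and the exact API may differ between library versions.

        # Minimal sketch: inspect a FoLiA document with the FoLiA-tools Python library.
        # Assumes `pip install folia`; the file name below is hypothetical.
        import folia.main as folia

        doc = folia.Document(file="example.folia.xml")  # parse an existing FoLiA document

        # Walk the document structure that FLAT also visualises: sentences and their tokens.
        for sentence in doc.sentences():
            print("Sentence:", sentence.text())
            for word in sentence.words():
                print("  token:", word.text())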
  • PICCL: Philosophical Integrator of Computational and Corpus Libraries

    PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historic language and Natural Language Processing. It combines Tesseract Optical Character Recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract offers open-source software for optical character recognition. TICCL (Text-Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often it occurs in the corpus; the frequencies of the normalised word forms are the sums of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper and the resulting images are turned into digital text files, errors occur: for instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Frog enriches textual documents with various linguistic annotations.
    Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
    PICCL
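    The description above notes that TICCL sums the frequencies of the actual word forms behind each normalised form. The toy Python sketch below illustrates only that aggregation idea; it is not TICCL itself, and the variant-to-normalised-form mapping is invented purely for illustration.

        # Toy illustration of frequency aggregation over spelling variants.
        # This is NOT TICCL; the variant mapping below is invented for the example.
        from collections import Counter

        corpus_tokens = ["regeering", "regeermg", "regeering", "regering", "stad"]

        # Hypothetical mapping from observed variants to a normalised form, the kind
        # of link TICCL tries to establish automatically.
        normalise = {"regeermg": "regeering", "regering": "regeering"}

        raw_freq = Counter(corpus_tokens)                 # frequencies of actual word forms
        norm_freq = Counter()
        for form, count in raw_freq.items():
            norm_freq[normalise.get(form, form)] += count  # sum into the normalised form

        print(raw_freq)   # Counter({'regeering': 2, 'regeermg': 1, 'regering': 1, 'stad': 1})
        print(norm_freq)  # Counter({'regeering': 4, 'stad': 1})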
  • Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

    Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part of speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL.
    Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp. 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf
    Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf
    Frog (plain text input)
    Frog (folia+xml input)
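    As an impression of how Frog's linguistic enrichment can be used programmatically, here is a minimal sketch with the python-frog binding; the option names, constructor arguments and returned fields are given from memory and should be checked against the python-frog documentation for your installed version.

        # Minimal sketch using the python-frog binding (pip install python-frog).
        # Option names and output fields are assumptions; verify against your version.
        import frog

        options = frog.FrogOptions(parser=False)  # skip the dependency parser for speed
        # Depending on the installed version, a Frog configuration file may need to be
        # passed explicitly, e.g. frog.Frog(options, "/etc/frog/frog.cfg") (path is an assumption).
        frogger = frog.Frog(options)

        # Frog tokenises, tags and lemmatises the Dutch input sentence.
        for token in frogger.process("Dit is een testzin voor Frog."):
            print(token.get("text"), token.get("lemma"), token.get("pos"))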
  • Metadata Editor, Browser and Organiser for IMDI and CMDI

    Arbil (Archive Builder) is a metadata editor, browser and organiser for metadata in IMDI and CMDI format. It is a Java desktop application that runs on most operating systems. Arbil can be used to create new metadata from scratch for resources on your local machine, or to download and modify metadata that are already in an archive. Arbil is a generic CMDI editor and therefore supports all CMDI profiles. It has a built-in file type verification tool that is configured to check files against the list of accepted file types for The Language Archive; this can, however, be overridden for other archives.
  • Ucto Tokeniser

    Ucto tokenises text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine itself is language independent; by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output, optionally accepts other encodings as input, and can optionally convert everything to lowercase or uppercase. Ucto supports FoLiA XML.
    Ucto
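    Here is a minimal sketch of driving the ucto command-line tool from Python. It assumes ucto is installed and on PATH; the file names are hypothetical and the -L flag should be checked against `ucto --help` for your version.

        # Minimal sketch: run the ucto command-line tokeniser from Python.
        # Assumes ucto is installed and on PATH; file names are hypothetical, and the
        # -L flag (select a bundled language configuration) should be verified locally.
        import subprocess

        result = subprocess.run(
            ["ucto", "-L", "eng", "input.txt", "tokenised.txt"],  # tokenise input.txt into tokenised.txt
            check=True,
            capture_output=True,
            text=True,
        )
        print(result.stderr)  # ucto reports progress and warnings on stderr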
  • ELAN Multimedia Annotator

    ELAN is a professional tool for the creation of complex annotations on video and audio resources. With ELAN a user can add an unlimited number of annotations to audio and/or video streams. An annotation can be a sentence, word or gloss, a comment, a translation, or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, which can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The textual content of annotations is always in Unicode and the transcription is stored in an XML format. ELAN provides several different views on the annotations; each view is connected and synchronized to the media playhead. Up to 4 video files can be associated with an annotation document. Each video can be integrated in the main document window or displayed in its own resizable window. ELAN delegates media playback to an existing media framework, like Windows Media Player, QuickTime or JMF (Java Media Framework); as a result, a wide variety of audio and video formats is supported and high-performance media playback can be achieved. ELAN is written in the Java programming language and the sources are available for non-commercial use. It runs on Windows, Mac OS X and Linux. ELAN has been functionally extended with the help of the following CLARIN-NL-funded projects:
    - ColTime: Collaboration on Time-Based Resources
    - EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX
    - MultiCon: Multilayer Concordance Functions in ELAN and ANNEX
    - SignLinC: Linking lexical databases and annotated corpora of signed languages
    Over the years, many funders have contributed to the development of ELAN in several projects, such as the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research, the Max Planck Society and the ARC Centre of Excellence for the Dynamics of Language.
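    To illustrate the tier-based, time-aligned annotations that ELAN stores in its EAF (XML) files, here is a minimal sketch using the third-party pympi library; the library choice and file name are assumptions and pympi is not part of ELAN itself.

        # Minimal sketch: read tiers and time-aligned annotations from an ELAN .eaf file
        # with the third-party pympi library (pip install pympi-ling). The file name is
        # hypothetical; pympi is not part of ELAN itself.
        from pympi.Elan import Eaf

        eaf = Eaf("recording.eaf")

        for tier in eaf.get_tier_names():
            print("Tier:", tier)
            # Each annotation carries a start time, end time (in ms) and a value.
            for annotation in eaf.get_annotation_data_for_tier(tier):
                start, end, value = annotation[0], annotation[1], annotation[2]
                print(f"  {start}-{end} ms: {value}")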
  • Web-based Annotation Explorer

    ANNEX (Annotation Explorer) is a web-based tool for exploring and viewing annotated multimedia recordings in an archive. ANNEX can play audio and video files in a web browser along with annotations in a variety of formats: ELAN (EAF), Shoebox/Toolbox text, CHAT (CHILDES annotation format), plain text, CSV, PDF, SubRip, Praat TextGrid, HTML and XML. ANNEX visualises the annotations in synchrony with the media files as long as time-alignment information is available; if no time-alignment information is available, a default segment duration is assumed. ANNEX has a graphical interface that resembles the interface of the ELAN annotation tool to some extent, with a number of different view modes such as subtitle view, timeline view and grid view. ANNEX runs in any modern web browser with the Adobe Flash plugin (> version 10) installed. ANNEX has been functionally extended with the help of the following CLARIN-NL-funded projects:
    - ColTime: Collaboration on Time-Based Resources
    - EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX
    - MultiCon: Multilayer Concordance Functions in ELAN and ANNEX
    - SignLinC: Linking lexical databases and annotated corpora of signed languages
    Over the years, many funders have contributed to the development of ANNEX in several projects, such as the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research and the Max Planck Society.
  • Ucto Tokeniser Engine

    The Ucto tokenisation engine is language independent: given an external configuration file with tokenisation rules for a specific language, it yields a tokeniser for that language that tokenises text files, separating words from punctuation and splitting sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output, optionally accepts other encodings as input, and can optionally convert everything to lowercase or uppercase. Ucto supports FoLiA XML.
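    Complementing the command-line example under the Ucto entry above, here is a minimal sketch of the engine's configuration-file-driven design using the python-ucto binding; the binding's API and the configuration name "tokconfig-nld" are given from memory and should be verified against the python-ucto documentation.

        # Minimal sketch with the python-ucto binding (pip install python-ucto).
        # The constructor takes the name or path of a tokenisation configuration file;
        # "tokconfig-nld" (bundled Dutch rules) is an assumption to verify locally, and
        # a custom rules file for another language could be passed the same way.
        import ucto

        tokenizer = ucto.Tokenizer("tokconfig-nld")
        tokenizer.process("Dit is een zin. En dit is er nog één.")

        for token in tokenizer:
            # str(token) gives the token text; sentence boundaries are marked by the engine.
            print(str(token), "(end of sentence)" if token.isendofsentence() else "")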
  • Fast and easy development of pronunciation lexicons for names

    The AUTONOMATA transcription tool set consists of a transcription tool and learning tools, with which one can enrich word lists with precise pronunciation information. The tool uses a general grapheme-to-phoneme converter (the g2p converter).
    This STEVIN project investigates new pronunciation modelling technologies that can improve the automatic recognition of spoken names in the context of a POI (Point-of-Interest) information-providing business service. Collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.
  • Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing

    A demo of a speech recognizer for POIs (Points of Interest). The demo recognizes overnight accommodation addresses and eateries in several large cities (inter alia Amsterdam, Antwerpen, Gent, Rotterdam).
    This STEVIN project investigates new pronunciation modelling technologies that can improve the automatic recognition of spoken names in the context of a POI (Point-of-Interest) information-providing business service. Collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.