Active filters:

  • Project: CLARIN in the Netherlands
  • Project: CLARIAH-CORE
15 records found

Search results

  • Blacklab AutoSearch Corpus Search

    This demonstrator allows users to define one or more corpora and upload data for them, after which the corpora are automatically made searchable in a private workspace. Users can upload text data annotated with lemmas and part-of-speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is limited to begin with (25 MB per uploaded file; 500,000 tokens per corpus), but these limits may be raised at a later point in time. The search application is powered by the INL BlackLab corpus search engine, and the search interface is the same as the one used in, for example, the Corpus Hedendaags Nederlands (Corpus of Contemporary Dutch).
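    Programmatic access is possible through the underlying BlackLab Server REST API. A minimal sketch, with a hypothetical server URL and corpus name standing in for your own AutoSearch workspace:

```python
import requests

# Hypothetical endpoint and corpus name; substitute your own AutoSearch
# workspace. The /hits route and its parameters follow the BlackLab
# Server documentation.
SERVER = "https://example.org/blacklab-server"
CORPUS = "mycorpus"

resp = requests.get(
    f"{SERVER}/{CORPUS}/hits",
    params={
        "patt": '[lemma="fiets"]',  # CQL: all tokens whose lemma is "fiets"
        "outputformat": "json",
        "number": 10,               # return at most 10 hits
    },
)
resp.raise_for_status()

for hit in resp.json().get("hits", []):
    # Each hit contains the matched tokens plus left and right context.
    left = " ".join(hit.get("left", {}).get("word", []))
    match = " ".join(hit.get("match", {}).get("word", []))
    right = " ".join(hit.get("right", {}).get("word", []))
    print(f"{left} [{match}] {right}")
```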
  • FLAT: FoLiA-Linguistic-Annotation-Tool

    FLAT is a web-based linguistic annotation environment built around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and to enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure. Features:
      - Web-based, multi-user environment.
      - Server-side document storage, divided into 'namespaces'; by default each user has their own namespace. Active documents are held in memory server-side. Read and write permissions for namespaces are fully configurable.
      - Concurrency: multiple users may edit the same document simultaneously.
      - Full version control for documents (using git, in foliadocserve), allowing limitless undo operations.
      - Full annotator information, with timestamps, is stored in the FoLiA XML and can be displayed by the interface; the git log also contains verbose information on annotations.
      - Annotators can indicate their confidence for each annotation.
      - Highly configurable interface; interface options may be disabled on a per-configuration basis, and multiple configurations can be deployed on a single installation.
      - Displays and retains document structure (divisions, paragraphs, sentences, lists, etc.).
      - Support for corrections, of text or linguistic annotations, and for alternative annotations. Corrections are never mere substitutes; originals are always preserved. Spelling corrections for run-ons, splits, insertions and deletions are supported.
      - Supports FoLiA Set Definitions to display label sets. Sets are not predefined in FoLiA and anybody can create their own.
      - Supports token annotation and span annotation, including complex span annotations such as dependency relations, syntactic units (constituents), predicates and semantic roles, sentiments, statements/attribution, and observations.
      - Simple metadata editor for editing/adding arbitrary metadata to a document; selected metadata fields can also be shown in the document index.
      - User permission model featuring groups, group namespaces, and assignable permissions.
      - File/document management functions (copying, moving, deleting).
      - Allows converter extensions to convert from other formats to FoLiA on upload.
      - In-document search (CQL or FQL); advanced searches can be predefined by administrators.
      - Morphosyntactic tree visualisation (constituency parses and morphemes).
      - Higher-order annotation: associate features, comments and descriptions with any linguistic annotation.
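    Since FLAT stores everything as FoLiA XML, documents annotated in FLAT can also be processed programmatically. A minimal sketch, assuming the foliapy library is installed and using a hypothetical local file example.folia.xml:

```python
import folia.main as folia  # the foliapy library

# Hypothetical input file; any FoLiA document managed in FLAT would do.
doc = folia.Document(file="example.folia.xml")

for sentence in doc.sentences():
    for word in sentence.words():
        try:
            # pos() and lemma() return the annotation class, if present.
            print(word.text(), word.lemma(), word.pos())
        except folia.NoSuchAnnotation:
            print(word.text())
```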
  • PICCL: Philosophical Integrator of Computational and Corpus Libraries

    PICCL is a set of workflows for corpus building through OCR, post-correction, modernization of historical language, and Natural Language Processing. It combines Tesseract optical character recognition, TICCL functionality and Frog functionality in a single pipeline. Tesseract is open-source software for optical character recognition. TICCL (Text-Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in that corpus; the corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, recording for each word type how often it occurs in the corpus; the frequencies of the normalized word forms are the sums of the frequencies of the actual word forms found in the corpus. On this basis TICCL detects and corrects typographical errors (misprints) and OCR errors in texts. When books or other texts are scanned from paper by a machine that turns the scans (images) into digital text files, errors occur: for instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can detect such errors and suggest a correct form (a toy illustration of this idea follows below). Frog enriches textual documents with various linguistic annotations.
    Martin Reynaert, Maarten van Gompel, Ko van der Sloot and Antal van den Bosch. 2015. PICCL: Philosophical Integrator of Computational and Corpus Libraries. Proceedings of CLARIN Annual Conference 2015, pp. 75-79. Wrocław, Poland. http://www.nederlab.nl/cms/wp-content/uploads/2015/10/Reynaert_PICCL-Philosophical-Integrator-of-Computational-and-Corpus-Libraries.pdf
    PICCL
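    TICCL's real machinery rests on anagram hashing and corpus-wide frequency ranking; purely as a toy illustration of the variant-detection idea sketched above (and not PICCL's actual algorithm), the following maps low-frequency word forms to similar high-frequency forms using a made-up frequency list:

```python
from difflib import SequenceMatcher

# Made-up frequency list standing in for TICCL's corpus-wide counts.
freq = {"regeering": 1200, "regeermg": 3, "de": 90000, "het": 85000}

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude similarity test standing in for TICCL's anagram hashing."""
    return a != b and SequenceMatcher(None, a, b).ratio() >= threshold

corrections = {}
for rare in (w for w, f in freq.items() if f < 10):
    # Candidate corrections: frequent forms close to the rare form.
    candidates = [w for w, f in freq.items() if f >= 100 and similar(rare, w)]
    if candidates:
        # Pick the most frequent plausible correction.
        corrections[rare] = max(candidates, key=freq.get)

print(corrections)  # -> {'regeermg': 'regeering'}
```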
  • Nederlab, online laboratory for humanities research on Dutch text collections

    The Nederlab project aims to bring together all digitized texts relevant to Dutch national heritage and to the history of Dutch language and culture (c. 800 to the present) in one user-friendly, tool-enriched, open-access web interface, allowing scholars to simultaneously search and analyze texts spanning the full recorded history of the Netherlands, its language and its culture. The project builds on various initiatives: for corpora, Nederlab collaborates with the scientific libraries and institutions; for infrastructure, with CLARIN (and CLARIAH); for tools, with eHumanities programmes such as CATCH, IMPACT and CLARIN (TICCL, Frog). Nederlab offers a large number of search options with which researchers can find the occurrences of a particular term in a particular corpus or subcorpus, and it visualizes search results through line graphs, bar graphs, pie charts, or scatter plots. Furthermore, the online lab offers a large set of tools, such as tokenization tools, tools for spelling normalization, PoS-tagging tools, lemmatization tools, a computational historical lexicon, and indices; (semi-)automatic syntactic parsing, text mining, data mining and sentiment mining, named entity recognition, coreference resolution, plagiarism detection, paraphrase detection, and cartographical tools are offered as well. The first version of Nederlab was launched in early 2015 and will be expanded until the end of 2017. Nederlab is financed by NWO, KNAW, CLARIAH and CLARIN-NL.
    http://www.nederlab.nl/wp/?page_id=12
  • Frog: An advanced Natural Language Processing Suite for Dutch (Web Service and Application)

    Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. It performs automatic linguistic enrichment such as part-of-speech tagging, lemmatisation, named entity recognition, shallow parsing, dependency parsing and morphological analysis. All NLP modules are based on TiMBL. A minimal usage sketch follows the references and service links below.
    Iris Hendrickx, Antal van den Bosch, Maarten van Gompel, Ko van der Sloot and Walter Daelemans. 2016. Frog: A Natural Language Processing Suite for Dutch. CLST Technical Report 16-02, pp. 99-114. Nijmegen, the Netherlands. https://github.com/LanguageMachines/frog/blob/master/docs/frogmanual.pdf
    Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114. http://ilk.uvt.nl/downloads/pub/papers/tadpole-final.pdf
    Frog (plain text input)
    Frog (folia+xml input)
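    A minimal usage sketch, assuming the python-frog binding is installed (for instance via LaMachine); option and field names follow that binding's documentation:

```python
from frog import Frog, FrogOptions

frog = Frog(FrogOptions(parser=False))  # skip dependency parsing for speed
tokens = frog.process("Dit is een voorbeeldzin.")

for token in tokens:
    # Each token is a dict with, among others, the word form, lemma,
    # part-of-speech tag and morphological analysis.
    print(token["text"], token["lemma"], token["pos"], token["morph"])
```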
  • Ucto Tokeniser

    Ucto tokenises text files: it separates words from punctuation and splits sentences, one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as case conversion, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine itself is language-independent: by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output and optionally accepts other encodings as input. Optional conversion to all lowercase or all uppercase is available. Ucto supports FoLiA XML. A usage sketch follows the service link below.
    Ucto
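    A minimal usage sketch, assuming the python-ucto binding is installed; tokconfig-nld is the Dutch rule file that ships with ucto:

```python
import ucto

tokenizer = ucto.Tokenizer("tokconfig-nld")
tokenizer.process("Mw. Jansen fietste op 3 jan. naar Utrecht. Daarna at ze.")

# Iterate over the resulting tokens; sentence boundaries are flagged on
# the tokens themselves.
for token in tokenizer:
    print(token, end=" ")
    if token.isendofsentence():
        print()  # newline after each detected sentence
```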
  • CLARIN Vocabulary Service

    The CLARIN Vocabulary Service is a running instance of the OpenSKOS exchange and publication platform for SKOS vocabularies. OpenSKOS offers several ways to publish SKOS vocabularies (upload a SKOS file, harvest from another OpenSKOS instance via OAI-PMH, or construct a vocabulary using the RESTful API) and several ways to use them (search and autocomplete via the API, harvest via OAI-PMH, inspect in the interactive Editor, or consult as Linked Data); a small API sketch follows the reference below. This CLARIN OpenSKOS instance is hosted by the Meertens Institute. It currently publishes SKOS versions of three vocabularies:
      - ISO 639-3 language codes, as published by SIL.
      - Closed and simple Data Categories from the ISOcat metadata profile.
      - A manually constructed and curated list of organizations, based on the CLARIN VLO.
    Brugman, H. 2017. CLAVAS: A CLARIN Vocabulary and Alignment Service. In: Odijk, J. and van Hessen, A. (eds.), CLARIN in the Low Countries, ch. 5, pp. 61-69. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.5
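    A minimal sketch of the "use" side of the API, querying concepts over HTTP; the instance URL is a hypothetical placeholder, and the exact response fields may differ per instance and vocabulary:

```python
import requests

# Hypothetical instance URL; the find-concepts endpoint and its q/format
# parameters follow the OpenSKOS API documentation.
BASE = "https://openskos.example.org/api"

resp = requests.get(f"{BASE}/find-concepts",
                    params={"q": "dutch", "format": "json"})
resp.raise_for_status()

# The JSON response is Solr-like: a "response" object with "docs".
for doc in resp.json().get("response", {}).get("docs", []):
    print(doc.get("uri"), doc.get("prefLabel"))
```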
  • The Typological Database System (TDS)

    The Typological Database System (TDS) is a web-based service that provides integrated access to a collection of independently developed typological databases. Unified querying is supported with the help of an integrated ontology. The component databases of the TDS are cross-linguistic databases, developed for research in language typology and linguistics. Together they contain some 1,200 different descriptive properties, with information about more than 1,000 languages. (Because of the heterogeneous nature of the collection, most properties are only filled for a fraction of the languages.) Most of the data takes the form of high-level "analytical" properties, but there are also a few collections of example sentences (with glosses) illustrating particular phenomena.

    Language typology, the study of the range of language variation and universals, is a data-intensive discipline that increasingly relies on electronic databases. Improved availability of the data collected in the TDS enhances its potential to support linguistic research. The TDS can be used to help answer questions such as "which languages have the basic word order Verb-Object-Subject?", "what kinds of phonological stress systems are common?", or "are languages with subject-verb agreement more likely to allow null subjects than languages without it?". The system is not an oracle: in all cases, only partial information is returned, as collected and deposited in the system by the creators of the component databases. But this information can be invaluable to other researchers, either as a complete answer to a specific question or as the starting point for further research.

    Given that the collected data represents linguistic analysis and often novel theoretical approaches, it is impossible to map it to a single "consensus" standard. While in some limited cases it is possible to completely reconcile data from different sources, the system places a premium on preserving the theoretical orientations and analyses of the component databases, which are presented side by side as alternative datasets in the same topical group.

    The TDS project was carried out by a research group of the Netherlands Graduate School of Linguistics (LOT), with members representing the University of Amsterdam, Leiden University, Radboud University Nijmegen, and Utrecht University. It was developed with support from NWO (Netherlands Organization for Scientific Research) grant 380-30-004 / INV-03-12 and from the participating universities. The initial phase of the project started in September 2000, and the project entered the implementation phase on 1 May 2004. Originally scheduled to run for three years, it was extended until 31 December 2007. The TDS server and data collections continued to be augmented until 2009.

    While the original TDS web server is still operational, web technologies evolve rapidly. The system had begun to show its age even before the end of the project in 2009, motivating migration of the data collection to an archival platform. Due to the complexity and diversity of the component databases, however, the data cannot be usefully navigated without specialized supporting software; useful archiving necessitates a software access point alongside the static data. Under the "TDS Curator" project, supported by a CLARIN-NL Call 1 grant, the TDS has been migrated to a new platform, hosted by Data Archiving and Networked Services (DANS), that conforms to CLARIN infrastructural requirements. Both versions of the system remain in operation.
    Windhouwer, M., Dimitriadis, A. and Akerman, V. 2017. Curating the Typological Database System. In: Odijk, J. and van Hessen, A. (eds.), CLARIN in the Low Countries, pp. 123–132. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.11. License: CC-BY 4.0
    A. Dimitriadis, M. Windhouwer, A. Saulwick, R. Goedemans, T. Bíró. How to integrate databases without starting a typology war: The Typological Database System. In S. Musgrave, M. Everaert and A. Dimitriadis (eds.), The use of databases in cross-linguistic research, Mouton de Gruyter, March 2009.
    M. Windhouwer, A. Dimitriadis. Sustainable operability: Keeping complex resources alive. In Proceedings of the LREC workshop on Sustainability of Language Resources and Tools for Natural Language Processing (SustainableNLP08), Marrakech, Morocco, May 31, 2008.
    A. Dimitriadis. Managing Differences: The TDS Approach. In Proceedings of the E-MELD Workshop on Toward the Interoperability of Language Resources (E-MELD 2007), Stanford, CA, July 13-15, 2007. Position paper.
    A. Dimitriadis, A. Saulwick, M. Windhouwer. Semantic relations in ontology mediated linguistic data integration. In Proceedings of the E-MELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Linguistic Resources (E-MELD 2005), Cambridge, Massachusetts, July 1-3, 2005.
    A. Saulwick, M. Windhouwer, A. Dimitriadis, R. Goedemans. Distributed tasking in ontology mediated integration of typological databases for linguistic research. In J. Castro and E. Teniente (eds.), Proceedings of the CAiSE'05 Workshops (International Workshop on Data Integration and the Semantic Web (DISWeb'05), in conjunction with CAiSE'05), Volume I, pp. 303-317, Porto, Portugal, June 14, 2005.
    A. Dimitriadis, P. Monachesi. Integrating Different Data Types in a Typological Database System. In P. Austin, H. Dry and P. Wittenburg (eds.), Proceedings of the International Workshop on Resources and Tools in Field Linguistics, Las Palmas, Canary Islands, Spain, 2002.
    P. Monachesi, A. Dimitriadis, R. Goedemans, A. Mineur, M. Pinto. A Unified System for Accessing Typological Databases. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 3), Las Palmas, Canary Islands, Spain, 2002.
  • Ucto Tokeniser Engine

    The Ucto tokenisation engine is language-independent: given an external configuration file with tokenisation rules for a specific language, it yields a tokeniser for that language. The resulting tokeniser separates words from punctuation and splits sentences, one of the first tasks for almost any Natural Language Processing application. It also offers several other basic preprocessing steps, such as case conversion, to make text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish, and is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output and optionally accepts other encodings as input. Optional conversion to all lowercase or all uppercase is available. Ucto supports FoLiA XML.
  • @PhilosTEI

    @PhilosTEI allows you to convert images of your book pages into editable text in XML (eXtensible Markup Language), specifically the flavour defined by the Text Encoding Initiative (TEI XML). This format was developed specifically for marking up or annotating text, i.e. for adding all manner of further information to the actual text, for example in order to build a critical edition of it, which is most likely exactly what you want to do with your author's work. A sketch of the target format follows the reference below.
    Betti, A., Reynaert, M. and van den Berg, H. 2017. @PhilosTEI: Building Corpora for Philosophers. In: Odijk, J. and van Hessen, A. (eds.), CLARIN in the Low Countries, pp. 379–392. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.32. License: CC-BY 4.0
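    To give an impression of the target format, the sketch below assembles a minimal (far from complete) TEI XML skeleton with lxml; real @PhilosTEI output is considerably richer:

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"

# Minimal skeleton: a TEI header with a title, and a body with one
# paragraph of (hypothetical) OCRed and corrected page text.
root = etree.Element(tei("TEI"), nsmap={None: TEI_NS})
header = etree.SubElement(root, tei("teiHeader"))
filedesc = etree.SubElement(header, tei("fileDesc"))
titlestmt = etree.SubElement(filedesc, tei("titleStmt"))
etree.SubElement(titlestmt, tei("title")).text = "Example edition"

text = etree.SubElement(root, tei("text"))
body = etree.SubElement(text, tei("body"))
etree.SubElement(body, tei("p")).text = "OCRed and corrected page text."

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```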