CLARIN Tool Portal

Namescape Named Entity Recognition

1 resource

Searching and visualizing named entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enable quantitative and repeatable research where previously only guesswork and anecdotal evidence were available. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary works and helps them explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: a corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: 22 novels manually tagged with detailed named entity information. IPR for this corpus does not allow distribution.
- Corpus eBooks: 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus does not allow distribution.
- Corpus SoNaR Books: 105 Dutch books, NE tagged.
- Corpus Gutenberg Dutch: 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has shown that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, e.g., what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet.

NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Gabmap

1 resource

Gabmap is a free web-based application for dialectometry. It measures the differences in sets of phonetic (or phonemic) transcriptions via edit distance. Gabmap has a graphical user interface that makes a string comparison facility available as a web application, which enables wider experimentation with the techniques. Gabmap (a.k.a. ADEPT) measures pronunciation distances based on transcriptions and aligns pronunciation transcription data. Because the measurements are numeric, they can be aggregated to obtain an estimate of overall pronunciation differences among varieties. The software uses a range of edit distance (or Levenshtein) algorithms.

The software is aimed at dialectologists and has been used extensively in dialectology. It has occasionally been used for other purposes, e.g. trying to identify loan words automatically (Paris, Musée de l’Homme, central Asian project involving Turkic and also Indo-Iranian languages). The software has also been used as the basis of a program to multi-align pronunciation data for the purpose of phylogenetic analysis. The Gabmap developers claim that the program could also be used to measure deviant pronunciation, e.g. of second-language learners or of speakers with speech defects.

A variety of related algorithms are implemented in the package of C programs (and R programs) that the developers turned into a web application. These include a basic version that regards segments only as same or different, and other versions that variously respect consonant/vowel distinctions; use phonetic segment distances as provided via an assignment of phonetic or phonological features to segments; use segment distances as learned from refining alignment correspondences; and apply weightings derived from (inverse) frequency (following Goebl’s work) or depending on the position within a word.

There are auxiliary programs aimed at assisting users in converting phonetic data to X-SAMPA and at spotting errors. (In working with users in the past, the developers have noted that data conversion is a major hurdle.) There are additional meta-analytical calculations aimed at gauging how reliable the signal is from a given set of data, and at comparing various options with respect to the degree to which they capture the geographic cohesion one assumes in dialectology. Gabmap was developed in the CLARIN-NL project ADEPT: Assaying Differences via Edit-Distance of Pronunciation Transcriptions.
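
To make the measurement concrete, the following is a minimal sketch of the basic variant described above, in which segments count only as same or different, followed by aggregation over words. It is not Gabmap's own code (which is written in C and R). The function names, the unit costs, and the toy transcriptions are assumptions made for illustration, and each character is treated as one segment.

```python
def edit_distance(a, b):
    """Levenshtein distance between two segment sequences (unit costs assumed)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, seg_a in enumerate(a, start=1):
        current = [i]  # deleting the first i segments of a
        for j, seg_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (seg_a != seg_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def site_distance(site_a, site_b):
    """Mean edit distance over the words transcribed at both sites."""
    shared = sorted(set(site_a) & set(site_b))
    if not shared:
        raise ValueError("the two sites have no words in common")
    total = sum(edit_distance(site_a[w], site_b[w]) for w in shared)
    return total / len(shared)


# Hypothetical toy data: word -> transcription, one character per segment.
site_a = {"milk": "melk", "house": "hus"}
site_b = {"milk": "molke", "house": "haus"}

print(site_distance(site_a, site_b))
```

Averaging over shared words is only one possible aggregation; the weighted and feature-based variants mentioned above would replace the same/different test with a graded segment cost.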

Nerbonne, J., Colen, R., Gooskens, C., Kleiweg, P., and Leinonen, T. (2011). Gabmap — A Web Application for Dialectology. Dialectologica, Special issue II, 65-89.

T. Leinonen, Ç. Çöltekin, J. Nerbonne, Using Gabmap. Lingua Vol. 178, 71-83, doi:10.1016/j.lingua.2015.02.004

Usage

3 resources

The system here allows you to convert images of your book pages into editable text, presented in a particular XML (eXtensible Markup Language) format called Text Encoding Initiative (TEI) XML. This format was developed specifically for marking up or annotating the text you want to work on, i.e. for adding all manner of further information to the actual text, e.g. to build a critical edition of it, which is most likely exactly what you want to do with your author's work.

Betti, A, Reynaert, M and van den Berg, H. 2017. @PhilosTEI: Building Corpora for Philosophers. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 379–392. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.32. License: CC-BY 4.0

Arthurian Fiction

1 resource

This research tool provides information on medieval Arthurian narratives and the manuscripts in which they are transmitted throughout Europe. The tool discloses a database consisting of linked records on over two hundred texts, more than a thousand manuscripts and two hundred persons. The database is a work in progress: a considerable number of records have yet to be completed, while fresh discoveries of narratives and manuscripts invite new entries. The compilers of the database hope that this tool will contribute to further research into Arthurian fiction as a pan-European phenomenon.

The Arthurian Fiction web application enables searching for manuscripts, narratives and persons in the Arthurian Fiction narratives and manuscripts metadata database, Arthurian Fiction Data. Each of these object types can be searched using facets specific to the object type. These include:
- for manuscripts: institute, date, origin, physical form, extant leaves, leaf sizes, illustration type, scripts, scribe, patron and several more;
- for narratives: date, origin, languages, cycle, manuscript, author, patron, verse type, meter, length, intertextuality properties and many more;
- for persons: name, gender, subtype, background, manuscript, and narratives.

The user can, if desired, select a subset of the facets to work with. In addition, keyword search is possible for all fields, query results can be sorted by a variety of keys, and queries can be saved. There is also a web service with an API for the Arthurian Fiction narratives and manuscripts database. This web service makes use of SOLR queries via HTTP POST requests.
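
As an illustration of what a SOLR query sent via HTTP POST can look like, here is a minimal sketch that only builds the request and does not send it. The endpoint URL and the field names (`type`, `origin`) are hypothetical placeholders, since the actual service address and schema are not given here. The parameters `q`, `fq`, `rows`, `sort` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: the real service URL is not stated in this listing.
SOLR_SELECT_URL = "https://example.org/solr/arthurian/select"


def build_solr_request(query, filters=(), rows=10, sort=None):
    """Build (but do not send) a form-encoded POST request for a Solr query."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    # Each facet restriction becomes its own filter query (fq) parameter.
    params.extend(("fq", f) for f in filters)
    if sort:
        params.append(("sort", sort))
    body = urlencode(params).encode("utf-8")
    return Request(
        SOLR_SELECT_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Keyword search restricted by two assumed facet fields.
request = build_solr_request(
    "lancelot", filters=["type:narrative", "origin:France"], rows=5
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out because the endpoint above is a placeholder.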

This movie is in Dutch with English subtitles.

Besamusca, A.A.M. and Quinlan, J. (2012). The Fringes of Arthurian Fiction. Arthurian literature, 29, 191-241.

Boot, P. (2012), Manuscripten koning Arthur op tafel, E-Data & Research 7(1), 2012.

Dalen-Oskam, K. van and Besamusca, B. (2011), Arthurian Fiction in Medieval Europe: Narratives and Manuscripts, presentation held at the CLARIN-NL Kick-off meeting Call 2, Utrecht, February 9, 2011.

Dalen-Oskam, K. van (2011), ArthurianFiction, presentation held at the Call 3 information session, Utrecht, August 25, 2011.

Corpus extraction tool LIST 1.2

2 resources

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software. Version 1.2 adds support for Gigafida 2.0 in XML format and fixes a bug which disabled the extraction of character-level n-grams from normalized forms in the GOS 1.0 corpus.
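
The kind of output described above, a frequency list of character-level n-grams written as CSV, can be sketched as follows. This is an illustration of the idea in Python and not the LIST tool itself (which is a Java program); the function names, the column headers and the sample tokens are assumptions.

```python
import csv
import io
from collections import Counter


def char_ngrams(tokens, n):
    """Count character n-grams within each token (no n-grams across tokens)."""
    counts = Counter()
    for token in tokens:
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def to_csv(counts):
    """Render counts as CSV text, most frequent first, ties in alphabetical order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ngram", "frequency"])  # assumed column names
    for ngram, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([ngram, freq])
    return buffer.getvalue()


# Hypothetical sample input standing in for tokens read from a corpus file.
tokens = ["banana", "bandana"]
print(to_csv(char_ngrams(tokens, 2)))
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet or statistics package.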

EWBST tests for english

4 resources

The submission contains tests generated for the EWBST test of English word embedding models. The tests were created with Princeton WordNet and the English synsets of plWN.

Neural Machine Translation model for Slovene-English language pair RSDO-DS4-NMT 1.2.6

3 resources

This Neural Machine Translation model for the Slovene-English language pair was trained following the NVIDIA NeMo NMT AAYN recipe (for details see the official NVIDIA NeMo NMT documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/machine_translation/machine_translation.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It translates text from Slovene to English and vice versa. The training corpus was built from publicly available datasets, including Parallel corpus EN-SL RSDO4 1.0 (https://www.clarin.si/repository/xmlui/handle/11356/1457), as well as a small portion of proprietary data. In total, the training corpus consisted of 32,638,758 translation pairs and the validation corpus of 8,163 translation pairs. The model was trained on 64 GPUs. On the validation corpus it reached a SacreBLEU score of 48.3191 (at epoch 37) for translation from Slovene to English and a SacreBLEU score of 53.8191 (at epoch 47) for translation from English to Slovene.

Parsito

2 resources

Parsito is a fast open-source dependency parser written in C++. It is based on greedy transition-based parsing, has very high accuracy, and achieves a throughput of 30K words per second. Parsito can be trained on any input data without feature engineering because it uses an artificial neural network classifier. Trained models for all treebanks from the Universal Dependencies project are available (37 treebanks as of Dec 2015). Parsito is free software under the Mozilla Public License 2.0 (http://www.mozilla.org/MPL/2.0/). The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license (http://creativecommons.org/licenses/by-nc-sa/4.0/), although for some models the original data used to create the model may impose additional licensing conditions. The Parsito website, http://ufal.mff.cuni.cz/parsito, contains download links for both the released packages and the trained models, hosts documentation, and offers an online demo. The Parsito development repository, http://github.com/ufal/parsito, is hosted on GitHub.
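
For readers unfamiliar with greedy transition-based parsing, the sketch below shows the general control loop using the arc-standard transition system. It is a generic illustration and not Parsito's implementation: the choice of arc-standard is an assumption (the transition system is not specified above), and the `scripted` helper is a hypothetical stand-in for the neural network classifier that would normally pick each action.

```python
def parse(words, choose_action):
    """Greedy arc-standard parsing; returns the head of each word (0 = root)."""
    stack = [0]  # index 0 is an artificial root node
    buffer = list(range(1, len(words) + 1))
    heads = {}
    while buffer or len(stack) > 1:
        # One decision per step, never revised: this is the greedy part.
        action = choose_action(stack, buffer, words)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) > 2:
            dependent = stack.pop(-2)  # second-from-top attaches to the top
            heads[dependent] = stack[-1]
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dependent = stack.pop()  # top attaches to the item below it
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"action {action!r} is not valid in this state")
    return [heads[i] for i in range(1, len(words) + 1)]


def scripted(actions):
    """Hypothetical stand-in for a classifier: replays a fixed action list."""
    remaining = iter(actions)
    return lambda stack, buffer, words: next(remaining)


words = ["She", "reads", "books"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
print(parse(words, scripted(actions)))
```

Because each word is shifted once and attached once, the loop runs in time linear in the sentence length, which is what makes this family of parsers fast.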

Public License Selector

3 resources

A customizable tool that helps users select the right open license for their data or software.

CEC6-Converter

2 resources

This software converts *.cec6.gz files into 24 formats that are commonly used in corpus linguistics / NLProc. It runs under all modern operating systems (Windows, Linux, macOS). The binary files have been compiled for the x64 architecture. If you are using a processor (CPU) with an x86 or ARM architecture, please use the instructions for "other operating systems or x86 / ARM / ARM64".
