703 record(s) found

Search results

  • GreynirPackage 2.6.1

    GreynirPackage is a Python 3 package for working with Icelandic natural language text. Greynir can parse text into sentence trees, find lemmas, inflect noun phrases, assign part-of-speech tags, and much more. Among other things, Greynir's sentence trees can be used to extract information from text, for instance about people, titles, entities, facts, actions and opinions. Greynir uses the Tokenizer package, by the same authors, to tokenize text. More information is available at https://github.com/mideind/GreynirPackage, and detailed documentation (in English) at https://greynir.is/doc/.
  • OptaHopper: phrase-level sentiment with opinion targets

    A phrase- and sentence-level sentiment analysis tool (deep-learning TreeLSTM, TreeHopper) integrated with opinion finding. Any sentiment dictionary may be used as an input feature, including lemma-level and plWordNet emo dictionaries. For plWordNet emo, integration with the word sense disambiguation (WSD) module is provided. The OPFI (Opinion Finder) app is used for opinion target extraction.
  • Corpus extraction tool LIST 1.3

    The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software. Version 1.3 adds support for the KOST 2.0 Slovene Learner Corpus (http://hdl.handle.net/11356/1887) in XML format. It also allows program execution using the command line (see 00README.txt for details), and uses a later version of Java (tested using JDK 21). In addition, Windows users no longer need to have Java installed on their computers to run the program.
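The kind of list extraction described for LIST (frequency lists of characters, words, or word sets, exported as CSV) can be illustrated with a minimal Python sketch. This is not the tool's own code (LIST is a Java program); the function names `ngram_counts` and `to_csv`, the column headers, and the sample text are assumptions made for illustration only.

```python
import csv
import io
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives a plain frequency list)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def to_csv(counts):
    """Serialise the counts as CSV text, most frequent items first."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["item", "frequency"])  # hypothetical column names
    for item, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([" ".join(item), freq])
    return buf.getvalue()


tokens = "to be or not to be".split()
word_pairs = ngram_counts(tokens, 2)           # word sets of size 2
characters = ngram_counts(list("banana"), 1)   # character level
csv_text = to_csv(word_pairs)
```

The resulting CSV text can be written to a file and opened in a spreadsheet program, which mirrors the workflow the description mentions.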
  • The Trankit model for linguistic processing of written and spoken Slovenian 1.2

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. 
    In contrast to the previous versions of this model (1.0, 1.1), model 1.2 was trained on a new SST train-dev-test split introduced in UD v2.15.
  • ABLTagger (PoS) - 1.0.0

    A part-of-speech (PoS) tagger for Icelandic. This submission contains ABLTagger v1.0.0, a PoS tagger that works with the revised tagset and achieves an accuracy of 95.59% on MIM-Gold (cross-validation). For additional details, error analysis and error categorization for this tagger and other taggers (including a previous version of ABLTagger), see the I4 report for milestone 3 (2020) in the Language Technology Programme for Icelandic 2019-2023. For the most recent versions, installation, usage, and other instructions, see https://github.com/cadia-lvl/POS. Available on CLARIN: the Python wheel, version 1.0.0; the GitHub repository at version 1.0.0; the model files (tagger and dictionaries); and the Docker image, version 1.0.0.
  • BinPackage v0.4.2

    BinPackage is a Python package that embeds the vocabulary of the DMII (BÍN, Beygingarlýsing íslensks nútímamáls; bin.arnastofnun.is) and offers various lookups and queries of the data. The database, maintained by The Árni Magnússon Institute for Icelandic Studies, contains over 6.5 million entries, over 3.1 million unique word forms, and about 300,000 distinct lemmas. It has been encapsulated in an easy-to-install Python package and compressed from a 400+ megabyte CSV file to an indexed binary structure of about 80 megabytes. More information at https://github.com/mideind/BinPackage.
  • EFCL Channelizer

    An extremely fast digital audio channelizer implementation, usable as a building block for experimental ASR front-ends or signal denoising applications. Its high throughput also makes it applicable in software-defined radios. It comes in the form of a C/C++ library and an example executable that reads an input stream, splits it into equidistant frequency channels, and emits their data to the output. Features: (1) Hand-tuned SIMD-aware assembly for x86 (SSE) and x86-64 (AVX) as well as for ARM (NEON) processors. (2) Generic non-SIMD C++ implementation for other architectures. (3) Capable of taking advantage of multicore CPUs. (4) Fully configurable number of channels and output decimation rate. (5) User-supplied FIR for the channel separation filter, which makes it possible to specify the width of the channels and whether they overlap or are separated. (6) Input and output signal samples are treated as complex numbers. (7) Speeds over 750 complex MS/s achieved on a Core i7 4710HQ @ 2.5 GHz when channelizing into 72 output channels with a FIR length of 1152 samples, using 3 computing threads. (8) Runs under Linux.
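What a channelizer computes can be shown with a minimal, unoptimized Python sketch: each channel is shifted to baseband, filtered with the user-supplied FIR, and decimated. This is the textbook direct form, not the EFCL library's implementation or API (the library relies on hand-tuned SIMD code); the function name `channelize`, its parameters, and the test signal are assumptions made for illustration.

```python
import cmath


def channelize(samples, num_channels, fir, decimation):
    """Split complex samples into equidistant frequency channels (direct form)."""
    outputs = [[] for _ in range(num_channels)]
    taps = len(fir)
    # One output sample per `decimation` input samples, once the FIR window is full.
    for end in range(taps - 1, len(samples), decimation):
        for k in range(num_channels):
            acc = 0j
            for t in range(taps):
                n = end - t
                # Shift the centre of channel k (k / num_channels cycles per sample) to 0 Hz.
                mixed = samples[n] * cmath.exp(-2j * cmath.pi * k * n / num_channels)
                acc += fir[t] * mixed
            outputs[k].append(acc)
    return outputs


# A complex tone centred on channel 1 of 4, filtered with a 4-tap moving average.
tone = [cmath.exp(2j * cmath.pi * n / 4) for n in range(16)]
channels = channelize(tone, num_channels=4, fir=[0.25] * 4, decimation=4)
# Channel 1 carries the tone (magnitude close to 1); the other channels stay close to 0.
```

The direct form costs one full FIR pass per channel per output sample, which is why high-throughput implementations restructure the computation instead of evaluating it this way.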
  • OCR Post-Processing Tool for Icelandic 22.10

    This entry consists of two trained transformer models for correcting OCR errors, along with ca. 50,000 line pairs of OCRed/corrected text. The models were trained on ca. 900,000 lines (~7,000,000 tokens), of which only 50,000 (~400,000 tokens) came from real OCRed texts. Increasing the amount of such data can be expected to improve the tool significantly. More information in README.md.
  • UDConverter 22.01

    UDConverter is a tool for converting constituency treebanks in the format of PPCHE (Penn Parsed Corpora of Historical English) to dependency treebanks following the Universal Dependencies framework. The tool is specifically configured to convert treebanks in the IcePaHC format. This version achieves a labeled attachment score (LAS) of 81.39.
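For readers unfamiliar with the metric, the labeled attachment score is the percentage of tokens whose predicted head and dependency relation both match the gold annotation. The sketch below shows the standard definition in Python; it is not UDConverter's evaluation script, and the function name `attachment_scores` and the toy sentence are assumptions made for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    Both arguments are lists with one (head_index, relation_label) pair per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = len(gold)
    # Unlabeled score: only the head has to match.
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # Labeled score: head and relation label both have to match.
    correct_both = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100 * correct_heads / total, 100 * correct_both / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)  # UAS 75.0, LAS 50.0
```

Token 3 has the right head but the wrong label, so it counts toward the unlabeled score only; token 4 has the wrong head and counts toward neither.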