Active filters:

  • Metadata provider: DSpace
  • Organisation: Polish-Japanese Academy of Information Technology
4 record(s) found

Search results

  • Parallel Corpora from Comparable Corpora tool

    The script consists of 2 parts: an article parser and an aligner.

    Required software (install before using the script): yalign, plus the following Ubuntu packages: mongodb, ipython, python-nose, python-werkzeug.

    Wiki article parser

    The article parser works in 2 steps: it extracts articles from wiki dumps and saves the extracted articles to a local DB (Mongo DB). Before using the parser, the wiki dumps should be downloaded and extracted to some directory (the directory should contain the *.xml and *.sql files). For each language, 2 dump files should be downloaded - the articles dump and the language-links dump. For example:

    PL:
    http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2
    http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-langlinks.sql.gz
    EN:
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-langlinks.sql.gz

    IMPORTANT NOTE: the English dumps will require about 50 GB of free space after extraction. During parsing, the parser can require up to 8 GB of RAM.

    The article parser has a "main language" option: it is the language for which articles from other languages are extracted only if they also exist in the main language. E.g. if the main language is PL, the article extractor first extracts all articles for PL, then articles for the other languages, but only if such articles exist in a PL translation. This reduces space requirements.

    For help use:
    $ python parse_wiki_dumps.py -h
    Example command:
    $ python parse_wiki_dumps.py -d ~/temp/wikipedia_dump/ -l pl -v

    Wikipedia aligner

    The aligner can be used once articles have been extracted from the dumps. It takes article pairs for a given language pair, aligns the text and saves the parallel corpora to 2 files. The "-s" option can be used to limit the number of symbols per file (by default the size is 50000000 symbols, which is around 50-60 MB). By default the aligner tries to continue aligning where it was stopped; to force aligning from the beginning, use the "--restart" key.

    For help use:
    $ python align.py -h
    Example command:
    $ python align.py -o wikipedia -l en-pl -v

    Euronews crawler

    The crawler finds links to articles using the Euronews archive http://euronews.com/2004/ and, in parallel, extracts and saves the article texts to the DB.

    For help use:
    $ python parse_euronews.py -h
    Example command:
    $ python parse_euronews.py -l en,pl -v

    Euronews aligner

    Starting the aligner for Euronews articles:
    $ python align.py -o euronews -l en-pl -v

    Saving articles in plain text

    The script "save_plain_text.py" can be used to save all articles in plain-text format. It accepts a path for saving the articles, the languages of the articles to be saved, and the source of the articles (euronews, wikipedia).

    For help use:
    $ python save_plain_text.py -h
    Example command:
    $ python save_plain_text.py -l en,pl -r [path] -o euronews

    Yalign selection

    This script tries a range of parameters for the yalign model in order to find the best parameters for aligning the provided text samples. Before using the yalign_selection script, you need to prepare article samples using the prepare_random_sampling.py script.

    Creating a folder with article samples can be done with this command:
    $ python prepare_random_sampling.py -o wikipedia -c 10 -l ru-en -v
    -o wikipedia - source of articles, can be wikipedia or euronews
    -c 10 - number of articles to extract
    -l ru-en - languages to extract

    This script will create an "article_samples" folder with article files. You can then create the manually aligned files (you need to align the articles of the second language); for this example you need to align the "en" file. Files named "_orig" should be left unmodified.

    When the manual aligning is ready, you can run the selection script. Here is an example:
    $ python yalign_selection.py --samples article_samples/ --lang1 ru --lang2 en --threshold 0.1536422609112349e-6 --threshold_step 0.0000001 --threshold_step_count 10 --penalty 0.014928930455303857 --penalty_step 0.0001 --penalty_step_count 1 -m ru-en

    Here is what each parameter means:
    --samples article_samples/ - path to the article samples folder
    --lang1 ru --lang2 en - languages to align (the articles of the second language should be aligned manually; the script takes the "??_orig" files, aligns them automatically and compares the result with the manually aligned ones)
    --threshold 0.1536422609112349e-6 - threshold value of the model; the selection will be made around this value
    --threshold_step 0.0000001 - step for changing the value
    --threshold_step_count 10 - number of steps to check below and above the value; e.g. if the value is 10, the step is 1 and the count is 2, the script will check 8, 9, 10, 11 and 12 (see the sketch after this section)
    The same parameters exist for the penalty.
    -m ru-en - path to the yalign model

    You can also use --length and --similarity (to tweak the comparison of text lines in the files):
    --length - minimum difference in length for lines to be marked as similar; 1 - same length, 0.5 - at least half the length
    --similarity - similarity of the text in the lines; 1 - exactly the same, 0 - completely different. For the similarity check, sentences are compared as sequences of characters (a character-level illustration appears at the end of this record).

    The script already has multiprocessing support. Use the -t option to set the number of threads; by default it is set to the number of CPUs. For additional parameters you can use the '-h' key.

    When the yalign_selection.py script finishes its work, it produces a csv file whose first column is the threshold, second column is the penalty and third column is the similarity achieved with those parameters.
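    For orientation, here is a minimal Python sketch of the stepping behaviour described for --threshold_step_count above. It only illustrates which (threshold, penalty) pairs get checked; how yalign_selection.py implements this internally is an assumption.

    # A minimal sketch of the threshold/penalty neighbourhood search described
    # above. Illustration only; yalign_selection.py itself may differ.

    def candidate_values(center, step, step_count):
        """Values to check below and above `center`, e.g. center=10,
        step=1, step_count=2 -> [8, 9, 10, 11, 12]."""
        return [center + i * step for i in range(-step_count, step_count + 1)]

    def parameter_grid(threshold, threshold_step, threshold_step_count,
                       penalty, penalty_step, penalty_step_count):
        """All (threshold, penalty) pairs the selection would try."""
        for t in candidate_values(threshold, threshold_step, threshold_step_count):
            for p in candidate_values(penalty, penalty_step, penalty_step_count):
                yield t, p

    if __name__ == "__main__":
        # The example values match the yalign_selection.py command above.
        for t, p in parameter_grid(0.1536422609112349e-6, 0.0000001, 10,
                                   0.014928930455303857, 0.0001, 1):
            print(t, p)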
    Align with the hunalign method

    In order to use hunalign, you need to add the "--hunalign" option to the align.py script. Here is an example:
    $ python align.py -l li-hu -r align_result -o wikipedia --hunalign
    In my empirical study it provides better results when the articles are translations of each other or similar in length and content.

    Align from folder

    For aligning already aligned texts using hunalign, the command example is:
    $ python align_aligned_using_hunalign.py source/ target/

    Final info

    Wołk, K., & Marasek, K. (2015, September). Tuned and GPU-accelerated parallel data mining from comparable corpora. In International Conference on Text, Speech, and Dialogue (pp. 32-40). Springer International Publishing. http://arxiv.org/pdf/1509.08639

    For more detailed usage instructions see howto.pdf. For any questions: | Krzysztof Wolk | krzysztof@wolk.pl
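    As mentioned in the Yalign selection section above, the --similarity check compares sentences as sequences of characters. The minimal sketch below, using Python's standard difflib, shows a metric with the same 1-to-0 semantics; whether the script uses this exact ratio is an assumption.

    # Character-level similarity in the spirit of the --similarity option:
    # 1.0 means the lines are exactly the same, 0.0 completely different.
    # Whether yalign_selection.py uses this exact metric is an assumption.
    from difflib import SequenceMatcher

    def line_similarity(a, b):
        """Compare two sentences as sequences of characters."""
        return SequenceMatcher(None, a, b).ratio()

    print(line_similarity("parallel corpora", "parallel corpora"))  # 1.0
    print(line_similarity("abc", "xyz"))                            # 0.0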
  • Polish Speech Services

    This archive contains the source code and configuration of the speech tools web service available at http://mowa.clarin-pl.eu/mowa. The services provided include:
    + speech to text alignment
    + speaker diarization
    + speech transcription
    + speech activity detection and noise classification
    + keyword spotting
  • Long term archive operating system source code

    This submission contains the operating system of the long-term archive built at the Polish-Japanese Academy of Information Technology for the Clarin-PL project.

    The basic elements of the archive are data nodes equipped with mass storage. The nodes are controlled by embedded low-power computers which are independently powered up only when their storage is about to be accessed. This not only limits the overall energy consumption but also lowers environmental demands (no air conditioning is needed). The nodes are grouped in trays. The basic and recommended configuration allows for 30 nodes per tray, but it is possible to extend this limit up to 253. Each tray contains several networks designed for data transport, device state control and power supply.

    Communication with clients is conducted through buffers, which are the only parts visible from externally connected networks. Stored files are therefore completely isolated and cannot be accessed directly. Multiple trays located at a single physical site form a complete archive. It is possible to split the storage space into virtual archives that are separated on the logical level.

    The operating system of the data network allows storing from 3 to 7 copies of a single digital file on different nodes. Moreover, additional copies of a resource may be stored automatically in remotely located archives; the trays are treated as local parts of a wider, dispersed data-network structure. The software of the archive enables not only secure read and write operations on data but also automatically takes care of the stored data: it periodically regenerates the physical state of saved files, and in case of a device failure clients are transparently redirected to local or remote redundant copies.

    A mechanism of "software bots" was implemented: the archive can be supplied with external programs for processing the files stored inside the data network. This allows for data analysis, indexing, metadata creation, statistical computations or finding associations in unstructured data sets of the Big Data type. Only the output of a software bot can be accessed externally, which makes such operations very secure.

    Client programs communicate with the archive using a set of simple protocols based on key-value pair strings, making it convenient to build web interfaces for archive access and administration (see the sketch below).

    By automating the supervision of the resources, reducing storage requirements and precisely controlling energy consumption, the proposed solution significantly lowers the cost of long-term data storage.
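    Since the record describes client communication via simple key-value pair string protocols, here is a minimal sketch of what composing and parsing such a message could look like. The field names and the line-based key=value framing are hypothetical illustrations; the actual protocol of the archive is not specified in this description.

    # A minimal sketch of a key-value pair string message of the kind the
    # archive protocols are described as using. The "key=value" framing and
    # the field names below are hypothetical, not the archive's real format.

    def encode(fields):
        """Serialize a request as key-value pairs, one per line."""
        return "\n".join(f"{key}={value}" for key, value in fields.items())

    def decode(message):
        """Parse a key-value message back into a dict."""
        return dict(line.split("=", 1) for line in message.splitlines() if line)

    request = encode({"operation": "read", "file_id": "abc123", "copies": "3"})
    print(request)
    print(decode(request))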