TiCClops: Text-Induced Corpus Clean-up online processing system
<?xml version="1.0" encoding="UTF-8"?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1"
xmlns:cmdp="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1342181139640"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
CMDVersion="1.2"
xsi:schemaLocation="http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1342181139640 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1342181139640/1.2/xsd">
<cmd:Header>
<cmd:MdCreator>rogierkraf</cmd:MdCreator>
<cmd:MdCreationDate>2013-11-30+02:00</cmd:MdCreationDate>
<cmd:MdProfile>clarin.eu:cr1:p_1342181139640</cmd:MdProfile>
<cmd:MdCollectionDisplayName>CLARIN Netherlands</cmd:MdCollectionDisplayName>
</cmd:Header>
<cmd:Resources>
<cmd:ResourceProxyList>
<cmd:ResourceProxy id="TICCLops001">
<cmd:ResourceType>Resource</cmd:ResourceType>
<cmd:ResourceRef>http://hdl.handle.net/10032/6d9be58990cc19becc6c37fd85996a26</cmd:ResourceRef>
</cmd:ResourceProxy>
<cmd:ResourceProxy id="TICCLops002">
<cmd:ResourceType>Resource</cmd:ResourceType>
<cmd:ResourceRef>http://ticclops.clarin.inl.nl/ticclops/</cmd:ResourceRef>
</cmd:ResourceProxy>
<cmd:ResourceProxy id="TICCLops003">
<cmd:ResourceType>Resource</cmd:ResourceType>
<cmd:ResourceRef>http://ticclops.clarin.inl.nl/philostei/</cmd:ResourceRef>
</cmd:ResourceProxy>
</cmd:ResourceProxyList>
<cmd:JournalFileProxyList/>
<cmd:ResourceRelationList/>
</cmd:Resources>
<cmd:Components>
<cmdp:ClarinSoftwareDescription>
<cmdp:GeneralInfo>
<cmdp:name xml:lang="eng">TiCClops</cmdp:name>
<cmdp:title xml:lang="eng">TiCClops: Text-Induced Corpus Clean-up online processing system</cmdp:title>
<cmdp:publicationYear>unknown</cmdp:publicationYear>
<cmdp:url>http://ticclops.clarin.inl.nl/ticclops/</cmdp:url>
<cmdp:CLARINCentre>Dutch Language Institute</cmdp:CLARINCentre>
<cmdp:OriginalSource>http://portal.clarin.nl/node/1914</cmdp:OriginalSource>
<cmdp:ReleaseStatus>
<cmdp:LifeCycleStatus>published</cmdp:LifeCycleStatus>
<cmdp:lastUpdate>2015-03-25</cmdp:lastUpdate>
</cmdp:ReleaseStatus>
<cmdp:NationalProjects>
<cmdp:Project>
<cmdp:name>CLARIN-NL</cmdp:name>
<cmdp:title>CLARIN in the Netherlands</cmdp:title>
<cmdp:id>184.021.003</cmdp:id>
<cmdp:funder>NWO</cmdp:funder>
<cmdp:url>http://www.clarin.nl</cmdp:url>
<cmdp:Contact>
<cmdp:Person>Jan Odijk</cmdp:Person>
<cmdp:Role>National Coordinator</cmdp:Role>
<cmdp:Address>Utrecht, the Netherlands</cmdp:Address>
<cmdp:Email>j.odijk@uu.nl</cmdp:Email>
<cmdp:Department>UiL-OTS</cmdp:Department>
<cmdp:Organisation>Utrecht University</cmdp:Organisation>
</cmdp:Contact>
<cmdp:Duration>
<cmdp:StartYear>2009</cmdp:StartYear>
<cmdp:CompletionYear>2015</cmdp:CompletionYear>
</cmdp:Duration>
</cmdp:Project>
</cmdp:NationalProjects>
<cmdp:Country>
<cmdp:CountryName>Netherlands</cmdp:CountryName>
<cmdp:CountryCoding>NL</cmdp:CountryCoding>
</cmdp:Country>
<cmdp:Description>
<cmdp:Description>TICCL (Text-Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be a single text or several texts, in one or more directories, located on one or more machines. TICCL creates word frequency lists that record, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus. TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. Errors occur when books or other texts are scanned from paper and the resulting images are converted by machine into digital text files. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect these errors and to suggest a correct form.
Text-Induced Corpus Clean-up (TICCL) was first developed as a prototype at the request of the Koninklijke Bibliotheek - The Hague (KB) and was reworked into a production tool according to KB specifications, mainly during the second half of 2008 (currently at production version 2.0). It is a fully functional environment for processing possibly very large corpora in order to largely remove the undesirable lexical variation in them. It has provisions for various input and output formats, is flexible and robust, and has very high recall and acceptable precision. As a spelling variation detection system it is, to the developer's knowledge, unique in making principled use of the input text as a possible source of canonical target output forms. As such it is far less domain-sensitive than other approaches: the domain is largely covered by the input text collection.
TICCL comes in two variants: one with a classic CLAM web application interface, and one with the PhilosTEI interface.
</cmdp:Description>
</cmdp:Description>
</cmdp:GeneralInfo>
<cmdp:SoftwareFunction>
<cmdp:toolCategory>written language tool</cmdp:toolCategory>
<cmdp:ToolTasks>
<cmdp:toolTask>corpus processing</cmdp:toolTask>
<cmdp:toolTask>orthographic normalisation</cmdp:toolTask>
</cmdp:ToolTasks>
<cmdp:ResearchPhases>
<cmdp:ResearchPhase>Enriching Data</cmdp:ResearchPhase>
</cmdp:ResearchPhases>
<cmdp:ResearchDomains>
<cmdp:researchDomain>Linguistics</cmdp:researchDomain>
</cmdp:ResearchDomains>
<cmdp:LinguisticsSubject>
<cmdp:linguisticsSubject>general linguistics</cmdp:linguisticsSubject>
<cmdp:Description>
<cmdp:Description/>
</cmdp:Description>
</cmdp:LinguisticsSubject>
<cmdp:LinguisticsSubject>
<cmdp:linguisticsSubject>orthography</cmdp:linguisticsSubject>
<cmdp:Description>
<cmdp:Description/>
</cmdp:Description>
</cmdp:LinguisticsSubject>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Dutch</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>nld</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>English</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>eng</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Finnish</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>fin</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>French</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>fra</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>German</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>deu</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>German (Fraktur)</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>deu</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Classical Greek</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>grc</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Modern Greek</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>ell</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Icelandic</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>isl</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Italian</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>ita</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Latin</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>lat</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Polish</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>pol</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Portuguese</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>por</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Russian</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>rus</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Spanish</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>spa</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
<cmdp:LanguageVariety>
<cmdp:languageDependent>yes</cmdp:languageDependent>
<cmdp:Language>
<cmdp:LanguageName>Swedish</cmdp:LanguageName>
<cmdp:ISO639>
<cmdp:iso-639-3-code>swe</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Language>
<cmdp:Centuries>
<cmdp:centuryDependent>no</cmdp:centuryDependent>
</cmdp:Centuries>
</cmdp:LanguageVariety>
</cmdp:SoftwareFunction>
<cmdp:SoftwareImplementation>
<cmdp:distributionMedium>Online available</cmdp:distributionMedium>
<cmdp:UserInterface>
<cmdp:interfaceType>graphical user interface</cmdp:interfaceType>
<cmdp:applicationType>web application</cmdp:applicationType>
</cmdp:UserInterface>
<cmdp:Input>
<cmdp:inputType>image</cmdp:inputType>
<cmdp:inputResource>picture to be OCR-ed and orthographically normalised</cmdp:inputResource>
<cmdp:MimeType>
<cmdp:MimeType>image/tiff</cmdp:MimeType>
<cmdp:MimeType>image/vnd.djvu</cmdp:MimeType>
<cmdp:MimeType>application/pdf</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Input>
<cmdp:Input>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:inputType>text</cmdp:inputType>
<cmdp:inputResource>text to be orthographically normalised</cmdp:inputResource>
<cmdp:Schema>
<cmdp:schemaname/>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>text/plain</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Input>
<cmdp:Input>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:inputType>text</cmdp:inputType>
<cmdp:inputResource>text to be orthographically normalised</cmdp:inputResource>
<cmdp:Schema>
<cmdp:schemaname>FoLiA</cmdp:schemaname>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>text/xml</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Input>
<cmdp:Input>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:inputType>text</cmdp:inputType>
<cmdp:inputResource>lexicon</cmdp:inputResource>
<cmdp:Schema>
<cmdp:schemaname>CSV</cmdp:schemaname>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>text/csv</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Input>
<cmdp:Input>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:inputType>text</cmdp:inputType>
<cmdp:inputResource>frequency list</cmdp:inputResource>
<cmdp:Schema>
<cmdp:schemaname>CSV</cmdp:schemaname>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>text/csv</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Input>
<cmdp:Input>
<cmdp:inputType>text</cmdp:inputType>
</cmdp:Input>
<cmdp:Output>
<cmdp:outputType>text</cmdp:outputType>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:outputResource>results of orthographic normalisation</cmdp:outputResource>
<cmdp:Schema>
<cmdp:schemaname>TEI</cmdp:schemaname>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>text/xml</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Output>
<cmdp:Output>
<cmdp:outputType>text</cmdp:outputType>
<cmdp:characterEncoding>UTF8</cmdp:characterEncoding>
<cmdp:outputResource>input files in one PDF file</cmdp:outputResource>
<cmdp:Schema>
<cmdp:schemaname/>
</cmdp:Schema>
<cmdp:MimeType>
<cmdp:MimeType>application/pdf</cmdp:MimeType>
</cmdp:MimeType>
</cmdp:Output>
</cmdp:SoftwareImplementation>
<cmdp:Access>
<cmdp:ResourceLicense>
<cmdp:license>unknown</cmdp:license>
<cmdp:distributionType>public</cmdp:distributionType>
<cmdp:url>http://ticclops.clarin.inl.nl/ticclops/</cmdp:url>
<cmdp:Price>
<cmdp:amount>0</cmdp:amount>
<cmdp:ISO4217>
<cmdp:iso-4217-currency>EUR</cmdp:iso-4217-currency>
</cmdp:ISO4217>
</cmdp:Price>
</cmdp:ResourceLicense>
<cmdp:Contact>
<cmdp:Email>servicedesk@ivdnt.org</cmdp:Email>
<cmdp:Organisation xml:lang="nld">Instituut voor de Nederlandse Taal</cmdp:Organisation>
<cmdp:Organisation xml:lang="eng">Institute for the Dutch Language</cmdp:Organisation>
<cmdp:Url>http://www.ivdnt.org/</cmdp:Url>
</cmdp:Contact>
</cmdp:Access>
<cmdp:ResourceDocumentation>
<cmdp:Documentation>
<cmdp:title>TICCLops User and Demonstrator Documentation</cmdp:title>
<cmdp:documentationTarget>user</cmdp:documentationTarget>
<cmdp:url>http://ticclops.uvt.nl/ticclops_manual.v101.pdf</cmdp:url>
<cmdp:ISO639>
<cmdp:iso-639-3-code>eng</cmdp:iso-639-3-code>
</cmdp:ISO639>
</cmdp:Documentation>
<cmdp:Publication>
<cmdp:publicationCategory>in proceedings</cmdp:publicationCategory>
<cmdp:publicationPurpose>scientific background</cmdp:publicationPurpose>
<cmdp:peerReviewStatus>yes</cmdp:peerReviewStatus>
<cmdp:Description>
<cmdp:Description LanguageID="eng">Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco.
</cmdp:Description>
</cmdp:Description>
</cmdp:Publication>
<cmdp:Publication>
<cmdp:publicationCategory>article</cmdp:publicationCategory>
<cmdp:publicationPurpose>scientific background</cmdp:publicationPurpose>
<cmdp:peerReviewStatus>yes</cmdp:peerReviewStatus>
<cmdp:Description>
<cmdp:Description LanguageID="eng">Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, pp. 1-15. URL: http://dx.doi.org/10.1007/s10032-010-0133-5
</cmdp:Description>
</cmdp:Description>
</cmdp:Publication>
<cmdp:Pictures>
<cmdp:picture type="other" width="150">
http://dev.clarin.nl/sites/default/files/TICClops.jpg
</cmdp:picture>
</cmdp:Pictures>
</cmdp:ResourceDocumentation>
<cmdp:SoftwareDevelopment>
<cmdp:Project>
<cmdp:name>TiCClops: Text-Induced Corpus Clean-up online processing system</cmdp:name>
<cmdp:title>TiCClops: Text-Induced Corpus Clean-up online processing system</cmdp:title>
<cmdp:funder>CLARIN-NL</cmdp:funder>
<cmdp:url>http://portal.clarin.nl/node/1914</cmdp:url>
<cmdp:Contact>
<cmdp:Person>dr. Martin Reynaert</cmdp:Person>
<cmdp:Email>reynaert@tilburguniversity.edu</cmdp:Email>
<cmdp:Organisation xml:lang="eng">Tilburg University</cmdp:Organisation>
</cmdp:Contact>
<cmdp:Duration/>
</cmdp:Project>
<cmdp:Creator>
<cmdp:Contact>
<cmdp:Person>dr. Martin Reynaert</cmdp:Person>
<cmdp:Email>reynaert@tilburguniversity.edu</cmdp:Email>
<cmdp:Organisation xml:lang="eng">Tilburg University</cmdp:Organisation>
</cmdp:Contact>
</cmdp:Creator>
</cmdp:SoftwareDevelopment>
<cmdp:TechnicalInfo>
<cmdp:ImplementationLanguage>
<cmdp:implementationLanguage>unknown</cmdp:implementationLanguage>
<cmdp:version>unknown</cmdp:version>
</cmdp:ImplementationLanguage>
</cmdp:TechnicalInfo>
</cmdp:ClarinSoftwareDescription>
</cmd:Components>
</cmd:CMD>