
Detecting translations of the same text and data with common source

Kostadin Koroutchev and Manuel Cebrián



Compression-based similarity distances have one main drawback: the objects being compared must share the same coding scheme. In some situations, significant similarity exists with no literal shared information, as with text translations or different coding schemes. To overcome this problem, we present a similarity measure that compares the redundancy structure of the data, extracted by means of a Lempel–Ziv compression scheme. Each text is represented as a graph in which vertices are text positions and edges represent shared information; under our measure, two texts are similar if they have the same referential topology when compressed.

In this paper we give empirical evidence, together with a phenomenological explanation, that this new measure is a robust indicator of similarity between data coded in different languages.

We also consider textual data that has no structure but shares a common source, and find that we can detect such data and distinguish this situation from the previous one.
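The graph representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' measure: it assumes a simple greedy LZ77-style parse, and the function name `lz_reference_edges` and the `min_len` threshold are hypothetical choices made for illustration. Each edge links a text position to the earlier position it copies from. The example covers only the simplest case the abstract mentions, a different coding scheme modelled as a one-to-one symbol substitution, where the referential structure is preserved exactly; real translations would preserve it only approximately.

```python
def lz_reference_edges(text, min_len=2):
    """Greedy LZ77-style parse (illustrative assumption, not the paper's scheme).

    Returns a list of (position, source, length) triples: the substring that
    starts at `position` repeats the one that starts at the earlier `source`.
    """
    edges = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start position and keep the longest match;
        # ties go to the earliest source because the comparison is strict.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len >= min_len:
            edges.append((i, best_src, best_len))
            i += best_len
        else:
            i += 1  # literal symbol: no back-reference, so no edge
    return edges


# A one-to-one recoding shares no literal symbols with the original text...
original = "abcabcabc"
recoded = original.translate(str.maketrans("abc", "xyz"))

# ...yet both texts produce the same set of back-reference edges.
print(lz_reference_edges(original))
print(lz_reference_edges(recoded))
```

Because the parse depends only on which positions hold equal symbols, not on the symbols themselves, the two edge lists coincide even though a literal comparison of the strings finds nothing in common.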


Keywords

random graphs, networks

heuristics

data mining (theory)

data mining (experiment)

PACS

84.40.Ua Telecommunications: signal transmission and processing; communication satellites

MSC

68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (See also 94Axx)

Subjects

Electronics and devices

Dates

Issue 10 (October 2006)

Received 1 August 2006, accepted for publication 27 September 2006

Published 18 October 2006



Kostadin Koroutchev and Manuel Cebrián J. Stat. Mech. (2006) P10009
