Kostadin Koroutchev and Manuel Cebrián J. Stat. Mech. (2006) P10009 doi:10.1088/1742-5468/2006/10/P10009
Compression-based similarity distances have the main drawback of requiring the same coding scheme for the objects being compared. In some situations, significant similarity exists with no literal shared information: text translations, different coding schemes, etc. To overcome this problem, we present a similarity measure that compares the redundancy structure of the data, extracted by means of a Lempel–Ziv compression scheme. Each text is represented as a graph in which vertices are text positions and edges represent shared information; under our measure, two texts are similar if they have the same referential topology when compressed.
In this paper we give empirical evidence, together with a phenomenological explanation, that this new measure is a robust indicator that detects similarity between data coded in different languages.
We also consider textual data without any structure but with a common source, and find that we can detect such data and distinguish this situation from the previous one.
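The graph representation described in the abstract can be illustrated with a minimal sketch. It assumes a brute-force, greedy LZ77-style parse in which each back-reference becomes an edge from the current text position to the earlier position it copies from. The function name `lz_reference_edges` and the `min_len` parameter are hypothetical, chosen for illustration only; the compression scheme and the graph comparison actually used in the paper may differ.

```python
def lz_reference_edges(text, min_len=2):
    """Greedy LZ77-style parse (illustrative assumption, not the paper's code).

    Returns a list of edges (position, source, length): the substring of
    `length` symbols starting at `position` repeats the one starting at
    the earlier index `source`.
    """
    edges = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start position and keep the longest match.
        for j in range(i):
            k = 0
            # The source may run into the current position, as in LZ77.
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len >= min_len:
            edges.append((i, best_src, best_len))
            i += best_len  # skip the copied phrase
        else:
            i += 1  # literal symbol, no edge
    return edges


plain = "abcabcabc"
# Re-code the same text with a different alphabet (a substitution cipher).
coded = plain.translate(str.maketrans("abc", "xyz"))

print(lz_reference_edges(plain))  # [(3, 0, 6)]
print(lz_reference_edges(coded))  # [(3, 0, 6)]
```

The two strings share no symbols, yet their edge lists coincide, because the edges depend only on where repetitions occur and not on which symbols are repeated. This is the sense in which a measure built on the referential structure can be insensitive to the coding scheme. The search above is brute force and meant only for short inputs.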
PACS: 84.40.Ua Telecommunications: signal transmission and processing; communication satellites
Issue 10 (October 2006)
Received 1 August 2006, accepted for publication 27 September 2006
Published 18 October 2006