Speech Multi-Feature Fusion Perceptual Hash Algorithm Based on Tensor Decomposition

With the continuous progress of modern speech communication technologies, speech data are easily attacked by noise or maliciously tampered with. To give speech perceptual hashing strong robustness and high efficiency, this paper proposes a speech perceptual hash algorithm based on tensor decomposition and multiple features. The algorithm obtains each speech component through wavelet packet decomposition, then extracts the LPCC, LSP and ISP features of each component to construct a speech feature tensor. Hash values are generated by quantizing the feature matrix against its median, and speech authentication is performed by matching these hash values. Experimental results show that, compared with similar algorithms, the proposed algorithm is robust to content-preserving operations and resists common background noise. The algorithm is also computationally efficient: it meets the real-time requirements of speech communication and completes speech authentication quickly.


Introduction
Speech is the most natural, effective and convenient means of human communication. Perceptual hashing is grounded in theories of human perception in multimedia information processing: it maps multimedia data one-way to a compact perceptual digest. Only a simple and fast perceptual hash function can meet the demands of analyzing massive multimedia data. Perceptual hashing is therefore well suited to applications in information security [1].
At present, the main features used for speech perceptual hash extraction and processing include Mel-frequency cepstral coefficients (MFCC), the Hilbert transform and linear prediction coefficients; sub-band energy [2], the energy ratio of wavelet features [3] and spectral energy features [4] can also serve as speech perceptual hash features. Literature [5] proposes a perceptual hashing algorithm based on Mel cepstral coefficients combined with linear prediction cepstral coefficients; it has good robustness and tamper localization, but performs poorly in discrimination and under content-preserving operations. Literature [6] proposes a speech perceptual hashing authentication algorithm based on the MFCC correlation coefficient; it takes security into account and has good robustness and discrimination, but it is not efficient. Literature [7] uses Hilbert transform spectrum estimation to extract robust speech features and construct a perceptual hash function. Literature [8] proposes a perceptual hash algorithm based on RT and DCT, which has good robustness and high computational efficiency but poor discrimination. Literature [9] proposes a multi-format audio perceptual hashing algorithm based on the dual-tree complex wavelet transform; it is more efficient, achieves audio authentication in five different audio formats in both the original and compressed domains, and achieves small-scale tamper detection and localization, but it does not consider security and its robustness is weak.
A perceptual hash must satisfy both discrimination and robustness. Discrimination means that the hash reflects the data content uniquely: speech with different content should yield different hash sequences. Robustness measures whether speech that has undergone a content-preserving operation, such as low-pass filtering, volume increase, volume reduction or resampling, is still identified as the same as the original speech.

Tensor decomposition
Tensors have been widely used in information processing. A tensor can be regarded as an element of a tensor product of vector spaces, and it is a higher-order generalization of vectors and matrices. An N-th order tensor can be expressed as $\chi \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. There are two main tensor decomposition methods: CANDECOMP/PARAFAC (CP) and Tucker [10]. Both CP and Tucker decomposition are higher-order generalizations of the singular value decomposition (SVD) of matrices [11].
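To make the Tucker model $\chi \approx G \times_1 U_1 \times_2 U_2 \times_3 U_3$ concrete, the following minimal NumPy sketch computes it by truncated higher-order SVD (HOSVD). This is an illustration, not the authors' code: the paper does not say which Tucker algorithm it uses, HOSVD is only one standard choice (iterative methods such as HOOI also exist), and the function names are hypothetical.

```python
import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_n_product(tensor: np.ndarray, matrix: np.ndarray, mode: int) -> np.ndarray:
    """Compute tensor x_n matrix, i.e. multiply along axis `mode`."""
    # tensordot contracts the matrix columns with the chosen tensor axis and
    # puts the new axis first, so move it back into position `mode`.
    result = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(result, 0, mode)

def tucker_hosvd(tensor: np.ndarray, ranks):
    """Truncated HOSVD: returns the core tensor G and factor matrices U_n."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding form U_n.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        # G = X x_1 U_1^T x_2 U_2^T ... (projection onto the factor subspaces)
        core = mode_n_product(core, u.T, mode)
    return core, factors

def tucker_reconstruct(core: np.ndarray, factors) -> np.ndarray:
    """Rebuild the (approximate) tensor from the core and factor matrices."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_n_product(out, u, mode)
    return out
```

With full ranks the reconstruction is exact; choosing smaller ranks yields a low-dimensional core tensor of the kind used as G in the hashing scheme.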

Wavelet packet transform
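The wavelet packet transform (WPT) is a further extension of wavelet analysis theory. Whereas the wavelet transform splits only the low-frequency (approximation) band at each level, wavelet packet decomposition also splits the high-frequency (detail) bands, giving a finer frequency partition. It can therefore reflect the characteristics of a signal well and is very suitable for analyzing and processing non-stationary signals such as speech [12].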
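The difference from the plain wavelet transform, namely that both branches are split again at every level, can be seen in a short sketch. The paper does not name its wavelet or decomposition depth, so the Haar wavelet used here is an assumption chosen for brevity; a practical implementation would more likely use a library such as PyWavelets with a longer wavelet. The function name is hypothetical.

```python
import numpy as np

def haar_wavelet_packet(signal, level: int) -> list:
    """Full Haar wavelet packet tree; returns the 2**level leaf sub-bands.

    Leaves are in natural (Paley) order, not sorted by frequency.
    """
    nodes = [np.asarray(signal, dtype=float)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            if node.size % 2:
                raise ValueError("node length must be even at every level")
            even, odd = node[0::2], node[1::2]
            next_nodes.append((even + odd) / np.sqrt(2))  # low-pass branch
            next_nodes.append((even - odd) / np.sqrt(2))  # high-pass branch, also split next level
        nodes = next_nodes
    return nodes
```

Because the Haar filters are orthonormal, the total energy of the leaf sub-bands equals the energy of the input frame.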

LPCC Feature Coefficient Extraction
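The recurrence relation between the LPC coefficients $a_n$ and the LPCC coefficients $c_n$ is:
$$c_1 = a_1$$
$$c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, \quad 1 < n \le p$$
$$c_n = \sum_{k=n-p}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}, \quad n > p$$
where $p$ is the LPC order.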
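The recursion translates directly into code. This sketch assumes the sign convention $A(z) = 1 - \sum_k a_k z^{-k}$, under which $\log(1/A(z)) = \sum_n c_n z^{-n}$; with the opposite convention the $a_k$ must be negated first. The function name `lpc_to_lpcc` is hypothetical.

```python
import numpy as np

def lpc_to_lpcc(a, n_ceps: int) -> np.ndarray:
    """Convert predictor coefficients a_1..a_p into cepstral coefficients c_1..c_Q.

    Assumes A(z) = 1 - sum_k a_k z^{-k}.
    """
    a = np.asarray(a, dtype=float)
    p = a.size
    c = np.zeros(n_ceps + 1)  # c[0] unused so indices match the formula
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0       # the a_n term exists only for n <= p
        for k in range(max(1, n - p), n):        # lower limit n - p applies when n > p
            acc += (k / n) * c[k] * a[n - k - 1]  # a[n-k-1] holds a_{n-k}
        c[n] = acc
    return c[1:]
```

For a first-order predictor the recursion reproduces the closed form $c_n = a_1^n / n$.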

LSP feature parameter extraction
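Line spectral pairs (LSP) are widely used as speech feature vectors. For an LPC inverse filter $A(z)$ of order $p$, the LSPs are obtained by solving for the conjugate complex roots of the $(p+1)$-order symmetric and antisymmetric polynomials
$$P(z) = A(z) + z^{-(p+1)} A(z^{-1})$$
$$Q(z) = A(z) - z^{-(p+1)} A(z^{-1})$$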
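The roots of the above two polynomials lie on the unit circle and alternate with each other. For an even LPC order $p$, $P(z)$ has a root at $z = -1$ and $Q(z)$ has a root at $z = 1$; apart from these, $P(z)$ and $Q(z)$ each have $p/2$ pairs of conjugate complex roots on the unit circle. Removing the trivial roots gives the new polynomials
$$P'(z) = \frac{P(z)}{1 + z^{-1}} = \prod_{i=1}^{p/2} \left(1 - 2\cos\omega_i\, z^{-1} + z^{-2}\right)$$
$$Q'(z) = \frac{Q(z)}{1 - z^{-1}} = \prod_{i=1}^{p/2} \left(1 - 2\cos\theta_i\, z^{-1} + z^{-2}\right)$$
where $\cos\omega_i$ and $\cos\theta_i$, $i = 1, 2, \ldots, p/2$, are the representation of the LSP features in the cosine domain.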
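The LSP computation can be sketched with NumPy's general polynomial root finder. This assumes $A(z) = 1 - \sum_k a_k z^{-k}$ with even order $p$ and all zeros of $A(z)$ inside the unit circle; practical speech codecs usually locate the roots with a Chebyshev-polynomial search rather than `numpy.roots`. The function name `lpc_to_lsp` is hypothetical.

```python
import numpy as np

def lpc_to_lsp(a):
    """Return LSP frequencies (radians, ascending) and their cosine-domain values.

    Assumes A(z) = 1 - sum_k a_k z^{-k}, even order p, minimum-phase A(z).
    """
    a = np.asarray(a, dtype=float)
    # Coefficients of A(z) in powers of z^{-1}, padded to length p + 2.
    a_pad = np.concatenate(([1.0], -a, [0.0]))
    p_poly = a_pad + a_pad[::-1]  # P(z) = A(z) + z^{-(p+1)} A(1/z)
    q_poly = a_pad - a_pad[::-1]  # Q(z) = A(z) - z^{-(p+1)} A(1/z)
    roots = np.concatenate((np.roots(p_poly), np.roots(q_poly)))
    angles = np.angle(roots)
    eps = 1e-8
    # Keep one root of each conjugate pair; drop the trivial roots at z = -1 and z = 1.
    lsf = np.sort(angles[(angles > eps) & (angles < np.pi - eps)])
    return lsf, np.cos(lsf)
```

For $a = (0.9, -0.2)$, $P'(z) = 1 - 1.7z^{-1} + z^{-2}$ and $Q'(z) = 1 - 0.1z^{-1} + z^{-2}$, so the cosine-domain LSPs are 0.85 and 0.05.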

Speech perception hash authentication scheme
The flow chart of speech perceptual hash authentication is shown in figure 1.
Step 1: preprocessing, framing and windowing. Apply pre-emphasis to the speech under test in the speech library to boost the useful high-frequency spectrum, reduce edge effects and suppress noise. To avoid losing information between frames, the speech $x(t)$ is divided into overlapping frames and a Hamming window is applied: with frame length $L$ and frame shift $L/2$, framing yields $s(n)$, and applying the Hamming window to $s(n)$ yields $s_w(n)$, where $n$ is the frame index.
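Step 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the paper does not give the pre-emphasis coefficient, so `alpha = 0.97` (a common choice) is an assumption, frames are taken without zero-padding, and the function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(x, frame_len: int, alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis, framing with shift L/2, and Hamming windowing.

    Returns an array of shape (num_frames, frame_len); row n is s_w(n).
    """
    x = np.asarray(x, dtype=float)
    # Pre-emphasis y[t] = x[t] - alpha * x[t-1] boosts high frequencies.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    if y.size < frame_len:
        raise ValueError("signal shorter than one frame")
    hop = frame_len // 2  # frame shift L/2 (50% overlap)
    num_frames = 1 + (y.size - frame_len) // hop
    # Index matrix: row n selects samples n*hop ... n*hop + L - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = y[idx]                          # s(n)
    return frames * np.hamming(frame_len)    # s_w(n)
```

For a 4 s segment at 16 kHz with $L = 320$ samples (20 ms), this returns 399 windowed frames of 320 samples each.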
Step 2: construction and decomposition of the speech feature tensor. Perform wavelet packet decomposition on each speech frame, calculate the MFCC and ΔMFCC features of each frame, and arrange these features into the speech feature tensor χ. Apply Tucker decomposition to the feature tensor χ to obtain the low-dimensional core tensor G.
Step 3: quantization. Rearrange the core tensor G into a two-dimensional feature matrix $H^{(n)}$ and calculate the sum of each column:
$$H_R(j) = \sum_{i=1}^{k} H^{(n)}_{ij}$$
where $H^{(n)}_{ij}$ denotes the feature in row $i$ and column $j$, and $k$ is the number of rows of the feature matrix. The resulting row vector is quantized to form the hash value $h(j)$ of the speech segment:
$$h(j) = \begin{cases} 1, & H_R(j) > \hat{h}_R \\ 0, & \text{otherwise} \end{cases}$$
where $\hat{h}_R$ is the median of $H_R$.
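The quantization in Step 3 reduces to a column sum followed by a median comparison. This sketch assumes the feature matrix is already available as a k × m array and assigns 1 only when a column sum is strictly greater than the median (the paper does not state its tie rule); `median_hash` is a hypothetical name.

```python
import numpy as np

def median_hash(feature_matrix) -> np.ndarray:
    """Column sums of a k x m feature matrix, quantized against their median."""
    H = np.asarray(feature_matrix, dtype=float)
    col_sums = H.sum(axis=0)          # sum over the k rows for each column j
    h_med = np.median(col_sums)       # the median (mid-value)
    return (col_sums > h_med).astype(np.uint8)
```

Comparing against the median keeps the hash balanced (about half the bits are 1 when the column sums are distinct) and makes it insensitive to multiplying all features by a positive constant.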
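Step 4: calculation and matching of the perceptual hash distance. The bit error rate (BER), i.e. the normalized Hamming distance between the hash sequences $h_1$ and $h_2$ of two speech segments, is calculated as
$$\mathrm{BER}(h_1, h_2) = \frac{1}{N} \sum_{j=1}^{N} \left| h_1(j) - h_2(j) \right| \qquad (10)$$
where $N$ is the hash length. If $\mathrm{BER}(h_1, h_2) < \tau$, the perceptual contents of the two speech segments are judged to be the same; otherwise they are judged to be different. Here, $\tau$ is the matching threshold.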
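The matching step can be expressed in a few lines. This sketch assumes binary hash sequences of equal length and treats a pair as the same content when the BER is strictly below τ; the default τ = 0.35 is the threshold used in the experiments, and the function names are hypothetical.

```python
import numpy as np

def bit_error_rate(h1, h2) -> float:
    """Normalized Hamming distance between two equal-length binary hashes."""
    h1 = np.asarray(h1)
    h2 = np.asarray(h2)
    if h1.shape != h2.shape:
        raise ValueError("hash sequences must have the same length")
    return float(np.mean(h1 != h2))

def same_content(h1, h2, tau: float = 0.35) -> bool:
    """Judge two segments perceptually identical when BER falls below tau."""
    return bit_error_rate(h1, h2) < tau
```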

Experimental environment
The experiments were simulated in MATLAB 2010b. The speech library contains 1189 speech segments, each 4 seconds long, including speech with different content in Chinese and English as well as speech with the same content read by different people. The speech parameters are: sampling rate 16000 Hz, bit rate 256 kbps, mono channel, 16-bit sampling precision, and WAV format. The frame length is 20 ms and the frame shift is 10 ms.
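The stated parameters are mutually consistent, as a short calculation shows (assuming frames are taken without zero-padding at the end of a segment):

```latex
L = 0.020\,\mathrm{s} \times 16000\,\mathrm{Hz} = 320 \text{ samples}, \qquad
L/2 = 0.010\,\mathrm{s} \times 16000\,\mathrm{Hz} = 160 \text{ samples}

N_{\mathrm{frames}} = \left\lfloor \frac{4 \times 16000 - 320}{160} \right\rfloor + 1 = 398 + 1 = 399

16000\,\mathrm{Hz} \times 16\,\mathrm{bit} \times 1\,\mathrm{channel} = 256\,\mathrm{kbit/s}
```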

Discrimination
Computing the bit error rate between perceptual hash values yields 816003 BER values. The distribution of the obtained BER values is shown in figure 2.
The BER values between perceptual hashes of speech with different content approximately follow a normal distribution with mean μ = 0.4971 and standard deviation σ = 0.0274.
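Under this normal fit, the false acceptance rate at threshold τ is the probability that two different-content hashes have a BER below τ, i.e. FAR(τ) = Φ((τ - μ)/σ), where Φ is the standard normal CDF. The sketch below evaluates this with Python's standard library. It assumes the normal fit still holds in the far tail, which roughly 816003 samples cannot confirm empirically at probabilities near 10^-8; the function name is hypothetical.

```python
import math

def far_normal(tau: float, mu: float = 0.4971, sigma: float = 0.0274) -> float:
    """P(BER < tau) for different-content pairs under a normal fit N(mu, sigma^2)."""
    # Phi(z) = 0.5 * erfc(-z / sqrt(2)) with z = (tau - mu) / sigma
    return 0.5 * math.erfc((mu - tau) / (sigma * math.sqrt(2.0)))
```

With τ = 0.35 this gives about 4 × 10^-8, the same order of magnitude as the FAR of 3.8185e-008 reported for that threshold.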
As can be seen from table 1, the discrimination of the proposed algorithm shows no obvious improvement over that of the LPC-based speech perceptual hash algorithm.

Robustness
Applying content-preserving attacks to the speech library yields the average bit error rates. As shown in figure 2, the FRR and FAR curves are obtained from the BER values; the BERs between perceptual hash values extracted from speech with the same content are all below the threshold of 0.35, which shows that the algorithm is highly robust. Moreover, the FRR and FAR curves in the figure do not intersect, so the algorithm also has good discrimination: it can accurately distinguish content-preserving operations from malicious operations on the content. Table 2 shows that when the threshold τ = 0.35, FAR = 3.8185e-008.
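The FAR and FRR curves can be estimated empirically from two sets of BER values: those between hashes of the same content (original versus a content-preserving attack) and those between hashes of different content. The sketch below assumes a pair is accepted when BER < τ; the function name `far_frr` and the toy values in the check are illustrative only, not the paper's data.

```python
import numpy as np

def far_frr(ber_same, ber_diff, thresholds):
    """Empirical FAR and FRR for each threshold (accept when BER < tau)."""
    ber_same = np.asarray(ber_same, dtype=float)
    ber_diff = np.asarray(ber_diff, dtype=float)
    t = np.asarray(thresholds, dtype=float)[:, None]  # one row per threshold
    far = (ber_diff[None, :] < t).mean(axis=1)   # different content wrongly accepted
    frr = (ber_same[None, :] >= t).mean(axis=1)  # same content wrongly rejected
    return far, frr
```

When the two curves do not intersect, there is a range of thresholds where both rates are zero on the test data, which is the situation described above.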
As can be seen from figure 3, the decision threshold of the proposed algorithm can range between 0.33 and 0.41. As can be seen from table 2, in addition to 20 dB Gaussian white noise, the average bit error rates of the above content-preserving operations are all below the decision threshold of 0.35. Increasing or decreasing the volume does not change the vocal tract (channel) model, so the coefficients do not change significantly and volume adjustment introduces essentially no bit errors. Because the algorithm normalizes its parameters, which improves its resistance to Gaussian white noise, it is highly robust against noise.
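The volume argument can be checked numerically for LPC-type features. Scaling a frame by a gain g multiplies every autocorrelation value by g², which cancels in the normal equations, so autocorrelation-method LPC coefficients (and the LPCC and LSP features derived from them) are unchanged. The sketch below illustrates this under the assumption of autocorrelation-method LPC solved directly with NumPy; `lpc_autocorr` is a hypothetical name, and clipping or quantization at extreme volumes is not modeled.

```python
import numpy as np

def lpc_autocorr(x, p: int) -> np.ndarray:
    """Autocorrelation-method LPC: solve R a = r for a_1..a_p.

    Uses the convention A(z) = 1 - sum_k a_k z^{-k}.
    """
    x = np.asarray(x, dtype=float)
    # Autocorrelation r[0..p]; scaling x by g scales every r[k] by g**2.
    r = np.array([np.dot(x[: x.size - k], x[k:]) for k in range(p + 1)])
    # Symmetric Toeplitz matrix R[i, j] = r[|i - j|].
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p + 1])
```

For example, `lpc_autocorr(frame, 10)` and `lpc_autocorr(3.0 * frame, 10)` agree up to floating-point rounding.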

Conclusion
To balance authentication efficiency, robustness to content-preserving operations, and discrimination, an efficient and robust speech authentication algorithm based on perceptual hashing with a tensor decomposition model is proposed. Theoretical and experimental analyses show that the proposed algorithm has good anti-collision performance and robustness: it effectively resists content-preserving attacks and common background noise, and it authenticates speech content and speaker information more accurately. When building the algorithm, we considered only the matching of speech within the speech library. Because personal speaking habits differ and feature extraction is complex, future research needs to consider extracting the unique characteristics of individual speech, speech authentication that is independent of speaking time, and the establishment of a hash value database. A speaker-independent speech perceptual hash algorithm is therefore a direction for future study [13].