An entropy analysis of the Cirebon language script using the Ternary Huffman code algorithm

Entropy is a statistical parameter that measures the average information generated by each symbol in a text. Each language typically has important statistically hidden features and a certain redundancy. These features can be exploited to build text compression tools that use resources optimally. This study analyzes the entropy of Cirebon-language text for text compression using the Ternary Huffman Code algorithm; the entropy value then serves as the reference for the compression level of the Cirebon script. The probability of each symbol in the Cirebon regional text is used to calculate the entropy value. The results show that the entropy of the Cirebon language script is 2.508 bits per symbol, with an expected code length of 2.565 bits per symbol. The estimated compression efficiency with the Ternary Huffman Code is 97.77%, and the compression rate is 0.51308.


Introduction
At present, transmitting information directly still often runs into problems, one of which is the large size of the data being sent. In addition, storage media have limited capacity: if the data to be stored exceeds that capacity, it cannot be saved [1].
The method of transmitting a sequence of information symbols is very important, because every message is certain to use more than one symbol. If each symbol is assumed to require the same transmission time, then the time needed to transmit a message is directly proportional to the number of symbols it contains [2]. Every communication system follows the same general process of transmitting messages from one point to another. The hidden statistical nature of the communication process was first recognized by the mathematician Claude Elwood Shannon. One of the most important features of Shannon's theory is the concept of entropy; with it, many sentences can be shortened without losing their meaning [3]. Entropy is a statistical parameter that measures how much information is produced, on average, by each letter of a text in a language. Each language typically has important statistically hidden features and a certain redundancy. These features can be exploited to build text compression tools that use resources optimally [4].
Data compression is a technique for reducing data from its original size so that it is smaller and more effective to store, and more efficient and faster to transmit [1]. Data compression consists of two main processes: compression and decompression. Once a file has been compressed, it must be decompressed before it can be read again [5]. There are two types of data compression: lossless and lossy [1,6,7].
There are many methods for data compression, including Huffman, Shannon, Shannon-Fano, Run-Length Encoding, Run-Length Huffman, Lempel-Ziv-Welch (LZW), and several others [1], [8]-[10]. One uniquely decodable compression method is the Huffman code [11]. The Huffman code is built on the principle that each character is encoded as a sequence of bits: characters that appear often are given the shortest bit sequences, while characters that appear rarely are given the longest ones [12].
This paper discusses the efficiency and compression rate of the Cirebon language script using the Ternary Huffman code, based on entropy. Cirebon was chosen because it is a regional language that lies at the intersection of Sundanese and Javanese. The entropy value then serves as the reference for the compression level of the Cirebon script, and the results can be used as a basis for developing compression applications for local languages.

Method
The study began with a statistical analysis of the Cirebon language script. The sample contains 1544 characters/symbols in total, comprising 25 distinct symbols that appear with different frequencies. The probability of each symbol is calculated as P = F/N, where P is the probability of the symbol, F is the frequency of the symbol, and N is the total number of symbols. Once the frequency and occurrence probability of each symbol are obtained, the coding is done with the Ternary Huffman Code algorithm. The Huffman code is built using a tree structure with three branches, called a ternary tree. First, the probability data from the text are sorted from largest to smallest. The three symbols or nodes with the smallest probabilities are then combined to form a new node whose probability is their sum; when probabilities are tied, the node placed on the upper branch has a probability no smaller than the one below it. These nodes are merged with other nodes in the same way until a complete Huffman tree is obtained, whose root probability is 1. Codewords are formed by assigning the value 0 to the upper branch, 1 to the middle branch, and 2 to the lower branch; the codeword for each symbol is the sequence of digits read from the root to its leaf [1].
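The merge-three-smallest procedure described above can be sketched in Python. This is an illustrative helper, not the implementation used in the study; the dummy-symbol padding handles symbol counts where a full ternary merge would not otherwise terminate at a single root:

```python
import heapq
from itertools import count

def ternary_huffman(probs):
    """Build ternary Huffman codewords (digits 0, 1, 2) from a
    dict mapping symbol -> probability. Illustrative sketch only."""
    symbols = dict(probs)
    # Pad with zero-probability dummies so (n - 1) is divisible by 2,
    # which each ternary merge step (removing 2 nodes) requires.
    dummy = 0
    while (len(symbols) - 1) % 2 != 0:
        symbols["_dummy%d" % dummy] = 0.0
        dummy += 1

    tie = count()  # unique tie-breaker so heapq never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in symbols.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        merged, total = {}, 0.0
        # Combine the three least-probable nodes; prefix their codes
        # with 0 (upper branch), 1 (middle), 2 (lower).
        for digit in ("0", "1", "2"):
            p, _, codes = heapq.heappop(heap)
            total += p
            for sym, code in codes.items():
                merged[sym] = digit + code
        heapq.heappush(heap, (total, next(tie), merged))

    _, _, codes = heap[0]
    return {s: c for s, c in codes.items() if s in probs}
```

Because codewords correspond to root-to-leaf paths in the ternary tree, the resulting code is prefix-free, and more probable symbols receive shorter codewords.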
From the frequency (F) and probability (P) of each symbol that appears, the entropy (H), the codewords (C), and the codeword lengths l(x) can be obtained, which are ultimately used to calculate the efficiency and compression rate of a Cirebon manuscript.
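The two quantities H and L(C) follow directly from the symbol probabilities. A minimal sketch, using the standard definitions H = -Σ p(x) log2 p(x) and L(C) = Σ p(x) l(x) (the probabilities below are illustrative, not the manuscript's actual distribution):

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_code_length(probs, lengths):
    """Expected code length L(C) = sum(p(x) * l(x)) per symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

# Illustrative example: four equally likely symbols, each coded
# with a length-2 codeword.
p = [0.25, 0.25, 0.25, 0.25]
H = entropy(p)                              # 2.0 bits per symbol
L = expected_code_length(p, [2, 2, 2, 2])   # 2.0 per symbol
```

For a uniform distribution the expected code length meets the entropy exactly; for skewed distributions, such as the Cirebon symbol frequencies, L(C) exceeds H slightly.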

Ternary Huffman code
The Ternary Huffman tree is built from the frequency and probability of each symbol obtained in the analysis; it is shown in Figure 4. The expected code length L(C) is the sum, over all characters, of the product of each character's probability and its codeword length. The complete statistical analysis data are given in Table 1.

Results analysis of ternary Huffman code
In this study, the sample contained 1544 symbols in total, with 25 distinct symbols appearing. The symbols "A", "SPACE", "n", "e", and "I" have the highest probabilities, so they receive the shortest code length of 2, with codewords 00, 01, 02, 10, and 11 respectively, while the symbol "-" has the lowest probability and therefore the longest code length of 6, with codeword 222220. The entropy obtained is 2.508 bits per symbol, with an expected code length of 2.565 bits per symbol and an efficiency of 97.77%. The analysis considers only the symbols that actually appear. The compression rate Rc is calculated by comparing the expected code length L(C) of the Huffman code with the expected code length L(n) of the original file, Rc = L(C)/L(n). In this study, L(n) is taken as a fixed-length code of 5 bits per symbol, which is enough to cover the 25 distinct symbols (2^5 = 32 ≥ 25), so Rc = 2.565/5 ≈ 0.513; the precise value obtained in this study is 0.51308.
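The reported figures can be checked against these definitions (efficiency = H/L(C), Rc = L(C)/L(n)). Note that the inputs below are the rounded values reported above, so the computed Rc reproduces the reported 0.51308 only approximately:

```python
H = 2.508      # entropy, bits per symbol (from the analysis)
L_C = 2.565    # expected Ternary Huffman code length per symbol
L_n = 5        # fixed-length code: 5 bits cover all 25 symbols (2**5 = 32)

efficiency = H / L_C          # ≈ 0.9778, i.e. about 97.77%
compression_rate = L_C / L_n  # 2.565 / 5 = 0.513
```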

Conclusion
Based on the analysis, the entropy of the Cirebon language script with the Ternary Huffman Code is 2.508 bits per symbol, with an expected code length of 2.565 bits per symbol. The estimated compression efficiency with the Ternary Huffman Code is therefore 97.77%, and the compression rate is 0.51308. These results can inform the use of Cirebon language compression in telecommunications transmission.