The potency of the efficiency and compression rate of Cirebon language script using the Binary Huffman algorithm

Each language usually has statistically hidden features and a certain amount of redundancy. These features can be exploited to compress text and so make better use of resources. This study analyzes the potential efficiency and compression rate of Cirebon language script under the Binary Huffman algorithm. The analysis is based on the entropy of the Cirebon language script. The study begins by analyzing the script to calculate the probability of each symbol. These probabilities are then used to calculate the entropy. The results show that the entropy of the Cirebon language script is 3.976 bits per symbol, with an expected code length of 4.02 bits per symbol. The estimated efficiency of compression with the Binary Huffman Code is therefore 98.89%, and the compression rate is 0.80402. These results can inform the use of Cirebon language compression in telecommunications transmission.


Introduction
In a language, some symbols or symbol pairs appear more often than others. How often a symbol or symbol pair appears can be seen from its frequency distribution [1]. Language has potential in the application of communication technology, especially in source compression [2]. Khoirul et al. analyzed the potential of source compression for the Indonesian local languages Sundanese, Javanese, and Balinese, based on outage probability using the Slepian-Wolf coding theorem [2].
Source compression changes the size of the data by reducing duplication, so that the representation becomes smaller than the original data [2,3]. It also addresses the challenges of future telecommunications needs, which are expected to give rise to many beneficial applications [4].
Future information exchange is expected to be more effective and efficient, enabling higher speeds. Characters and grammar usually differ from one language to another, and within a language some symbols appear more frequently than others [1,5,6]. Because of this property, every language has potential for efficient use in future communication. This paper discusses the potential efficiency and compression rate of Cirebon language script using the Binary Huffman algorithm. Cirebon language was chosen because it is a regional language that overlaps with Sundanese and Javanese but still has its own unique characteristics. Besides the Huffman Code, other data compression methods include arithmetic coding, LZW, and Run Length Encoding [7][8][9].
Huffman codes are used in this research because they are decodable [10,11]: data that has been compressed can be restored to the original data [2,7]. The aim of this research is to determine the entropy, to assign codewords using the Huffman algorithm in order to obtain the expected code length, and to compute the efficiency, as a basis for assessing the compression level of Cirebon language text. It is expected that the results can contribute to the development of information theory in planning language use for future communication.

Method
The study began with a statistical analysis of Cirebon text. The sample contained 1544 characters/symbols in total, made up of 25 distinct symbols that appear with different frequencies. The probability of each symbol is calculated by

$$p(x) = \frac{F(x)}{N} \qquad (1)$$

where F(x) is the frequency of symbol x and N is the total number of symbols in the sample. Entropy is calculated using

$$H(X) = -\sum_{x} p(x) \log_2 p(x) \qquad (2)$$

where p(x) is the probability of the symbol appearing.
The next stage is building the binary tree with the Huffman Code Algorithm. The two symbols with the lowest probabilities are summed to form a new node, and the same step is repeated for the remaining symbols and nodes. In each merge, the two symbols or nodes either have equal probability, or the one with the higher probability is placed on the upper branch and the one with the lower probability on the lower branch. Nodes are merged in this way until a complete Huffman tree is obtained whose root has a total probability of 1. Each branch is then labeled, with 0 for the upper branch and 1 for the lower branch, and the codeword of each symbol is the sequence of bits read from the root to its leaf. The result of building the binary tree is a codeword and a code length (the bit length of the codeword) for each symbol in Cirebon script.
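The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy input string, and the tie-breaking rule for equal probabilities are assumptions, and the branch labeling assumes that the higher-probability node of each merge sits on the upper (0) branch.

```python
import heapq
import math
from collections import Counter


def symbol_probabilities(text):
    """Relative frequency p(x) = F(x) / N for every symbol in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def huffman_code(probs):
    """Build a binary Huffman code; returns {symbol: codeword string}."""
    if len(probs) == 1:
        return {sym: "0" for sym in probs}
    # Heap entries: (probability, tie_breaker, {symbol: partial codeword}).
    # The integer tie_breaker keeps equal probabilities comparable.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p_low, _, low = heapq.heappop(heap)    # lowest probability
        p_high, _, high = heapq.heappop(heap)  # second lowest probability
        # Assumed convention: higher probability -> upper branch -> bit 0.
        merged = {s: "0" + c for s, c in high.items()}
        merged.update({s: "1" + c for s, c in low.items()})
        heapq.heappush(heap, (p_low + p_high, counter, merged))
        counter += 1
    return heap[0][2]


sample = "abracadabra"  # toy string for illustration, not Cirebon data
probs = symbol_probabilities(sample)
codes = huffman_code(probs)
print(f"H = {entropy(probs):.3f} bits/symbol")
for sym in sorted(codes, key=lambda s: len(codes[s])):
    print(repr(sym), codes[sym])
```

Different tie-breaking choices can produce different codewords for symbols of equal probability, but the expected code length of the resulting Huffman code stays the same.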
The expected code length L(C) is calculated using

$$L(C) = \sum_{x} p(x)\, l(x) \qquad (3)$$

where l(x) is the code length and p(x) is the probability of the symbol in Cirebon script. Efficiency $\eta$ is obtained by comparing the entropy with the expected code length obtained previously, using

$$\eta = \frac{H(X)}{L(C)} \times 100\% \qquad (4)$$

The compression rate is calculated by comparing the expected code length L(C) of the encoded text with the standard expected length L(n):

$$\text{Compression rate} = \frac{L(C)}{L(n)} \qquad (5)$$

where the standard expected length L(n) is the fixed code length that can cover all symbols in the text; for Cirebon language, L(n) ≈ 5.
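The three quantities defined above can be computed from a table of probabilities and code lengths, as in the sketch below. The helper names and the four-symbol toy source are hypothetical, and taking L(n) as the smallest whole number of bits that can index every symbol is an assumption consistent with the value of 5 used for 25 symbols.

```python
import math


def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def expected_length(probs, lengths):
    """L(C) = sum p(x) * l(x), in bits per symbol."""
    return sum(probs[sym] * lengths[sym] for sym in probs)


def fixed_length_bits(n_symbols):
    """Assumed L(n): smallest whole number of bits covering n_symbols."""
    return max(1, math.ceil(math.log2(n_symbols)))


def efficiency(entropy_bits, expected_bits):
    """Ratio H(X) / L(C); multiply by 100 for a percentage."""
    return entropy_bits / expected_bits


def compression_rate(expected_bits, fixed_bits):
    """Ratio L(C) / L(n); values below 1 mean the code saves bits."""
    return expected_bits / fixed_bits


# Toy source (not Cirebon data) with a matching Huffman code.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}

h = entropy(probs)
l_c = expected_length(probs, lengths)
l_n = fixed_length_bits(len(probs))
print(f"H = {h}, L(C) = {l_c}, L(n) = {l_n}")
print(f"efficiency = {efficiency(h, l_c):.2%}")
print(f"compression rate = {compression_rate(l_c, l_n)}")
```

For this toy source every probability is a power of one half, so the Huffman code meets the entropy exactly; for real text the expected length is generally somewhat above the entropy.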

Results and analysis
The frequency (F) of each symbol gives its probability (P), from which the entropy value (H) is obtained. The codeword (C) and code length l(x) of each symbol are read from the Huffman tree shown in Figure 1. The expected code length L(C) is the sum, over all symbols, of the product of the probability (P) of each character and its codeword length l(x).

Figure 1. Binary tree Huffman code.
The data obtained from the Huffman tree are presented in Table 1. In this study, the symbols "A", "SPACE", and "n" have the highest probability values, so they have the shortest code length of 3 bits, with codewords 000, 001, and 110. The symbols "c" and "-" have the lowest probability values, so they have the longest code length of 9 bits, with codewords 010110110 and 010110111. The entropy value is obtained using Equation (2).

Table 1. Analysis result data.

In this study, the expected code length of the original file refers to the standard equal-length value of 5 bits. A value of 5 is enough to cover all of the symbols, of which there are only 25. Thus, using Equation (5), the compression rate is L(C)/L(n) = 4.020/5 ≈ 0.804 (reported as 0.80402).
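The arithmetic behind these figures can be checked with a few lines of Python. This sketch uses only the rounded values reported in the text (entropy 3.976, expected length 4.020, 25 symbols), so the last digits may differ slightly from the reported 98.89% and 0.80402, which were presumably computed from unrounded values; the variable names are illustrative.

```python
import math

H = 3.976        # entropy reported in the text, bits per symbol (rounded)
L_C = 4.020      # expected Huffman code length reported, bits per symbol (rounded)
N_SYMBOLS = 25   # number of distinct symbols in the sample

# Smallest whole number of bits that can index 25 symbols: 2**4 < 25 <= 2**5.
L_n = math.ceil(math.log2(N_SYMBOLS))

eta = H / L_C         # efficiency, Equation (4) without the percent factor
rate = L_C / L_n      # compression rate, Equation (5)
saving = 1 - rate     # fraction of bits saved versus the fixed-length code

print(f"L(n) = {L_n} bits")
print(f"efficiency = {eta:.4f}")
print(f"compression rate = {rate:.4f}")
print(f"saving = {saving:.1%}")
```

A compression rate of about 0.804 means the Huffman-coded text needs roughly four fifths of the bits of a 5-bit fixed-length encoding.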

Conclusion
The Cirebon language script has an entropy of 3.976 bits per symbol, and the Binary Huffman Code gives an expected code length of 4.020 bits per symbol. The estimated efficiency of compression with the Binary Huffman Code is therefore 98.89%, and the compression rate is 0.80402. These results can inform the use of Cirebon language compression in telecommunications transmission.