Information Retrieval System to Find Articles and Clauses in UUD 1945 Using Vector Space Model Method

This study aims to find articles and clauses of the 1945 Constitution (UUD 1945) using the Vector Space Model method, which calculates the similarity of a large number of documents to a query. Each document represents a clause from an article of the 1945 Constitution. The documents are pre-processed by removing unnecessary words (stopwords) and reducing the remaining words to their base forms (stemming) for the Indonesian language. Each document is then indexed to speed up queries and simplify the weighting. Words in the documents are weighted using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, which combines the frequency of a word within a document with its frequency across all documents. The search results are presented as a ranking sorted in descending order of score, so the highest-scoring document appears at the top. A word search in this system takes approximately 90-100 milliseconds over 73 documents.


Introduction
The 1945 Constitution of the Republic of Indonesia, abbreviated as the 1945 Constitution (UUD 1945), is the written basic law and the constitution of the current government of the Republic of Indonesia. The 1945 Constitution was ratified as the state constitution on August 18, 1945. Between 1999 and 2002, it was amended four times. The amendments changed the composition of institutions within the constitutional system of the Republic of Indonesia.
Information retrieval is a method for finding unstructured data stored in a set of documents and subsequently providing information about the subject needed. The purpose of an information retrieval system is to provide the information users require by retrieving all the relevant documents while discarding most of the irrelevant ones. Users could find relevant information by reading all the documents in storage, keeping the relevant documents, discarding the irrelevant ones, and sorting the documents as needed. This would amount to a perfect information retrieval system, but it is neither practical nor efficient because users do not have enough time to read all existing documents [1]. There are several models for document information retrieval systems. They are the Gravitation Based Model, the Latent Semantic Model, the Vector Space Model, and the Generalized Vector Space Model. A good information search model allows users to determine quickly and accurately whether the contents of the documents received are exactly what they need. The Vector Space Model (VSM) is a technique that represents documents and queries as vectors in a multidimensional space. The dimensions are the terms used to build an index that represents the documents. It is the most widely used technique for information retrieval because of its simplicity, its efficiency over extensive document collections, and its appeal in use.
The effectiveness of the VSM depends on factors that include the weighting of the indexed terms, which enhances the retrieval of documents relevant to the user, and the ranking of documents against the query based on a similarity measure [2]. The VSM has been applied in various scientific fields such as Computational Linguistics [3], Expert Systems [4], Medical [5], Knowledge-Based Systems [6], Data and Knowledge Engineering [7], and so on. Research on an Information Retrieval System using the VSM method has been carried out to search text-based thesis documents by Ahmad Fauzi and Ginabila [8]. Other research on "final assignment" search and document similarity calculation in abstracts has been conducted by Putri E, Martono D, and Lis D [9]. This study aims to find articles and clauses of the 1945 Constitution (UUD 1945) using the VSM method. It calculates the similarity of a large number of documents. One document is represented by a clause from each article of the 1945 Constitution. VSM performs a statistical text/word search by calculating the weight of each word in each document with the TF-IDF formula and then calculating the similarity between the query and all stored documents with the Cosine Similarity formula [10]. The system is designed and built to produce relevant results, to be efficient and easy to implement, and, through the use of Java programming, to be compatible across operating systems.

Data set
The data were obtained from the Legal Documentation and Information Network website of the House of Representatives of the Republic of Indonesia. The study uses 73 text files with a total size of 32.6 KB.

Literature study
The literature study was carried out by studying the theories related to the object of the study found in articles, journals, and books. These resources serve as the theoretical basis of this research and show what has previously been carried out. The theories studied are Information Retrieval System, Text Mining, VSM, Term Frequency-Inverse Document Frequency (TF-IDF), ECS Stemming, Java Programming, and other subjects.

Data collection
The study was carried out using data from the 1945 Constitution. The data are then split into articles, and each article is saved as a .txt file. Samples from each file are as follows:

Data processing
Data processing in this study is illustrated in the flowchart diagram and consists of the following steps:
1) Reading: the process of reading the text file.
2) Tokenizing: a process that splits the strings of titles or keywords into their constituent words.
3) Filtering: a process that eliminates non-essential words, such as conjunctions, adverbs, and others.
4) Stemming: a process that reduces words to their base forms by removing prefixes and suffixes.
5) Indexing: the process of indexing the collected words in order to speed up data search.
6) Term weighting: the process of weighting a word or term based on its frequency and its presence in the documents of the system. The method used for term weighting is Term Frequency-Inverse Document Frequency (TF-IDF) [11].
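The tokenizing, filtering, and stemming steps can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class name, the stopword list, and the input phrase are hypothetical, and the toy stemmer merely strips the suffix "-nya" as a stand-in for a full Indonesian stemmer such as the one used in the study.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

class PreprocessSketch {
    // Illustrative stopwords only; the study's actual list is not given in the text.
    static final Set<String> STOPWORDS = Set.of("yang", "dan", "di", "oleh", "untuk");

    // Toy stand-in for a real Indonesian stemmer: strips only the suffix "-nya".
    static final UnaryOperator<String> TOY_STEMMER =
            w -> (w.length() > 5 && w.endsWith("nya")) ? w.substring(0, w.length() - 3) : w;

    // Tokenizing: lowercase the text and split on every run of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Filtering and stemming: drop stopwords, then reduce each remaining word.
    static List<String> preprocess(String text, UnaryOperator<String> stemmer) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOPWORDS.contains(token)) {
                terms.add(stemmer.apply(token));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Made-up input phrase, not taken from the data set.
        System.out.println(preprocess("Mahkamah Konstitusi dan hakimnya", TOY_STEMMER));
    }
}
```

Passing the stemmer as a parameter keeps the sketch independent of any particular stemming algorithm, so a real stemmer could be substituted without changing the rest of the pipeline.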

Data storing
The results of data processing are stored in memory, organized in Java Map objects. Keeping the data in memory enables faster searching than reading the data from the hard disk.
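One possible layout for such an in-memory index is a nested Map from each term to the documents that contain it. The sketch below is an assumption about how this could look, not the study's actual data structure; the class name, method names, and document identifiers are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMemoryIndexSketch {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Adds the already pre-processed terms of one document to the index.
    void add(String docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the per-document counts for a term, or an empty map if the term is unknown.
    Map<String, Integer> lookup(String term) {
        return postings.getOrDefault(term, Map.of());
    }

    // Number of distinct terms stored in the index.
    int vocabularySize() {
        return postings.size();
    }

    public static void main(String[] args) {
        InMemoryIndexSketch index = new InMemoryIndexSketch();
        // Hypothetical document ids and terms, for illustration only.
        index.add("doc-1", List.of("mahkamah", "konstitusi", "mahkamah"));
        index.add("doc-2", List.of("hakim", "mahkamah"));
        System.out.println(index.lookup("mahkamah"));
    }
}
```

With this layout, finding every document that contains a term is a single hash lookup, which is consistent with the observation that in-memory storage keeps searches fast.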

Information Retrieval System with a Vector Space Model
The indexing results are compared with the query by calculating their level of similarity using the VSM method. The steps are as follows:
1) Stemming the query. This process is the same as stemming in the data processing stage. It converts each query word into its base form so that it can be found in the index.
2) Calculating word weights in documents using the TF-IDF (Term Frequency-Inverse Document Frequency) formula. A value of 1 is added to the IDF calculation so that the result is never 0; without it, a log value of 0 multiplied by the TF (term frequency) would make the weight 0.
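The weighting and similarity calculation can be sketched as follows. The sketch assumes the common form idf = log10(N / df) + 1 and the standard cosine similarity; the base-10 logarithm, the class name, and the example data are assumptions, since the exact formulas are not reproduced in the text.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfCosineSketch {
    // idf = log10(N / df) + 1. The "+ 1" keeps the value above 0 even when a term
    // occurs in every document. The base-10 logarithm is an assumption.
    static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq) + 1.0;
    }

    // Builds a TF-IDF vector for one bag of terms; terms absent from the index are skipped.
    static Map<String, Double> weigh(List<String> terms, Map<String, Integer> docFreq, int totalDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : terms) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df != null && df > 0) {
                weights.put(e.getKey(), e.getValue() * idf(totalDocs, df));
            }
        }
        return weights;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies for a collection of 4 documents.
        Map<String, Integer> docFreq = Map.of("mahkamah", 2, "hakim", 1);
        Map<String, Double> doc = weigh(List.of("mahkamah", "hakim"), docFreq, 4);
        Map<String, Double> query = weigh(List.of("mahkamah"), docFreq, 4);
        System.out.println(cosine(query, doc));
    }
}
```

Note that idf(73, 73) evaluates to 1.0 rather than 0, which is the effect of the added 1 described in step 2.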

Data Indexing
During processing, each file is read and filtered by removing the stopwords that are not needed in the index. Stemming then shortens each remaining word to its base form. Table 2 summarizes the top 40 indexed words. The Word column lists the stemmed words. The Frequency column gives the number of occurrences of a word across all documents. The Document Total column gives the number of documents that contain the word. The index built from the 73 documents contains 486 words.
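The Frequency and Document Total columns can be derived from a term-to-documents map, as in the sketch below. The map layout, the class and field names, and the ordering by descending frequency are assumptions made for illustration; the text does not state how the top 40 words were selected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IndexSummarySketch {
    // One summary row: stemmed word, total occurrences, number of documents containing it.
    static final class Row {
        final String word;
        final int frequency;
        final int documentTotal;

        Row(String word, int frequency, int documentTotal) {
            this.word = word;
            this.frequency = frequency;
            this.documentTotal = documentTotal;
        }
    }

    // postings: term -> (document id -> occurrences). Returns at most `limit` rows,
    // ordered by descending frequency, with ties broken alphabetically.
    static List<Row> topWords(Map<String, Map<String, Integer>> postings, int limit) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            int frequency = 0;
            for (int count : e.getValue().values()) {
                frequency += count;
            }
            rows.add(new Row(e.getKey(), frequency, e.getValue().size()));
        }
        rows.sort((a, b) -> {
            int byFrequency = Integer.compare(b.frequency, a.frequency);
            return byFrequency != 0 ? byFrequency : a.word.compareTo(b.word);
        });
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) {
        // Hypothetical postings, for illustration only.
        Map<String, Map<String, Integer>> postings = Map.of(
                "mahkamah", Map.of("doc-1", 2, "doc-2", 1),
                "hakim", Map.of("doc-1", 1));
        for (Row row : topWords(postings, 40)) {
            System.out.println(row.word + " " + row.frequency + " " + row.documentTotal);
        }
    }
}
```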

Results of Query
Query testing is carried out in three steps, increasing the number of query words at each step. Each test reports the following:
1) The documents that match the query.
2) The similarity score between the query and each document.
3) The time needed to search.
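Ranking the matching documents by descending score and measuring the search time could be done along the following lines. The scores and document identifiers are hypothetical, and the use of System.nanoTime for timing is an assumption about how the elapsed time might be measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RankingSketch {
    // Sorts (document id, score) pairs so that the highest score comes first.
    static List<Map.Entry<String, Double>> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical similarity scores, for illustration only.
        Map<String, Double> scores = Map.of("doc-a", 0.42, "doc-b", 0.91, "doc-c", 0.10);

        long start = System.nanoTime();
        List<Map.Entry<String, Double>> ranked = rank(scores);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        for (Map.Entry<String, Double> entry : ranked) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
        System.out.println("Search time: " + elapsedMillis + " ms");
    }
}
```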

The First Testing
This test searches for the single query word "mahkamah" in the search system. The test results are as follows:
Figure 2. First testing result.

The Second Testing
This test searches for the two query words "mahkamah" and "konstitusi" in the search system. The test results are as follows:

The Third Testing
This test searches for the three query words "mahkamah", "konstitusi", and "hakim" in the search system. The test results are as follows:
From the three tests above, the comparison of query time and total documents is as follows:

Results of Query
Based on the three tests above, the more query words are entered, the more matching documents the retrieval system returns (as in the third test). This happens because the search system looks for documents that contain at least one query word, so more query words mean more documents related to the query. The number of query words does not by itself influence the score, because the score depends on how well the document matches the query words. The score is higher when the document contains more of the query words, and the lowest scores belong to documents that contain only one query word. The number of query words has no effect on the search time because the index is stored in memory. If more documents are added to the search system, the search takes longer because the number of objects in memory is higher.
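The growth in the number of retrieved documents follows from selecting every document that contains at least one query word, which is a union over the per-term document sets. The sketch below illustrates this with hypothetical postings; the class name, method name, and document identifiers are not from the study.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CandidateSketch {
    // Returns every document that contains at least one of the query terms.
    static Set<String> candidates(Map<String, Set<String>> postings, List<String> queryTerms) {
        Set<String> result = new TreeSet<>();
        for (String term : queryTerms) {
            result.addAll(postings.getOrDefault(term, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical term -> document ids mapping, for illustration only.
        Map<String, Set<String>> postings = Map.of(
                "mahkamah", Set.of("doc-1", "doc-2"),
                "konstitusi", Set.of("doc-2", "doc-3"),
                "hakim", Set.of("doc-4"));
        System.out.println(candidates(postings, List.of("mahkamah")).size());
        System.out.println(candidates(postings, List.of("mahkamah", "konstitusi")).size());
        System.out.println(candidates(postings, List.of("mahkamah", "konstitusi", "hakim")).size());
    }
}
```

Because each added query term can only add documents to the union, the candidate set never shrinks as the query grows.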

Conclusion
From the research carried out, the following can be concluded:
1) An Information Retrieval System using the VSM method can be applied to the articles and clauses of the 1945 Constitution.
2) The retrieval system can search for information in the contents of text files quickly, and the search time is not affected by the number of query words.
3) Stemming the keywords makes it easier to match words in the query with words in the documents.
4) The VSM method calculates similarity based on the statistical values of the words in a document, so it is weak at searching for interrelated words or phrases.
Many other methods exist for retrieval systems; future researchers are expected to apply them to the same object and compare them to determine the best retrieval method for this case.