Big Data Management Using Hadoop

Today, one of the key issues is the design of systems and software to store, manage and process the large amounts of data produced by the exponential growth of data, much of which is unstructured. Because of the size and complexity of these data, management with traditional approaches is impractical, and Hadoop is an appropriate solution for the continuous growth of data sizes. In this paper we propose techniques and algorithms for dealing with big data, including data collection and preprocessing. The fragmentation algorithm takes the role of a distributed implementation of the traditional time-sharing file system model, in which multiple users share files and storage resources. We also used the Hadoop framework, the Apache project for reliable, scalable, distributed computing, to improve query performance and reduce response time. The results showed that Hadoop is the best way to deal with big data: the response time of a complex query was, for example, 00:00:01 (hh:mm:ss) under Hadoop, compared with 00:01:11 for the same query under the fragmentation algorithm and 00:05:13 on the standard database. We conclude that the total access time for complex queries is shorter in distributed processing than in non-distributed processing.


Introduction
At present, the volumes of data arriving from heterogeneous sources (e.g. commerce, academia, finance) are almost unimaginable. The abundance of smart devices and things on the Internet, together with the behind-the-scenes distributed systems and applications that support them (smart grid systems, for example), drives this growth [1,2]. Big Data describes huge amounts of structured and unstructured data that are too large to process with traditional databases and software technologies [3].
With most relational database management systems and desktop statistics and simulation packages, it is hard to operate on such enormous volumes of information; instead, they require "massively parallel software running on tens, hundreds or even thousands of servers". What is regarded as "big data" varies with the capabilities of the organization managing the data set and of the applications traditionally used to process and analyze it in its domain. For some organizations, facing hundreds of gigabytes of data for the first time may prompt a reconsideration of data management options; for others it may take tens or hundreds of terabytes before data volume becomes an important issue. A possible solution is to build a distributed data file system technique (the fragmentation algorithm) using Hadoop, which reduces the above limitations and minimizes the sum of total query maintenance cost and response time of the selected views, which is known as the view selection problem. Researchers have proposed algorithms and frameworks that solve this problem to reach the best distributed-system solution. The main contributions of this work are: (1) high query performance; (2) improved scalability and availability; (3) integration of data from multiple sites and of heterogeneous data; and (4) offloading of batch processing. The term Big Data therefore refers not only to the huge volume of data coming from different sources but also to the diversity of data types, delivered at various speeds and frequencies [4]. See figure 1.

Related Works:
In 2015, Jie Song, Chaopeng Guo, Zhi Wang, Yichan Zhang, Ge Yu and Jean-Marc Pierson published HaoLap: A Hadoop-based Big Data OLAP System, describing HaoLap (Hadoop-based OLAP), a big-data OLAP (OnLine Analytical Processing) application. They designed a Hadoop-based OLAP system and applied dedicated algorithms to each task: dimension coding and traversal algorithms for roll-up operations over dimension hierarchies, and partition and linearization algorithms to store dimensions and measures. It produced effective results for OLAP and complex queries [5].
In 2016, Ikbal Taleb, Hadeel T. El Kassabi, Mohamed Adel Serhani, Rachida Dssouli and Chafik Bouhaddioui published Big Data Quality: Evaluation of Quality Dimensions. Data is the most precious asset of firms; when its quality degrades, the effects are unpredictable and can lead to complete misinterpretations. They proposed an efficient data quality assessment system that applies sampling strategies to big data sets: sampling reduces the data size to a representative sample of the population for quick quality assessment. The findings showed that the mean sample quality score was representative of the original data, illustrating the value of sampling in reducing the computational cost of evaluating big data quality [6].
In 2017, Sonia Ordonez Salinas and Alba Consuelo Nieto Lemus published Data Warehouse and Big Data Integration. Opinions on data warehouses and big data differ: many believed that the data warehouse had vanished with the advent of big data, while others achieved a fusion of the two by identifying their intersection points, the gap between them, and the tasks they share. The authors proposed models covering, for example, the integration target, the technology used, the architecture and other common characteristics [7].
In 2017, Xiaolei Li, Zhenyu Tu, Quanchao Jia, Xinjiang Man, Hui Wang and Xiuli Zhang published Deep-Level Big Data Quality Management. New business opportunities can be gained by using big data analysis to improve performance and raise yields. The data analysis was carried out with industrial companies, and an off-line data reference model library was developed. By deploying the web application with Spark, they concluded that unlabeled data can be processed in real time in the manufacturing sector [8].
In 2018, Konstantinos Vassakis, Emmanuel Petrakis and Ioannis Kopanakis published Big Data Analytics: Applications, Perspectives and Challenges. The enormous increase in data varies from generation to generation: in the previous generation the growth of industrial enterprises, people and advanced technology led companies to compete with one another, but growth is now driven by the rapidly expanding Web and social networking sites. The ability to acquire, evaluate and act on statistics ("data-driven decision systems") is a powerful capability that can benefit governments and stands out as an important resource [9].

Characteristics of Big Data:
Big data can be described by the following characteristics:
1-Volume
Volume, the amount of data created and stored, is one distinctive property that must be considered when dealing with Big Data. The volume of data around the world is expected to reach 40 ZB by 2020.

2-Velocity
In this context, new information emerges faster than before: data is produced and processed to meet growth and development requirements and challenges. Big data is frequently available in real time.
3-Variety
Variety refers to the type and nature of the data. Understanding it enables those analyzing the data to use the resulting insight effectively. Big data derives from text, pictures, audio and video, and data fusion can complete missing pieces.

4-Veracity
The quality of captured data can vary greatly, which affects the accuracy of analysis [10].

Big Data and Hadoop
Approximately 44 times more data will be available in the next generation than today, and every data store, even a grid, has limited capacity [11]. To manage this quantity of data, the problem must be divided into components and these components distributed to various computers working in parallel. Whenever several machines cooperate, the possibility of failure increases; faults are expected and common in a distributed setting. If switches and routers break down, networks may suffer partial or complete failure. Data may fail to arrive at a given stage in time because of network congestion. Individual compute nodes also suffer failures such as overheating, crashes, hard drive failures, or exhaustion of memory or storage space. Data may be corrupted, or transferred maliciously or incorrectly, and different versions of client software may speak slightly different protocols. The distributed system should be able to recover from component failures or temporary error conditions and continue to make progress. Hadoop can process very large quantities of unstructured records and events, in particular data with variable structure (or no structure at all) [15]. See figure 2. Hadoop offers the following advantages:
1-Cost-effective: Hadoop is open source, and distributions from suppliers such as Cloudera or MapR are also free; they charge only for support [16].
2-Flexible: We can add, edit or remove a node without any system downtime. Hadoop is schema-free and can absorb any kind of data, structured or not, from any number of sources [16].
3-Scalable: A cluster can be extended by adding new servers or resources without relocating, reformatting or altering dependent analytical workflows or applications. It can handle any amount of data, whereas the legacy system could handle only 10 TB [17].
4-Reliable: The system is constructed so that data remains accessible even when failures occur [17].

The Proposed System
The proposed system illustrates the main steps from data collection to results obtained using the following algorithms and techniques.

Distributed Processing.
A major challenge in dealing with large data is the distributed file system. It uses several computers connected to one another over any available network; when a request arrives, it is sent to these computers and answered immediately, saving time in data retrieval.

Data Fragmentation
To handle large data, the data is distributed across multiple computers either horizontally or vertically according to the fragmentation algorithm, and then handled in a client-server architecture when a comprehensive complex OLAP query is needed. As shown in algorithm 1.
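As a minimal sketch of the horizontal scheme (assuming an in-memory list of records rather than the paper's actual SQL tables), fragmentation and its reconstruction can be expressed as:

```python
# Horizontal fragmentation: split the rows of a relation across N fragments
# (db1..db4 in the paper). Each fragment holds complete rows, so the
# original relation can be rebuilt with a simple union.

def fragment_horizontally(rows, n_fragments):
    """Distribute rows round-robin into n_fragments lists."""
    fragments = [[] for _ in range(n_fragments)]
    for i, row in enumerate(rows):
        fragments[i % n_fragments].append(row)
    return fragments

def reconstruct(fragments):
    """The union of the horizontal fragments restores the original relation."""
    return [row for frag in fragments for row in frag]

sales = [{"id": i, "amount": 10 * i} for i in range(8)]
db1, db2, db3, db4 = fragment_horizontally(sales, 4)
assert sorted(reconstruct([db1, db2, db3, db4]), key=lambda r: r["id"]) == sales
```

A vertical scheme would instead split each row's columns across fragments, keeping the key in every fragment so rows can be rejoined.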

Replication of data
Replication is one of the technologies used to copy data to more than one site, so that data lost from its designated place is preserved elsewhere. Used together with fragmentation, as integrated work within the client-server architecture, it stores data more reliably, makes more data available, and can give a detailed report on anything, whether homogeneous or not.
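The idea can be sketched as follows; the class and method names are illustrative, not from the paper's implementation, and each "site" is modelled as an in-memory dictionary:

```python
# Replication sketch: every write is copied to all replica sites, so the
# data survives the loss of any single site.

class ReplicatedStore:
    def __init__(self, n_replicas):
        # One key-value store per site.
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value):
        # Synchronously apply the write to every replica site.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        raise KeyError(key)

    def fail_site(self, i):
        # Simulate total data loss at site i.
        self.replicas[i].clear()

store = ReplicatedStore(3)
store.write("order-1", 250)
store.fail_site(0)          # one site loses all its data
assert store.read("order-1") == 250  # the value survives on another site
```

Synchronizing the replicas after writes is what preserves consistency across the sites.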

Network Regulation
Distributed data should operate within a network environment, where possible within the area of a building (LAN) or a city (MAN). Implementation of the system was based on an internal network (LAN) within the organization's building, using a client-server architecture.

Dealing with Hadoop
Hadoop has become a fundamental data management platform for big data analytics, and the Hadoop technology stack has become the de facto standard for analyzing big data. The Hadoop ecosystem contains many tools: HBase, Storm, Pig, Hive, Oozie and Ambari, to name just a few. We can certainly use Azure VMs to create our own custom Hadoop solution, or we can let the Azure platform provision and manage one for us via the HDInsight service; HDInsight clusters can be deployed on either Windows or Linux. Provisioning Hadoop clusters with HDInsight (as opposed to doing the same manually) can be a significant time saver.

1-System Requirements
We need Internet access and a browser (Internet Explorer 10 or later) to access Microsoft Azure, and Visual Studio 2019 to work with the concepts used in developing Azure applications. The system requirements are:
• Windows 10
• A computer with an Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz (2.00 GHz boost) or a faster processor (2 GHz recommended).
• When the SQL Server Management Studio Connect to Server dialog box opens, enter the complete server name, select SQL Server Authentication, and provide the administrative login and password that we set when the database was created.
Microsoft will then upload the SQL database to Azure. HDInsight Hadoop will work against this SQL database, and Hadoop will respond very rapidly to any complex query we submit.

Implementation of the System
The interfaces display the execution of the designed system, and the results of the execution are discussed. The results are also presented in tables, displayed through interfaces, and plotted in graphs as a visualization of the final results.

Distributed processing
The machines in a distributed system interact with each other through different media, such as high-speed networks or a local network. They share no main memory or disks.
• Data Fragmentation
If the LSalesDB relation is fragmented, LSalesDB is divided into a number of fragments db1, db2, db3, db4. These fragments carry enough information to reconstruct the original relation, either through a union operation or through a unique-type operation on the different pieces. There are two distinct schemes (horizontal and vertical) for fragmenting a relation. We divided the original data horizontally into four sections (db1, db2, db3, db4), one section per client (client1, client2, client3, client4). As shown in figure 3.
• Network Regulation
The distributed data process runs within the network environment. The four computers, referred to as client1, client2, client3 and client4, are connected by the network (LAN or MAN); one machine works as the server and the other four as clients, as explained in chapter 3. As shown in figure 4. To connect the four clients, we set for each client a local host and a serial port starting from 5000, 6000, 7000 and 8000 sequentially and then press the start button. We then go to the original server and press the Execute button to show the main interface, and click on any complex query found in appendix (B); the results appear as in figure 5. Note that each client (storage node) holds 860.160 KB of data.
• Data Replication
Replication is a collection of techniques for copying and distributing information and database items from one database to another and then synchronizing the databases to preserve consistency. As shown in figure 6 (a, b).
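The server's fan-out of a query to the four clients can be sketched as follows; the real implementation communicates over the LAN on ports 5000-8000, whereas here the clients and their fragments are modelled in-process, with hypothetical names and data:

```python
# Sketch: the server sends a query predicate to all fragment clients in
# parallel and unions their partial results into the final answer.

from concurrent.futures import ThreadPoolExecutor

# Each client (storage node) holds one horizontal fragment of LSalesDB.
CLIENT_FRAGMENTS = {
    "client1": [{"id": 1, "amount": 120}],
    "client2": [{"id": 2, "amount": 430}],
    "client3": [{"id": 3, "amount": 75}],
    "client4": [{"id": 4, "amount": 980}],
}

def run_on_client(name, predicate):
    """Each client evaluates the query against only its own fragment."""
    return [row for row in CLIENT_FRAGMENTS[name] if predicate(row)]

def distributed_query(predicate):
    # Fan the query out to all clients in parallel, then union the parts.
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(lambda n: run_on_client(n, predicate),
                              CLIENT_FRAGMENTS))
    return [row for part in parts for row in part]

big_sales = distributed_query(lambda r: r["amount"] > 100)
```

Because each client scans only a quarter of the data, the partial scans proceed concurrently and the server merely unions the results.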

Interfaces of the Implementation Hadoop on SALES System
Faults are expected and common in a distributed setting: if switches, routers or the local host break down, networks may face partial or complete failure; data may not arrive at a specific stage in time because of network congestion; individual compute nodes may overheat, crash, or suffer hard drive, memory or storage failures; and data may be corrupted or transferred incorrectly. The SQL database (SALES DATA) will be uploaded to the cloud, and the connection to the database uses the following connection string: (Server=majida.database.windows.net;Database=LSalesDB;User Id=majida;Password=majida123456;Connection Timeout=0;). Clicking the start button in the window above shows the main Hadoop window; clicking any complex-query button then displays the results, as shown in figures 7 and 8.

Results and Discussion
By applying the proposed algorithms to the sales system data, we found the following.
• Response Time of Query
Query response time is critical and very important in OLAP and decision support systems. Applying distributed processing algorithms to the sales system, we found that processing large data saved time (the system needs only a few minutes) while preserving quality and data retrieval speed. Executing a query under distributed processing (fragmentation) therefore gives fast response times and speeds up decision making. For example, the fifteen queries executed by each client separately under distributed processing ran faster than the same queries executed without distribution, and executing them under Hadoop was faster still, since Hadoop is built for large-scale data and analysis; Hadoop's speed depends on the Internet connection and the complexity of the query. As shown in table 1 and figure 9.
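The comparison above can be illustrated with a hypothetical timing harness; the delays below are simulated and do not reproduce the paper's measurements, they only show why scanning the fragments in parallel reduces total response time:

```python
# Compare sequential vs parallel execution of the same query workload
# over four horizontal fragments (simulated scan delay per fragment).

import time
from concurrent.futures import ThreadPoolExecutor

def slow_scan(fragment):
    """Stand-in for scanning one data fragment (simulated I/O delay)."""
    time.sleep(0.05)
    return sum(fragment)

data = list(range(400))
fragments = [data[i::4] for i in range(4)]  # four horizontal fragments

# Non-distributed: scan the fragments one after another.
start = time.perf_counter()
sequential = sum(slow_scan(f) for f in fragments)
t_seq = time.perf_counter() - start

# Distributed: scan all four fragments concurrently.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = sum(pool.map(slow_scan, fragments))
t_par = time.perf_counter() - start

assert sequential == parallel  # same answer either way
assert t_par < t_seq           # parallel scan finishes sooner
```

The parallel run takes roughly the time of one fragment scan instead of four, which mirrors the gap between the distributed and non-distributed response times reported in table 1.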

Conclusions
By studying the distributed data file system (fragmentation algorithm) and Hadoop theoretically, and then, in the practical part, implementing the proposed algorithms and discussing the results, we reach the following conclusions and recommendations for future work.
1. The main idea behind designing and implementing this system is to optimize query performance and minimize the response time for answering complex, ad-hoc queries.
2. Hadoop is the best solution for dealing with big data and is preferred over all other methods.
3. Big data storage depends largely on Hadoop.
4. Hadoop is an open-source software project that enables big datasets to be handled in a distributed fashion across clusters of servers. It is designed to scale from a single server up to thousands of machines with a very high degree of fault tolerance.
5. The assessment focused on some aspects of data quality, such as completeness and reliability.
Based on the results of the present work, and on our expectations and viewpoints about the research topics, we propose the following recommendations to make the subject more important, reliable and transparent for future development:
1. Using machine learning together with Hadoop, instead of Hadoop alone as in our current system, to answer complex queries.
2. Using a social media database (such as Facebook or Twitter) instead of video, digital and text data, and implementing the current system on it.
3. Implementing the proposed system using Apache Spark.
4. Implementing the proposed system in a Java environment.
5. Using classification algorithms, such as Support Vector Machines (SVMs) and neural networks, for answering newly entered queries.