Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale

This work reports on the activities aimed at integrating Oracle and Hadoop technologies for the use cases of CERN database services, and in particular on the development of solutions for offloading data and queries from Oracle databases into Hadoop-based systems. The goal and interest of this investigation is to increase scalability and optimize the cost/performance footprint of some of our largest Oracle databases. These concepts have been applied, among others, to build offline copies of the CERN accelerator controls and logging databases. The tested solution makes it possible to run reports on the controls data offloaded to Hadoop without affecting the critical production database, providing both performance benefits and cost reduction for the underlying infrastructure. Other use cases discussed include building hybrid database solutions with Oracle and Hadoop, offering the combined advantages of a mature relational database system and a scalable analytics engine.


Relational databases for HEP and CERN
Relational databases are established technologies for processing workloads that are highly concurrent and transactional in nature and where data consistency is at a premium. In particular, relational database engines provide what has become over the last three decades the standard model for storing and retrieving structured data. Relational databases come with many of the features associated with enterprise-level data processing, including solutions for fault resilience, backup, monitoring and troubleshooting.
Oracle, MySQL and PostgreSQL relational databases are used at CERN for a variety of use cases in high energy physics and for other services in the Organisation. Oracle databases, for example, are deployed as the backend for online control systems [1] and LHC experiments' conditions data [2]. Oracle databases also have an important role in the pipelines for offline processing, including distributed data management [3]. Oracle databases are also key for the functioning of the accelerators at CERN [4]; this notably includes a database system ingesting data at rates in excess of 500 GB/day from a large number of data sources and control systems that are critical for the functioning of the LHC.

Hadoop ecosystem for analytics
The Hadoop ecosystem has emerged in the last decade as one of the leading platforms for large-scale distributed data processing, often referred to as "Big Data". Hadoop started as an open-source project to reproduce the architecture and functionality of systems built internally at Google for their data processing use cases [5]. It has since evolved and greatly benefited from the contributions of many players in the open source community.
The Hadoop ecosystem currently consists of many components and associated tools used for data ingestion and data processing. For the scope of this article we want to stress the importance of HDFS, the Hadoop distributed file system, and YARN, the cluster resource manager. Moreover, tools in the Hadoop ecosystem for storing and querying data with SQL are key to making the link between relational databases and Hadoop. Apache Hive, Apache Impala (incubating) and Apache Spark are the main actors in this area. Data formats [6] also play an important role for storage optimization and performance. Columnar data storage formats like Apache Parquet have an advantage over row-based formats (the formats typically used in transactional databases) for optimizing queries in analytics workloads. Notably, Apache Parquet with the addition of Snappy compression is particularly suited for many of the use cases discussed here. Tools for data transfer and ingestion, such as Apache Sqoop, are also important to complete the data pipeline.

Hybrid databases
With the term hybrid databases, in this article we refer to data processing systems that integrate Hadoop and RDBMS engines. Query execution on hybrid databases can physically happen on one or more of the systems involved. Consider, for example, the typical case of a user or application connected to an Oracle database and issuing a query through the Oracle API. In a hybrid database deployment, that query can potentially access data both from Oracle and from Hadoop; the underlying implementation is transparent to the user, who only needs to operate with one API. This makes it possible to exploit the best features of the two worlds: using Oracle for active data (data that can be updated and on which concurrent transactional processing takes place) and using Hadoop for scalable analytics on the parts of the data that are immutable. In addition, Hadoop typically comes with cost/performance advantages for storing and processing large amounts of data, as it can be deployed on clusters of commodity hardware using open source software. Hybrid database solutions provide a way to bring solutions and APIs from the Hadoop and open source communities into legacy applications written for RDBMS access, without requiring major changes or significant costs for enabling the new features.

Why integrate Oracle and Hadoop: use cases
Several use cases of interest for CERN and the HEP community drive this research. For example, databases that are naturally time-based in their data ingestion and that follow a write-once read-many model are potential candidates to profit from the scalability and improved cost of ownership of the hybrid solutions described here. This is of growing interest, as data volumes are expected to increase in the coming years for many data sources related to the LHC and to the processing of controls data. Notable examples of database applications that fit this description are those that serve as the back-end for controls data of subdetectors and of the accelerators at CERN.
Moreover, several projects are currently under development at CERN, and in HEP in general, to utilize the features of Hadoop and its ecosystem for parallel processing and for advanced algorithms for data processing and machine learning. The techniques described here make it possible to link such new projects to the existing systems and data sources deployed on Oracle databases.

Related work
In this article we report on our tests for building hybrid databases based on integrating Oracle and Hadoop SQL engines. The tested solutions use Oracle components that are available with the Oracle license for the RDBMS (i.e. at no additional licensing cost besides the licensing of the Oracle database server). Commercial solutions for integrating Oracle and Hadoop exist. In particular, a startup company called Gluent Inc. has developed components that allow accessing data stored in Hadoop as Oracle external tables. This has several advantages compared to the solution based on database links and the Oracle ODBC gateway presented in this article. In particular, Gluent provides a "smart connector" capable of pushing column projections, filter predicates, bind variables and join conditions down to Hadoop. Moreover, the Gluent connector has higher performance for data transfer. Overall, the solutions available on the market, and in particular the smart connector provided by Gluent, are of interest to ease the process of creating and maintaining hybrid databases for querying Hadoop from Oracle. The advantages of the commercial solutions include a reduced effort for the configuration phase and increased performance at scale. The extra functionality comes, however, with an additional licensing cost.

Overview
This section covers the technical aspects of integrating Oracle with Hadoop. The concept of building a hybrid system based on Oracle and Hadoop, as described in Figure 1, requires two main elements to be put in place: data integration between Oracle and Hadoop and a SQL offloading mechanism.

Figure 1 Schematic overview of integrating Oracle and Hadoop for data and query offloading
There are two main scenarios to consider when connecting Oracle and Hadoop to build hybrid databases and they depend on the source of the data. In the first case, data comes originally from an Oracle database and needs to be moved to the Hadoop system. Typically data that is immutable will be moved. Another scenario is when data comes into Hadoop directly from external sources and we want to be able to query it from Oracle, for example for integrating it with existing applications that reside in Oracle.
Once data is stored in HDFS, it has to be structured in the form of tables and made available for querying with SQL using one of the SQL engines available in the Hadoop ecosystem. There are multiple choices of SQL engine on top of Hadoop that are applicable to the use cases described here: Apache Hive, Apache Impala (incubating) and Apache Spark SQL. If the offloaded queries require low-latency response times and/or mainly perform basic SQL operations for reporting and aggregation, Apache Impala appears from our tests to be a good choice. On the other hand, for heavier workloads that require complex joins, integration with other systems for ETL, or complex computation (such as machine learning jobs), Spark SQL currently appears to have a significant advantage over competing solutions. The Hive SQL engine should rather be avoided for performance reasons, as it is based on MapReduce and performs considerably slower than Impala and Spark SQL. Irrespective of the chosen SQL engine on Hadoop, configuring the query offloading mechanism is a relatively easy operation.

Offloading data to Hadoop
There are multiple ways to move data from Oracle to Hadoop. A key decision in the process is the choice of software used to implement the data flow between Oracle and Hadoop. Among the solutions currently available, the most relevant are: Apache Sqoop, Apache Flume (which can be configured to read directly from a JDBC source or from Apache Kafka), a custom connector built using JDBC, or the Oracle GoldenGate connector for HDFS, which allows ingesting data with low latency. From the authors' tests and experience with such tools, Apache Sqoop appears to be one of the most powerful and at the same time one of the least complex tools available in this domain. Sqoop copies data from Oracle in batch mode by executing a distributed map-reduce job on the Hadoop side. The job exploits parallelism for performance by establishing multiple connections to the JDBC source and importing data in parallel. Sqoop is flexible in the granularity of data copying, which can be a group of one or more tables or table partitions. Sqoop, however, is not suitable for cases where real-time data ingestion is needed. By nature Sqoop works with bulk ingestion and is well suited as the backend for jobs that run data ingestion at regular time intervals (for example daily or hourly, depending on the nature of the data). Typically this is not a problem, as data analysis is performed on large volumes of immutable data. Hybrid views, which combine recent data in a relational database with immutable (read-only) data in Hadoop, can also be deployed to work around this issue where needed; they are discussed later in this article. Another important aspect to keep in mind when building data ingestion pipelines with Sqoop is that its incremental table import feature supports only newly created rows: rows that were previously imported and later updated will not be imported again.
There is a dedicated connector for Oracle databases available for Sqoop, called "Oraoop", that includes multiple optimizations for data import and export specifically tuned for Oracle. By reading the data sequentially and directly from database storage, Sqoop with Oraoop improves data transfer throughput and reduces the footprint on the source database system. This addresses an important aspect of the data movement operation: the source Oracle instance should not suffer a performance penalty from the data offloading activity. Sqoop is currently used by CERN database services to build and keep up to date a copy of the Oracle database for accelerator controls at CERN in a Hadoop system. This currently amounts to approximately 500 TB of data, growing at a rate of 500 GB per day.
Another important aspect of shipping data to Hadoop is the way data is stored on HDFS. Since HDFS can store any type of file and can nest files in complex directory structures, multiple approaches are possible. Typically the data format decision is based on the expected data access patterns. If there is a need for ad-hoc analytics-type queries, column-oriented file formats like Parquet are a good choice. Parquet, for example, provides fast data access based on column projection and at the same time offers very good storage efficiency compared to row-based formats like CSV, SequenceFile or Avro.
The data partitioning strategy is another important architectural design decision. In Hadoop, data partitioning is done by grouping data files in subdirectories, each directory representing a separate partition. Multiple levels of partitioning can be applied, as partitions can be nested in multiple levels of directories. Very often the partitioning key is time-based: data produced on a given day or hour is stored in the same partition and located in a dedicated directory in HDFS. Such a strategy allows data to be pruned during processing: only the directories belonging to the time range of interest are accessed, which is highly beneficial for query performance.
Finally, once the data has landed on HDFS it has to be mapped into table structures, as a preparation for subsequent access using SQL queries. Common practice here is to use the Hive metastore, as most SQL engines on Hadoop support reading from tables mapped with Hive. Mapping data stored in HDFS directories into Hive metastore tables refers to the operation of storing metadata about a given data set, including the location of the data files, the file format used and information on column names and data types. All this information can later be easily accessed by the SQL engines using the table reference. Apache Sqoop supports importing data directly into Hive tables or storing it in a dedicated HDFS directory (which can later be mapped to partitions of a table in Hive). Regardless of the import mode, data can be stored on HDFS using the Parquet format. An important detail is that one has to make sure that the data landing in Parquet is assigned the desired data type and precision. For example, a typical pitfall when moving data of type "timestamp" from Oracle is that by default it is converted to a "long epoch" type, which means it loses time information finer than one millisecond. The appropriate type conversion can be specified in the copy process to avoid this pitfall.
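As an illustration, the following sketch shows a possible Hive external table stored as Parquet and partitioned by day, together with the mapping of a directory produced by a daily import as a new partition of that table. All table, column and path names are hypothetical and only serve to illustrate the technique:

create external table controls_log (
  variable_id bigint,
  utc_stamp   timestamp,
  value       double
)
partitioned by (log_day string)
stored as parquet
location '/project/controls/controls_log';

-- map a directory written by a daily import as a new partition of the table
alter table controls_log
  add if not exists partition (log_day = '2016-06-08')
  location '/project/controls/controls_log/log_day=2016-06-08';

-- time-range queries then only read the directories (partitions) of interest
select variable_id, avg(value)
from controls_log
where log_day between '2016-06-01' and '2016-06-08'
group by variable_id;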

Offloading SQL to Hadoop
Once data is stored in HDFS and mapped into Hive tables, it can be queried with the relevant Hadoop APIs and in particular with the SQL engines built on top of Hadoop. In this paragraph we cover some of the techniques that can be used to query data stored in Hadoop from the Oracle RDBMS. Specialized connectors are needed for this task, as they have to guarantee the functional and performance aspects of retrieving the results of the SQL queries returned from Hadoop to Oracle. There are additional pitfalls in this process, as there are cases of incompatibility between the SQL implementations of Oracle and of some of the SQL engines on top of Hadoop. This implies that in some cases SQL statements need to be modified to make them compatible with SQL on Hadoop.
The techniques to query Hadoop from Oracle described in this paragraph are based on Oracle Heterogeneous Services. Heterogeneous Services provide a mechanism to connect to and query any ODBC-compliant source from the Oracle RDBMS. This is done using the Oracle ODBC gateway, a component that is part of the Oracle database bundle.

Figure 2 Query offloading from Oracle to Hadoop based on ODBC gateway
Once configured and registered to an Oracle listener, the ODBC gateway is able to translate requests issued by Oracle via the proprietary Oracle*Net protocol into ODBC calls. This applies to both SQL executions and to the transfer of the result set over the network.
Typically the Hadoop SQL engines do not accept ODBC calls directly; rather, they support the Thrift protocol. Therefore an extra driver has to be installed on the ODBC gateway to translate ODBC calls into Thrift/Hive invocations. Such drivers are available from SQL engine software providers.
From the Oracle RDBMS side, the integration via the ODBC gateway is mapped logically into the database namespace using the database link infrastructure; this is similar to what happens in distributed configurations between Oracle databases and provides a straightforward integration with existing Oracle applications. This way any table mapped in the Hive metastore can be queried from Oracle applications using standard database link syntax.
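As a minimal sketch, a database link to the gateway and a query against a Hive-mapped table could look as follows. The link name, credentials, gateway service alias and table name are hypothetical and shown only for illustration:

create database link hadoop
  connect to "hiveuser" identified by "hivepassword"
  using 'hive_gw';

-- a table or view defined in the Hive metastore queried with
-- the standard database link syntax
select variable_id, utc_stamp, value
from "controls_log"@hadoop
where variable_id = 42;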

Limitations of using the Oracle ODBC gateway
There are some important limitations to the approach for querying Hadoop from Oracle described so far. The first one is the implicit data type conversion mismatch: not all Oracle data types have equivalents on the target technology (e.g. strings in Hive are seen as LONG on the Oracle side, which limits their usability). A workaround to this problem consists in creating views in Hive for each table.
Views are standard objects in database systems; they define a SQL query that is evaluated when the view is accessed. The view can contain the explicit data type conversions that provide the desired mapping from Hadoop to Oracle data types. For example, strings in Hive can be explicitly converted to VARCHAR2 in Oracle using this method. An example of creating a Hive view for explicit data type conversion:

create view my_hadoop_view_with_varchar2 as
select variable_id, utc_stamp, value,
       cast(state as varchar(4000)) state
from my_hadoop_table

Another limitation concerns data transfer performance through the ODBC gateway. When a query offloaded to Hadoop needs to return a large result set (for example millions of rows), it can take a significant amount of time to bring the results back into the Oracle engine. In some cases this slow operation may offset the gains of running the query in Hadoop in the first place. The performance issue has its origin in the multiple hops and conversions that happen on the path between Hadoop and Oracle: Thrift to ODBC, ODBC to Oracle*Net. Some of the components performing these conversions, notably the ODBC gateway, are single-threaded and therefore not capable of delivering scalable throughput. Typical transfer speeds through the Oracle ODBC gateway observed in our tests were not higher than 20000 rows/second; the actual throughput limitations depend on the application and on the row size. In conclusion, the method described here for querying Hadoop from Oracle appears mostly relevant for reports and analytics-type queries, where the final result set is of relatively small size.
Another important limitation of the SQL offloading method using Oracle Heterogeneous Services and the ODBC gateway is the lack of control over the offloaded query text. As previously mentioned, the interface used for running SQL remotely from Oracle makes use of Oracle database links. With this approach, it is the Oracle SQL optimizer that chooses which part of the query is executed remotely (offloaded) to Hadoop and which is executed locally (by the Oracle RDBMS). The simple case where the SQL execution requires pushing predicates into remote tables in Hadoop worked as expected in our tests. However, when running aggregations on remote tables we observed that Oracle would retrieve the entire result set from Hadoop into the Oracle engine and run the aggregation locally. As discussed earlier, this generates performance problems due to the slow data transfer rate over the ODBC gateway and, overall, does not make good use of the scalability of the Hadoop engine. In the next section we discuss ways to work around this problem.
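For illustration, an aggregation of the following kind (reusing the hypothetical table and database link names introduced above) exemplifies the pattern for which Oracle was observed to retrieve all rows from Hadoop and aggregate them locally:

-- the group-by below may be executed locally by Oracle after fetching
-- all rows from Hadoop through the ODBC gateway
select variable_id, count(*)
from "controls_log"@hadoop
group by variable_id;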

Manually pushing SQL from Oracle to Hadoop for remote execution
In this section we discuss two techniques that can be applied to gain more control over the execution of SQL offloaded to Hadoop and to provide performance improvements in several situations of interest, in particular for complex queries performing aggregations.
The first technique is to use dedicated views on top of Hive tables in Hadoop. By using views, one can package complex queries to be executed on the target Hadoop system. Such views can be made visible to Oracle over the standard database link syntax, using the technique discussed in the previous section. As a result, this technique makes it possible to encapsulate complex Hadoop-side processing inside an external view, effectively working around the performance problems that arise when the Oracle optimizer does not push operations such as filter predicates and aggregations down to the Hadoop SQL engines.
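A possible sketch, reusing the illustrative table and database link names from the previous sections: an aggregation is packaged in a Hive view, so that only the (small) aggregated result set crosses the gateway when the view is queried from Oracle:

-- defined on the Hadoop side (Hive)
create view daily_variable_stats as
select variable_id, log_day, count(*) as n_samples, avg(value) as avg_value
from controls_log
group by variable_id, log_day;

-- queried from the Oracle side over the database link
select * from "daily_variable_stats"@hadoop;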
There are drawbacks to this technique, the main one being that it requires creating a view in Hadoop (Hive) for each different type of query that needs to be executed; this ultimately reduces the flexibility of the approach.
A different and more flexible solution for pushing generic SQL into Hadoop can be achieved by using the PL/SQL package dbms_hs_passthrough. This package is part of the Oracle database components and allows SQL statements to be executed on remote systems (in this case the remote system is Hadoop, accessed via the database link). The package returns an Oracle cursor which can be accessed programmatically to fetch the result set of the query being executed. An advantage of using dbms_hs_passthrough is that SQL statements can be constructed dynamically. Moreover, this method naturally enables filter push-down and remote execution of aggregate functions without the need to create any objects on the Hadoop side. This technique comes, however, with the extra effort of implementing PL/SQL wrapper code for each type of query, in particular the code to retrieve the result set from the returned cursor. For example, additional coding is required to handle the case where the results of a SQL query performed on the Hadoop side need to be joined with tables or views in Oracle. This case has been tested as part of this research and our conclusion is that it can be implemented by wrapping the results from the cursor obtained by dbms_hs_passthrough into an Oracle "pipelined table function".
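A minimal PL/SQL sketch, again with the illustrative object names and the hypothetical database link used above: the query text is passed verbatim to the Hadoop SQL engine and the rows of the result set are fetched from the returned cursor:

declare
  c     pls_integer;
  nrows pls_integer;
  v_id  number;
  v_avg number;
begin
  -- open a cursor on the remote (Hadoop) system and parse the query there
  c := dbms_hs_passthrough.open_cursor@hadoop;
  dbms_hs_passthrough.parse@hadoop(c,
    'select variable_id, avg(value) from controls_log group by variable_id');
  loop
    -- fetch the result set row by row through the gateway
    nrows := dbms_hs_passthrough.fetch_row@hadoop(c);
    exit when nrows = 0;
    dbms_hs_passthrough.get_value@hadoop(c, 1, v_id);
    dbms_hs_passthrough.get_value@hadoop(c, 2, v_avg);
    dbms_output.put_line(v_id || ': ' || v_avg);
  end loop;
  dbms_hs_passthrough.close_cursor@hadoop(c);
end;
/

In a real application the loop body would typically feed the rows into a pipelined table function, as mentioned above, rather than print them.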

Hybrid views to map Hadoop data into Oracle
The techniques described in the previous section make it possible to query tables defined in Hadoop using SQL originating from Oracle. Building on that, it is possible to create views in Oracle that combine data stored natively in the Oracle database with data that is stored remotely in Hadoop. We refer to such structures as "hybrid views"; an example of the SQL defining one is sketched below. Hybrid views glue together data in Oracle and Hadoop and aim at providing seamless integration between the two systems (with some limitations, as discussed further below). The Oracle engine is able to push selection predicates into the view and redirect the query to retrieve data from Oracle or from Hadoop as appropriate. This works well, for example, for tables that have a time-based partitioning structure. The problem that this approach solves is making both the current data, stored in Oracle, and the archived data, stored in Hadoop, available from the Oracle APIs, seamlessly, as if the system were a single database. Note, however, that the limitations discussed in the previous section regarding SQL offloading also apply to hybrid views.
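A minimal sketch of such a view, assuming a local Oracle table holding the current data and an archive table mapped in Hive, reached through the hypothetical database link used in the previous sections (all names are illustrative):

create or replace view controls_log_all as
  select variable_id, utc_stamp, value
  from controls_log_current                 -- recent, updatable data in Oracle
  union all
  select variable_id, utc_stamp, value
  from "controls_log_archive"@hadoop;       -- immutable, archived data in Hadoop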
A characteristic of hybrid views is that data is read-write for the part that resides in Oracle and read-only for the part that resides in Hadoop. It is up to the application to know which parts can be updated through the Oracle API. This is often not a problem for systems that log data into time-based partitions, where only the current partition is written at any given time.
A potential limitation of this approach concerns data movement. It is up to the application architect to implement a system to move (offload) data from Oracle into Hadoop and keep the hybrid view current. In the example of time-based data, this means periodic data movement, as data partitions in Oracle change their status from current to archive and can be shipped to Hadoop. Tools such as Sqoop, described earlier, can be used to physically move the data.
A third point to consider is that the Oracle optimizer will not push certain types of operations, notably aggregations, to the Hadoop-based SQL engine. It will rather fetch all the data from Hadoop and perform the aggregation locally in Oracle. This can have important implications for performance. It is possible to work around this by explicitly pushing query execution into the Hadoop engine, for example using Oracle's PL/SQL package dbms_hs_passthrough, as described earlier in the article.

Querying Oracle from Spark SQL
Apache Spark is currently one of the most popular frameworks for distributed data analysis. In particular, Spark allows users and developers to utilize several specialized libraries for data processing, including APIs for machine learning at scale and mature functionality to query very large data sets using SQL. Examples of this type of activity include long-running reports and workloads that require querying a large sample of the data. Such use cases are currently of interest, for example, for the community of CERN accelerator controls and for ATLAS Distributed Data Management. Data coming from relational databases can be moved to the Hadoop/Spark clusters with bulk loads/ingestion using tools like Sqoop, as discussed in section 3.1. However, it is often the case that a small fraction of the data of interest coming from relational databases is updated frequently and does not fit well into the bulk load scheme. This can be the case, for example, for metadata in several applications of interest for the logging of controls data. It is important in those cases to be able to query the up-to-date data in the relational database from the Hadoop platform. This can be done using the Oracle JDBC driver as a data source for Spark. Here is an example:

spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW my_mapped_oracle_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:oracle:thin:@MYORADB/orcl.cern.ch',
  driver 'oracle.jdbc.driver.OracleDriver',
  user 'myuser',
  password 'mypassword',
  dbtable '(select id, payload from my_oracle_table) df'
)""")
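Once mapped, the view can be combined with Hive tables directly in Spark SQL, for example with a statement of the following kind issued through spark.sql() (the Hive table name is the illustrative one used earlier in this article):

select h.variable_id, m.payload, avg(h.value) as avg_value
from controls_log h
join my_mapped_oracle_table m
  on h.variable_id = m.id
group by h.variable_id, m.payload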

Conclusions
In this article we report on techniques and methods for integrating Oracle and Hadoop that are of interest for CERN and for HEP database applications in general. The Hadoop and Spark ecosystem is rapidly growing in adoption as a solution for scalable processing, analytics and machine learning, while Oracle still has a large installed base for critical applications related to LHC and experiment data processing. In this context, the methods and tools discussed in this article are of interest when architecting systems that can take advantage of the best features of both systems. Techniques for querying Hadoop from Oracle and Oracle from Hadoop have been discussed, together with their limitations and the use cases where they can add value to end-user applications. The techniques for data movement discussed in this article also make it possible to write scalable applications in Hadoop that analyse data originating from Oracle databases.