The Building Data Genome Directory – An open, comprehensive data sharing platform for building performance research

The building sector plays a crucial role in the worldwide decarbonization effort, accounting for significant portions of energy consumption and environmental effects. However, the scarcity of open data sources is a continuous challenge for built environment researchers and practitioners. Although several efforts have been made to consolidate existing open datasets, no database currently offers a comprehensive collection of building data types with all subcategories and time granularities (e.g., year, month, and sub-hour). This paper presents the Building Data Genome Directory, an open data-sharing platform serving as a one-stop shop for the data necessary for vital categories of building energy research. The data directory is an online portal (buildingdatadirectory.org/) that allows filtering and discovering valuable datasets. The directory covers meter, building-level, and aggregated community-level data at the spatial scale and year-to-minute level at the temporal scale. The datasets were consolidated from a comprehensive exploration of sources, including governments, research institutes, and online energy dashboards. The results of this effort include the aggregation of 60 datasets pertaining to building energy ontologies, building energy models, building energy and water data, electric vehicle data, weather data, building information data, text-mining-based research data, image data of buildings, fault detection diagnosis data and occupant data. A crowdsourcing mechanism in the platform allows users to submit datasets they suggest for inclusion by filling out an online form. This directory can fuel research and applications on building energy efficiency, which is an essential step toward addressing the world’s energy and environmental challenges.


Introduction
The rise of artificial intelligence as a tool for built environment applications has the potential to impact several industries significantly. However, data availability in the built environment domain remains a critical bottleneck due to privacy concerns and acquisition costs [1]. Open data sources are essential for understanding energy consumption patterns, identifying areas for improvement, and testing energy-saving strategies, especially in the absence of in situ measurements. Yet, access to open data sources in the built environment domain lags behind other communities [2], posing limitations for researchers and practitioners in developing effective energy-saving solutions [3]. In addition to limited accessibility, available open datasets are often dispersed and require labor-intensive and time-consuming collation due to varying formats and sources [4]. Efforts have been made to aggregate open datasets and share them through platforms or directories such as the Building Performance Database (BPD) [5], the Building Data Genome (BDG) projects [6,7], and the Directory of Buildings Energy Consumption Datasets (DBECD) [8]. However, these projects are limited by a narrow diversity of data types, a lack of user contributions, and missing data.
This paper outlines the development of a comprehensive data-sharing platform for building performance research. This effort is achieved by creating a data directory that is publicly available and includes functions for filtering, visualization, and uploading new data sets. The Building Data Genome Directory is a lightweight web app that links to a wide range of open datasets, offering users easy access to comprehensive coverage of relevant information. In subsequent sections, the paper introduces the data sources, data category definitions, reasons for inclusion, critical functions of the web app, and some application cases.

Data sources
The directory focuses on collecting information about open building performance datasets that are widely dispersed and fragmented, which conventionally would require a rigorous data collection process. Metadata for the directory was gathered from various open data sources, including government disclosure programs, research projects, institutes, and publicly available dashboards. Details on each of these data source categories are discussed in the following subsections. The directory data sources are divided according to category and type of data based on the format (e.g., tabular, image) and process of the system that created the data (e.g., HVAC, occupants, sensors). Figure 1 shows an overview of the data set categories, which will be outlined in the following subsections.

Government disclosure data
Data from government disclosure programs is a significant source of built environment data. One example is Local Law 84 (LL84) of New York City (NYC) in the United States, which requires building owners to disclose their energy and water consumption data annually through benchmarking [9]. This directive has led to the publication of the Energy and Water Data Disclosure dataset for Local Law 84 by the NYC government. These city-level datasets can contain many samples, with some featuring tens of thousands of buildings, although they may have coarse-grained time intervals of a year or a month. To collect these datasets, a comprehensive review of relevant literature and an examination of laws pertaining to data disclosure were conducted [1]. Open data portals provided by city governments [10], such as the NYC open data portal (https://opendata.cityofnewyork.us/), were also browsed to gather available datasets, ensuring the comprehensiveness of the data directory.

Open research data
Research institutes and organizations have published various datasets for building performance research. Some datasets are available on websites, such as the Building Data Genome dataset on Kaggle [6] or the 3D city model of Singapore public housing buildings on GitHub [11]. Other datasets are published through journals, with Scientific Data being a significant venue. A recent review has also listed open-source datasets for building energy demand [2]. These datasets typically provide detailed information about individual buildings but may not have large numbers of samples (generally fewer than 5,000). A common differentiator of these types of data sets is that the time-series frequency may be higher, sometimes even at the minute level, offering a more granular view of a building's energy usage. Some datasets also provide detailed information about building characteristics, solar installations [12], morphological indicators [13], or sensor locations and building structure [14]. Accessing and leveraging these datasets allows researchers to gain comprehensive insights into individual buildings and their energy usage. To collect these datasets, relevant reviews and research papers were examined, including platforms that provide access to datasets referenced in articles,

Data collected from open, online dashboards
In response to the growing emphasis on net-zero and sustainability goals in the higher education sector, many educational institutes and universities, such as the University of California, Berkeley, Cornell University, and Princeton University, have public energy management dashboards that provide access to energy usage data for further study and analysis. For these datasets, a data acquisition pipeline can be built using scripts to automate the process of extraction from these dashboards, enabling batch downloads of performance data from thousands of buildings. The directory includes several datasets that were retrieved from these types of public web-based energy management dashboards. For many of these dashboards, the API of the data source can usually be found using built-in web browser developer tools. Once the data API is identified, an automated process can be configured with the required data parameters, such as building ID and specific time period, to enable batch downloading of performance data from a web-based dashboard.
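The batch-download configuration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the directory: the endpoint URL, the parameter names (building_id, start, end), and the helper build_request_urls are hypothetical, since each dashboard exposes its own API and parameters.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical endpoint; a real one would be identified with the
# browser developer tools and will differ for every dashboard.
BASE_URL = "https://dashboard.example.edu/api/meter-data"


def build_request_urls(building_ids, start, end, base_url=BASE_URL):
    """Return one request URL per building for the given time period.

    The query parameter names are assumptions made for illustration.
    """
    urls = []
    for building_id in building_ids:
        query = urlencode({
            "building_id": building_id,
            "start": start.isoformat(),
            "end": end.isoformat(),
        })
        urls.append(f"{base_url}?{query}")
    return urls


# Build the request list for two placeholder building IDs and one year.
urls = build_request_urls(["B001", "B002"], date(2022, 1, 1), date(2022, 12, 31))
for url in urls:
    print(url)
```

Each generated URL could then be fetched in a loop and the responses saved to disk. Building the request list separately from the download step keeps the parameters easy to inspect before any traffic is sent to the dashboard.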

Overview of the directory interface
The Building Data Genome Directory can be found online at buildingdatadirectory.org/. The interface comprises a main page, referred to as the Meta Directory, which provides an overview of all available datasets, and several sub-pages presenting datasets by type. The Meta Directory page introduces the Building Data Genome Directory and outlines the scope of the collected datasets. As a web app, it has filtering, visualization, and uploading functions for the datasets. Datasets pertaining to buildings, such as Building Energy and Water and Building Information, provide geospatial granularity levels that correspond to individual buildings or, at the very least, communities, instead of the aggregated data of an entire city. The Meta Directory includes a schematic diagram showcasing the various datasets available in the Building Data Genome Directory, as shown in Figure 1. Each black label in the diagram represents a specific data type and has a corresponding subpage, with its link located in the left column of the web page. The scope description for these types and the representative datasets are presented in Table 1. The Add New Dataset uploading function is at the bottom of the left-hand column. Users must fill in the Dataset Name, URL, and Dataset Type items to submit a possible contribution to the directory. The datasets submitted by the users will be stored and displayed at the bottom of the Meta Directory page, and they will be added to the directory after undergoing a review process.

Table 1. Categories of the data in the directory with short descriptions and an example representative dataset of each type.
Category | Description | Example dataset
Fault Detection Diagnosis Data | Ground-truth and simulated datasets for anomalies in the built environment and building systems | Large-scale Energy Anomaly Detection (LEAD) Dataset [21]
Occupant Data | The thermal comfort data of occupants collected from experiments | Cozie smartwatch application [22]

The category with the highest number of data sets is Building Energy and Water, which currently includes over 30 datasets. A metadata table that provides essential information about the datasets is displayed on this page, including disclosure status (e.g., data opening level, license availability, organization) and information on the building samples. Figure 2 shows the filtering and visualization functions. The filtering functions enable users to select datasets by location, time interval, and building type. The visualization functions include bar plots with adjustable axes to visualize numerical information, bubble plots to display sample and variable numbers with the size of circles denoting sample sizes and variable quantities, and heatmaps to visualize variable categories.
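The filtering by location, time interval, and building type described above can be illustrated with a short sketch. It is not the web app's implementation: the DatasetRecord fields, the filter_datasets helper, and the three records are made-up placeholders chosen only to show the selection logic.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """One row of a metadata table (field names are assumptions)."""
    name: str
    location: str
    time_interval: str
    building_type: str


def filter_datasets(records, location=None, time_interval=None, building_type=None):
    """Keep records matching every criterion that is not None."""
    selected = []
    for record in records:
        if location is not None and record.location != location:
            continue
        if time_interval is not None and record.time_interval != time_interval:
            continue
        if building_type is not None and record.building_type != building_type:
            continue
        selected.append(record)
    return selected


# Placeholder records, not actual entries of the directory.
records = [
    DatasetRecord("Example dataset A", "United States", "hourly", "office"),
    DatasetRecord("Example dataset B", "United States", "yearly", "residential"),
    DatasetRecord("Example dataset C", "Singapore", "hourly", "residential"),
]

hourly_us = filter_datasets(records, location="United States", time_interval="hourly")
print([record.name for record in hourly_us])
```

Leaving a criterion as None means it is not applied, so the same function covers filtering on one, two, or all three attributes.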

Conclusion and future work
The Building Data Genome Directory is a potentially valuable resource for building energy research, providing comprehensive datasets and web app functions for filtering, visualization, and uploading. This directory can be a starting point for researchers and analysts who want to explore applicable open data sets for their studies. Numerous research endeavors are anticipated to emerge as branches stemming from this directory. As highlighted by Jin et al. [1], the availability of comprehensive datasets will significantly expedite research in building energy, encompassing areas such as building energy management, grid management, and socio-economic analysis. The team is developing a sub-branch within the Building Data Genome Directory focusing on time-series feature analysis utilizing energy consumption data.

Future expansion and data quality considerations
Future work can optimize the directory by improving functions such as allowing brief dataset descriptions during uploading and incorporating semantic searching capabilities. Enhancing search capabilities for different data types, such as geographic location, would also improve usability. The directory could also consider unconventional data sources, such as relevant data on buildings scraped from property websites [23] and volunteered geographic information such as OpenStreetMap [24] in locations that have data of reliable quality. Finally, to strengthen the crowdsourcing aspect of our platform, we plan to implement functionality that allows users to flag erroneous information and allows trusted users to edit the database. Building a community around the directory would foster user communication and optimize the web app. Collecting feedback and insights through discussions and forums would provide valuable inputs for enhancing features and usability. By actively engaging with users, the directory can continue to evolve and serve as a valuable resource for building energy researchers b

Figure 1. Schematic of the categories of datasets included in the Building Data Genome Directory.

Figure 2. Building Data Genome Directory interface showcasing the filtering and visualization functions.