Data Standards & Pipelines

What are Data Standards and Pipelines?

The GDR data standards and pipelines create consistency in the formatting and contents of like datasets, reducing preprocessing requirements and ensuring that each dataset provides adequate information. Data standards and pipelines differ for each data type in order to best fit that data type's formatting, metadata, and other requirements.

Existing Data Standards

The GDR's drilling data pipeline automatically converts drilling data from native Pason or RigCLOUD output format into a standardized CSV format with standardized column names and units.
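The kind of transformation the drilling pipeline performs can be sketched as a column rename plus a unit conversion. The native and standardized column names and the target units below are hypothetical examples, not the pipeline's actual schema:

```python
import csv
import io

# Hypothetical mapping from a native drilling-export header to
# standardized column names; the real names and units used by the
# GDR pipeline are assumptions here, for illustration only.
COLUMN_MAP = {
    "Hole Depth (ft)": "depth_m",
    "Rate of Penetration (ft/hr)": "rop_m_per_hr",
}
FT_TO_M = 0.3048  # exact international foot-to-meter factor

def standardize_drilling_csv(raw_csv: str) -> str:
    """Rename columns and convert feet-based values to meters."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({
            std: f"{float(row[native]) * FT_TO_M:.4f}"
            for native, std in COLUMN_MAP.items()
        })
    return out.getvalue()
```

Because every dataset of the same type comes out with identical headers and units, downstream tools can combine files without per-file inspection.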

The GDR's DAS data pipeline automatically converts DAS data from nonstandardized SEG-Y formats into a standardized HDF5 format, based on PRODML and the IRIS DAS RCN's DAS metadata standard.

The GDR's geospatial data pipeline focuses on metadata rather than the data itself. It automatically recognizes geospatial data files by their file extensions and requires additional, essential metadata for geospatial datasets.
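Extension-based recognition of the kind described above can be sketched in a few lines. The extension list and the metadata field names below are assumptions chosen for illustration, not the GDR's actual recognition rules:

```python
from pathlib import Path

# Common geospatial file extensions; the GDR's actual recognition
# list is an assumption here, for illustration only.
GEOSPATIAL_EXTENSIONS = {".shp", ".geojson", ".kml", ".kmz", ".gpkg", ".tif", ".tiff"}

def is_geospatial(filename: str) -> bool:
    """Flag a file as geospatial based on its extension alone."""
    return Path(filename).suffix.lower() in GEOSPATIAL_EXTENSIONS

def required_metadata_fields(filename: str) -> list:
    """Return extra metadata fields to request for a geospatial upload.

    The field names returned are hypothetical examples of the kind of
    essential metadata a geospatial dataset needs.
    """
    if not is_geospatial(filename):
        return []
    return ["coordinate_reference_system", "bounding_box", "spatial_resolution"]
```

Recognizing files by extension keeps the check cheap and format-agnostic, which is why the pipeline can run automatically at upload time.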

Coming Soon

When the GDR is working on implementing a new data standard & pipeline, it will be listed here. If you have suggestions for new data standards and pipelines, please send them to GDR Help.

Tips to Ensure Your Data are Standardized

To ensure your data are standardized, follow these tips:

  • Do not enclose your data in zipped directories. Instead, upload files with their original file extensions to the GDR submission form or data lake.
  • Do not modify the format of your files from the original export format before uploading to the GDR.
  • Maximize the metadata you provide in the GDR submission form or via a README file. More is always better.
  • Check out the more specific tips on each of the individual data standard pages linked under “Existing Data Standards” above.

If you think your data were not standardized due to an error, please contact GDR Help. We are happy to assist and are constantly looking to improve our data standards and pipelines.

Why Standardize?

High-quality data are key both to producing high-quality machine learning results and to applying machine learning to real-world problems. Machine learning is frequently exploratory in nature, meaning that data curation is often an iterative process throughout the life of a project. Modular data pipelines and standardization of processes and practices help to streamline this process.

Accordingly, any steps that lessen data curation requirements help improve geothermal machine learning and data science outcomes. Data standardization puts similar datasets into a common format, reducing the time researchers spend reformatting and combining data. This shortens the overall time required for adequate data curation, both lowering the cost of machine learning projects and leaving more time for exploring different machine learning experiments and properly interpreting results. Standards also allow users to incorporate many datasets efficiently, and potentially automatically, into machine learning projects, rather than focusing on just one or a few manually parsed datasets.

Automated Data Pipelines

Data pipelines have been implemented for select high-value data sets to automate the standardization process. The GDR's data pipelines automatically recognize certain types of datasets and convert them into a standardized format while preserving the original data file. This shift takes the burden of data standardization off users and project teams, freeing project resources for research and development activities and increasing the availability of standardized geothermal data through the GDR. Each data pipeline is accompanied by a data standard and a set of recommendations for its data type, advising data collection for maximum usability in future research.

The National Geothermal Data System (NGDS) provides standardized templates in Excel and XML formats for users to input their data into. The NGDS Content Models were developed with the intent of being all-inclusive, meaning that there is a column for every possible measurement associated with a particular data type. The list below describes the existing NGDS Content Models.

  • Abandoned Mines
  • Active Fault / Quaternary Fault
  • Aqueous Chemistry
  • Borehole Lithology Intercepts
  • Borehole Lithology Interval Feature
  • Borehole Temperature Observation
  • Contour Lines
  • Direct Use Feature
  • Drill Stem Test Observations (deprecated)
  • Fault Feature / Shear Displacement Structure
  • Fluid Flux Injection and Disposal
  • Geologic Contact Feature
  • Geologic Fault Feature / Shear Displacement Structure
  • Geologic Reservoir
  • Geologic Units
  • Geothermal Area
  • Geothermal Fluid Production (deprecated)
  • Geothermal Metadata Compilation
  • Geothermal Power Plant Facility
  • Gravity Stations
  • Heat Flow
  • Heat Pump Facility
  • Hydraulic Properties
  • Mineral Recovery Brines
  • Physical Sample
  • Powell and Cumming Geothermometry
  • Power Plant Production
  • Radiogenic Heat Production
  • Rock Chemistry
  • Seismic Event Hypocenter
  • Thermal Conductivity Observation
  • Thermal/Hot Spring Feature
  • Volcanic Vents
  • Well Fluid Production
  • Well Header Observation
  • Well Log Observation
  • Well Tests

Learn More or Submit Feedback

If you want to learn more about the importance of data standardization for data science in geothermal, check out this Stanford Geothermal Workshop paper: Taverna, N., Weers, J., Huggins, J., Anderson, A., Frone, Z. “Improving the Quality of Geothermal Data Through Data Standards and Pipelines Within the Geothermal Data Repository (GDR).” Proceedings of the 48th Workshop on Geothermal Reservoir Engineering, Stanford Geothermal Program (2023).


The GDR team is continuously working to align its efforts with the needs of the geothermal community, and we would like to invite you to provide your feedback here: GDRHelp@ee.doe.gov.