I work at a data science startup, but not as a data scientist. My specialty is business development, which means I regularly rub shoulders with science and engineering PhDs and frequently test their patience with some pretty basic questions about machine learning. A recent ‘aha moment’ for me came when I learned about feature engineering and the massive value it adds to predicting oil & gas exploration and planning outcomes. At recent events I’ve heard other data science newbies probing to better understand this concept, so I thought sharing what I learned would be helpful.
Machine learning (ML) is a method of data analysis that uses algorithms, or models, to learn patterns in datasets and then predict similar patterns in new ones. For example: Well A produces X with completion design 1 and Well B produces Y with completion design 1, so a model can estimate that Well C will produce Z with completion design 1. Neat, right?
The accuracy of ML models depends on the data that is fed to them, and most E&Ps have tons of data. How do we find the data needed to solve a specific problem? Even more, how do we identify meaningful relationships within that data? Feature engineering is the use of domain expertise to create an enriched dataset for a specific problem by deriving new variables, or features, from the raw data. This enriched input improves ML model performance and accuracy. Done by hand, feature engineering is difficult and tedious, and it can take weeks or months of effort. When automated, it adds far more value: it can surface relationships a person would be unlikely to spot, and it speeds up the creation of an ML-ready enriched dataset.
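To make the idea of "deriving features from raw data" concrete, here is a minimal sketch in Python. The column names (`lateral_length_ft`, `proppant_lb`, `fluid_bbl`) and the numbers are made up for illustration; they are assumptions, not anything from a real dataset or from OAG's pipeline. The point is only that a simple piece of domain knowledge (completion intensity matters more than raw totals) becomes new columns.

```python
# Hypothetical raw completion data: one dict per well (illustrative values only).
raw_wells = [
    {"well": "A", "lateral_length_ft": 10000.0, "proppant_lb": 20_000_000.0, "fluid_bbl": 400_000.0},
    {"well": "B", "lateral_length_ft": 5000.0, "proppant_lb": 7_500_000.0, "fluid_bbl": 150_000.0},
]


def add_intensity_features(row):
    """Return a copy of the row with two engineered features added.

    Normalizing totals by lateral length lets wells of different
    lengths be compared on the same footing.
    """
    enriched = dict(row)  # copy so the raw data is left untouched
    enriched["proppant_lb_per_ft"] = row["proppant_lb"] / row["lateral_length_ft"]
    enriched["fluid_bbl_per_ft"] = row["fluid_bbl"] / row["lateral_length_ft"]
    return enriched


# The enriched dataset is the raw data plus the new feature columns.
enriched_wells = [add_intensity_features(row) for row in raw_wells]

for row in enriched_wells:
    print(row["well"], row["proppant_lb_per_ft"], row["fluid_bbl_per_ft"])
```

Here Well A looks like the bigger job in raw totals, but the engineered columns show how much proppant and fluid each well received per foot of lateral, which is usually the more useful comparison for a model.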
Now for the ‘aha moment’: you can actually code oilfield principles into an engineered feature to isolate and highlight key information in the raw data. This helps ML models focus on what’s most important to the problem being solved. Example: For predicting pre-drill EURs of tightly spaced wells, OAG uses location and time to determine which wells are interfering with each other and computes a score based on distances, time, and completion attributes. This score, or feature, is added as a new column to the enriched dataset. Three raw inputs were used to create this one feature: well location, well initial production date, and completion design. The result is more accurate decline curves from the ML model, because the engineered feature is based on the known physics of well interference.
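A rough sketch of what such an interference feature could look like in Python follows. This is not OAG's actual algorithm, which the post does not describe in detail. The `Well` fields, the linear weighting, and the cutoffs (`max_distance_ft`, `max_gap_days`, `reference_proppant`) are all assumptions chosen only to show how location, initial production date, and completion design can be combined into one score.

```python
from dataclasses import dataclass
from datetime import date
from math import hypot


@dataclass
class Well:
    name: str
    x_ft: float                 # surface/lateral location, hypothetical local grid
    y_ft: float
    first_production: date      # initial production date
    proppant_lb_per_ft: float   # stand-in for "completion design"


def interference_score(target, others, max_distance_ft=1500.0,
                       max_gap_days=730.0, reference_proppant=2000.0):
    """Sum a made-up interference weight over neighboring wells.

    A neighbor contributes more when it is closer, came online nearer
    in time, and had a larger completion. All cutoffs are illustrative.
    """
    score = 0.0
    for other in others:
        if other.name == target.name:
            continue  # a well does not interfere with itself
        distance = hypot(target.x_ft - other.x_ft, target.y_ft - other.y_ft)
        gap_days = abs((target.first_production - other.first_production).days)
        if distance >= max_distance_ft or gap_days >= max_gap_days:
            continue  # too far apart in space or time to count
        distance_weight = 1.0 - distance / max_distance_ft
        time_weight = 1.0 - gap_days / max_gap_days
        completion_weight = other.proppant_lb_per_ft / reference_proppant
        score += distance_weight * time_weight * completion_weight
    return score


wells = [
    Well("A", 0.0, 0.0, date(2019, 1, 1), 2000.0),
    Well("B", 750.0, 0.0, date(2019, 1, 1), 2000.0),
    Well("C", 5000.0, 0.0, date(2019, 1, 1), 2000.0),
]

# The score becomes a new column in the enriched dataset.
enriched = [
    {"well": w.name, "interference_score": interference_score(w, wells)}
    for w in wells
]
for row in enriched:
    print(row)
```

In this toy setup Wells A and B sit 750 ft apart and each gets a nonzero score from the other, while Well C is far enough away that its score is zero. The ML model never has to rediscover that spacing matters; the feature hands it that information directly.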
Before my ‘aha moment’ I was calling these “physics-based ML algorithms,” and engineers (especially) pushed back, asking how that was possible. Now I understand why. The accurate language is a bit more detailed: physics-based or geometry-based feature generation algorithms that enable ML algorithms (e.g. the Well Spacing Model or the Predict Missing Core Sample Model) to function at optimal levels.