Why are large-scale industrial data solutions so hard to build?


Today, every organization uses data to improve its products and processes, yet many struggle to scale their data solutions. In other words, building a demo is easy; turning it into a production-grade solution is very hard. Because Prescient works exclusively on large-scale data solutions, we have encountered many of these challenges along the way, and in this article we describe the main ones.

Data Disparity

Industrial data is highly disparate. It comes from different sources such as equipment, sensors, controllers, and historians, and from different generations of hardware, each of which can have its own data interfaces and protocols. Each protocol requires a different data connector, and implementing and supporting these connectors is very time-consuming.
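To make this concrete, here is a minimal sketch of how a pipeline can hide protocol differences behind one normalized reading format. The `Connector` interface, the `Reading` record, and the `CsvHistorianConnector` example are assumptions made for illustration, not any specific product's API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator
import csv

@dataclass
class Reading:
    """One normalized sample, regardless of which protocol produced it."""
    tag: str
    timestamp: float  # UNIX seconds
    value: float

class Connector(ABC):
    """Interface that every protocol-specific connector implements."""

    @abstractmethod
    def read(self) -> Iterator[Reading]:
        """Yield normalized readings from the underlying device or historian."""

class CsvHistorianConnector(Connector):
    """Example connector for a historian export in CSV form (tag,timestamp,value)."""

    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterator[Reading]:
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                # Map the source's columns onto the common Reading format.
                yield Reading(
                    tag=row["tag"],
                    timestamp=float(row["timestamp"]),
                    value=float(row["value"]),
                )
```

Each additional protocol (OPC UA, Modbus, MQTT, proprietary historian APIs, and so on) would get its own `Connector` subclass, which is exactly where the implementation and support effort accumulates.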

Data Quality

Industrial data tends to have poor quality. Typical issues include missing data, duplicates, noise, incorrect values, stuck values, and more. In addition, metadata is almost always absent from the data source and must be added before analytics can be performed. Data quality is a major challenge for data science teams.
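As a rough illustration (not production code), the sketch below shows how a cleanup pass over a single tag's time series might look using pandas. The `clean_tag` function, the expected "value" column, and the thresholds are all assumptions for the example:

```python
import pandas as pd

def clean_tag(df: pd.DataFrame, stuck_window: int = 10) -> pd.DataFrame:
    """Illustrative cleanup for one tag's time series.

    Expects a DataFrame with a DatetimeIndex and a 'value' column;
    the window and limit below are placeholders, not recommended settings.
    """
    df = df[~df.index.duplicated(keep="first")].copy()  # drop duplicate samples
    df = df.sort_index()

    # Flag "stuck" values: the reading has not changed for `stuck_window` samples.
    stuck = df["value"].diff().abs().rolling(stuck_window).sum() == 0
    df.loc[stuck, "value"] = float("nan")

    # Fill short gaps (missing or stuck data) by time-based interpolation.
    df["value"] = df["value"].interpolate(method="time", limit=5)
    return df
```

Even a simple pass like this has to be repeated for every tag at every location, which is why data quality work dominates so much of the pipeline effort.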

Data Access

Unlike structured enterprise data, industrial data typically lives on-premises across potentially thousands of locations, which makes accessing it difficult. Often an edge computing device must be placed at each location, which means industrial data solutions have to manage these resource-constrained edge devices, support embedded operating systems, deal with on-premises networking and security requirements, and coordinate the solution across thousands of sites. Not only is this difficult to implement, it also requires constant monitoring. This is called edge DevOps, and just like cloud DevOps, it is a full-time job.
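As a small illustration of the monitoring side of edge DevOps, the sketch below shows the kind of periodic health report each edge device might send so a fleet of thousands of locations can be watched centrally. The heartbeat URL and payload fields are hypothetical assumptions, not a real service:

```python
import json
import socket
import time
import urllib.request

HEARTBEAT_URL = "https://example.com/fleet/heartbeat"  # hypothetical fleet endpoint

def send_heartbeat() -> None:
    """Report basic health for this edge device to a central fleet service."""
    payload = {
        "device": socket.gethostname(),
        "timestamp": time.time(),
        "status": "ok",  # real checks would cover disk, connectors, queue depth, etc.
    }
    req = urllib.request.Request(
        HEARTBEAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10):
        pass  # a 2xx response means the fleet service has seen this device recently

if __name__ == "__main__":
    while True:
        try:
            send_heartbeat()
        except OSError:
            pass  # on-premises networks are often intermittent; retry next cycle
        time.sleep(60)
```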

Data Speed and Volume

Because AI requires high-resolution data, data rates have been increasing. Just a few years ago, 1-second data was considered high-speed; today it is common to acquire data at tens of kilohertz. Even at a 1-second rate, a single tag generates roughly 2.6 million samples per month. Samples also tend to be large, because each one is typically a JSON structure with multiple metadata fields. Assuming 100 bytes per sample and 100 tags per location, that works out to about 26 GB per month per location (a short back-of-envelope check appears after the list below). This amount of data creates several challenges:

  1. Data transmission. If data is sent over cellular or satellite links, the cost becomes prohibitive. More processing needs to happen at the edge to reduce the transmitted volume.

  2. Data ingestion. Cloud providers not only charge by the amount of data ingested; some also throttle ingestion when rates get too high. Careful design is needed to mitigate both problems.

  3. Database access. Too much data and too many queries cause database performance problems and can sometimes bring the database down entirely. Techniques such as shadow databases and query rate limiting should be implemented to improve performance and reduce database cost.
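The back-of-envelope arithmetic behind the numbers above, and the effect of reducing data at the edge before transmission (point 1), can be checked in a few lines. The 30-day month and the 1-minute aggregation factor are assumptions for illustration:

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 2,592,000 samples/tag at 1 sample per second
BYTES_PER_SAMPLE = 100                  # JSON sample with multiple metadata fields
TAGS_PER_LOCATION = 100

raw_bytes = SECONDS_PER_MONTH * BYTES_PER_SAMPLE * TAGS_PER_LOCATION
print(f"raw: {raw_bytes / 1e9:.1f} GB per location per month")          # ~25.9 GB

# Aggregating at the edge, e.g. keeping one summarized sample per minute,
# cuts the transmitted volume by the same factor of 60.
downsampled = raw_bytes / 60
print(f"1-minute aggregates: {downsampled / 1e9:.2f} GB per location")  # ~0.43 GB
```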

Solution

We have seen many enterprises struggle to scale their industrial data solutions. Our recommendation is to work with a vendor that has the experience to provide large-scale data pipelines. The data pipeline is the part of the solution that collects, cleans, and prepares the data; it is also the part that is cumbersome to build yet does not directly generate business value. In other words, the data pipeline is like cloud infrastructure: you need it, but you wouldn't build it yourself. Enterprises should instead focus on the AI and data applications that sit above the pipeline, because that is where the business value lies and where their domain expertise matters most.
