Data Lake System Implementation

Summary
The data lake will store the incoming data from the sensors and actuators in the three demonstration plants. It will keep the raw information as it is sent from the devices and provide a data processing pipeline that validates, cleans, homogenizes, aggregates and transfers the data to a data store. That store will serve as the data source for the digital twin, the representation dashboards and the Deep Learning algorithms. The data store will also provide a semantic data layer that enriches the information with annotations to be used in the dashboard representations and the digital twins. This semantic data will be produced by the data processing pipeline and will act as an abstraction layer between the raw sensor data and the representation layers, giving the several processes involved a common representation model.

The data lake will be a common infrastructure for the three case studies and will be hosted in a public or private cloud environment, accessible to all interested parties under the high security provided by the IMPACT platform. Data processing will be carried out by elastic infrastructure components based on big data technologies such as Spark and Hadoop, running on containerized workload management environments such as Kubernetes.
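
The pipeline stages named in this summary (validate, clean, homogenize, aggregate, annotate) can be illustrated with a minimal sketch. It is written in plain Python rather than Spark so that it runs anywhere, and it is not the project's implementation: the record fields (plant, sensor, value, unit), the unit names, the ANNOTATIONS table and every function name are assumptions made for illustration only.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical semantic annotations keyed by sensor id (assumed vocabulary).
ANNOTATIONS = {"t1": {"quantity": "temperature"}}

REQUIRED_FIELDS = {"plant", "sensor", "value", "unit"}  # assumed record schema


def validate(records):
    """Keep only records that carry every field the later stages rely on."""
    return [r for r in records if REQUIRED_FIELDS <= r.keys()]


def clean(records):
    """Drop readings whose value is not a finite number (NaN, inf, text)."""
    return [
        r for r in records
        if isinstance(r["value"], (int, float))
        and not isinstance(r["value"], bool)
        and math.isfinite(r["value"])
    ]


def homogenize(records):
    """Convert Fahrenheit readings to Celsius; other units pass through."""
    out = []
    for r in records:
        if r["unit"] == "degF":
            r = {**r, "value": (r["value"] - 32) * 5 / 9, "unit": "degC"}
        out.append(r)
    return out


def aggregate(records):
    """Average the readings per (plant, sensor, unit) and attach annotations."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["plant"], r["sensor"], r["unit"])].append(r["value"])
    return [
        {
            "plant": plant,
            "sensor": sensor,
            "unit": unit,
            "mean": mean(values),
            "count": len(values),
            **ANNOTATIONS.get(sensor, {}),
        }
        for (plant, sensor, unit), values in groups.items()
    ]


def run_pipeline(raw_records):
    """Apply the stages in order; the raw input list is left untouched."""
    return aggregate(homogenize(clean(validate(raw_records))))


if __name__ == "__main__":
    raw = [
        {"plant": "A", "sensor": "t1", "value": 20.0, "unit": "degC"},
        {"plant": "A", "sensor": "t1", "value": 212.0, "unit": "degF"},
        {"plant": "A", "sensor": "t1", "value": float("nan"), "unit": "degC"},
        {"plant": "B", "sensor": "t1", "unit": "degC"},  # missing value field
    ]
    print(run_pipeline(raw))
```

Each stage returns a new list and never mutates its input, which mirrors the idea that the raw data stays in the lake exactly as the devices sent it while the processed and annotated records go to a separate store. In a Spark deployment the same stages would typically become DataFrame transformations, but that mapping is not specified in the summary.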