An Observability System for the DaFne Platform
1 Department of Computer Science, University of Applied Sciences Hamburg (HAW)
The research project DaFne at the University of Applied Sciences Hamburg builds a generic machine learning platform. This platform is based on Kubernetes and has a microservice architecture. Observability is key to running this platform successfully. It comprises a wide range of topics such as monitoring, logging, tracing, and measuring [2]. At this point, the DaFne platform has neither an observability model nor an observability strategy. This raises the following research question (RQ): How can the DaFne platform reach observability?
The platform has multiple use cases, such as data generation, neighborhood generation and agent-based generation of mobility data, and more may follow in the future. All these use cases have one aspect in common: they use machine learning algorithms to generate and evaluate data. Against this background, we need to check the metrics of the platform and monitor the health status of the applications. It is important to log both the requests to the platform and the output of each service. This information makes it possible to detect bottlenecks and failures [1].
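How logged request data can point to a bottleneck can be illustrated with a minimal sketch. It assumes that each request is logged as a record with a service name and a duration in milliseconds. The field names `service` and `duration_ms`, the function `slowest_services` and the sample values are hypothetical and not part of the DaFne platform.

```python
from collections import defaultdict
from statistics import mean


def slowest_services(records, threshold_ms):
    """Return the services whose mean request duration exceeds threshold_ms,
    ordered from slowest to fastest."""
    durations = defaultdict(list)
    for rec in records:
        durations[rec["service"]].append(rec["duration_ms"])

    # Mean duration per service
    averages = {svc: mean(vals) for svc, vals in durations.items()}

    return sorted(
        (svc for svc, avg in averages.items() if avg > threshold_ms),
        key=lambda svc: averages[svc],
        reverse=True,
    )


# Hypothetical request log records; service names and values are illustrative only.
SAMPLE_RECORDS = [
    {"service": "data-generation", "duration_ms": 120},
    {"service": "data-generation", "duration_ms": 180},
    {"service": "neighborhood-generation", "duration_ms": 950},
    {"service": "neighborhood-generation", "duration_ms": 1050},
    {"service": "evaluation", "duration_ms": 40},
]

if __name__ == "__main__":
    print(slowest_services(SAMPLE_RECORDS, threshold_ms=500))
```

A real deployment would likely use percentiles rather than the mean and read the records from the central data store; the sketch only shows why per-request logs are a prerequisite for this kind of analysis.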
Figure 1: The Design Science Research Framework, adapted from Hevner et al. [3]
Figure 1 shows how this research is embedded in the concept of Design Science Research and how the design is influenced by existing methodologies and foundations. It also shows how the observability strategy and model are applied by the people and organizations who use the platform. The model and strategy will be evaluated, for example by experts or in field studies, in order to be refined and approved [4].
Figure 2: Overview of full-stack observability [2]
As seen in Figure 2, the full-stack observability approach covers the three most important parts of a system: security, infrastructure and application. Full-stack observability is based on one source of truth. The resulting data and metadata can then be used to analyze and improve the DaFne platform itself using AI algorithms.
The envisioned solution to the RQ is an architecture model based on the concept of full-stack observability. It includes various monitoring, logging and measurement tools that send their data to a common data lake. Deciding which tools fit the platform is part of the research process and influences the architecture model. All data collected by the tools must be stored in a single source. This reduces the workload on the network and the services. It also simplifies access, because the visualization reads all data from one place. Visualization helps to identify bottlenecks and makes it easier to see the big picture of the system. An important component is a real-time view of the system, especially of the infrastructure resources, because a lot of data is computed on the platform using ML algorithms. It is therefore useful to see whether the system’s resources are sufficient and shared fairly. Kubernetes offers several APIs for measuring pods, containers and services. A logging standard for the services should also be defined in the course of the investigation.
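What a shared logging standard for the services might look like can be sketched with Python's standard `logging` module. The example assumes one JSON object per log line, a format that log collectors can usually parse without service-specific rules. The field set (`timestamp`, `level`, `service`, `message`) and the helper names `JsonFormatter` and `build_logger` are assumptions made for illustration; the actual standard is still to be defined in the research.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line (hypothetical field set)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Create a logger that writes JSON lines to the given stream.

    If stream is None, logging.StreamHandler falls back to sys.stderr.
    """
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("data-generation")  # service name is illustrative
    log.info("generated %d rows", 42)
```

If every service emits the same fields, the records can be stored in the common data lake and filtered by service or level without per-service parsing logic.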
The outcome of this research will be an observability model and an observability strategy for the DaFne platform. These artifacts will be evaluated by partners from the project group and adapted over several iterations. The evaluation will also be carried out through experiments on the prototype. In further work, the model and strategy will be implemented and the results will be documented.