The second release of the AEGIS integrated prototype features a partially functional, high-fidelity software prototype connected to a deployed version of the platform, providing an initial interface with the basic UI/UX for the users of the platform.
The GUI/Front-end has been substantially improved according to the initial feedback received, and is still based upon the AEGIS institutional web site. When users access the platform for the first time, they can register an account and then proceed with the login. Once logged into the platform, a main page shows two tabs: one for the Assets, the other for the Public Datasets. The right side shows the list of the user's projects, as well as a button for creating a new one. After a project has been selected, a new page shows the related activity history and the main menu on the left side, including the following items: Get Started, Assets, Project Datasets, Query Builder, Analytics, Visualiser, Jupyter, Zeppelin, Kafka, Jobs, Metadata Designer, Settings, Members, Project Metadata. In addition, a search function is available in the top-right corner of the page (see Figure 1).
Below an overview of the main components of the platform is provided.
Data Harvester and Annotator
These are two tightly coupled components that facilitate the import of data and metadata from a variety of heterogeneous sources, also undertaking the necessary transformation actions towards the required data format and structure. After the initial testing, the StreamSets platform (see https://www.aegis-bigdata.eu/releasing-the-aegis-platform-v1/) proved not to be flexible enough to enable generic use within the current ecosystem. Thus, despite being a powerful tool, StreamSets was dropped in favour of extending the software that had previously been written by Fraunhofer FOKUS.
Data Harvester Microservices
To tackle the problems mentioned above, a different approach was chosen. Instead of using third-party software, the Java-based EDP Metadata Transformer Service (EMTS) has been split up into its components, transforming the monolithic application into a microservice architecture consisting of the following four basic services:
- Importer – implements all functionality for retrieving data from a specific data source
- Transformer – converts the retrieved data from an importer into the target format of the AEGIS platform
- Aggregator – collects converted data from a transformer over a configurable time interval
- Exporter – uploads transformed and/or aggregated data to the AEGIS platform
As a concrete example: weather data may be imported hourly from an external service as JSON, which is then transformed into CSV with values converted into the metric system. The results are then aggregated over 24 hours and the final file is exported to an analytics platform.
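The four-stage pipe can be illustrated with the weather example above. The real services are separate Java/Vert.x microservices communicating over REST; the following is a minimal sketch in Python for brevity, with all names and sample data hypothetical.

```python
import json

def importer():
    # Stands in for an hourly fetch from an external weather service (JSON).
    return json.dumps([{"time": "2018-06-01T00:00", "temp_f": 68.0},
                       {"time": "2018-06-01T01:00", "temp_f": 71.6}])

def transformer(raw_json):
    # JSON -> CSV rows, with Fahrenheit converted to Celsius (metric system).
    rows = []
    for rec in json.loads(raw_json):
        temp_c = round((rec["temp_f"] - 32) * 5 / 9, 1)
        rows.append(f'{rec["time"]},{temp_c}')
    return rows

def aggregator(batches):
    # Collects converted rows over a configurable interval (here: one batch).
    header = ["time,temp_c"]
    return "\n".join(header + [row for batch in batches for row in batch])

def exporter(csv_text):
    # Would upload the final file to the AEGIS platform; here it just returns it.
    return csv_text

csv_out = exporter(aggregator([transformer(importer())]))
print(csv_out.splitlines()[0])  # -> time,temp_c
```

In the actual architecture each stage is an independently deployable service, so an importer can be swapped without touching the transformer or exporter.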
Data Harvester Technology
In order to reuse as much code as possible, the microservices introduced in the previous section have also been written in Java. To aid the creation of each service, the Eclipse Vert.x framework has been chosen. Apart from promoting an asynchronous programming paradigm that ensures high availability and throughput, Eclipse Vert.x also provides extensive built-in support for managing microservices. The final services are deployed via containerisation, in which each service is isolated from its host's and from other services' operating systems. The services are thereby packaged separately as images. The technology used for building these images is Docker. Aside from enhancing security, this method also ensures platform independence, which means that these images (and consequently the services in question) will run on any infrastructure providing a Docker runtime. To enable frictionless interaction between the various microservices, their APIs have been defined using the OpenAPI 3 specification, which allows the precise definition of RESTful interfaces in a commonly understood format. The new approach for implementing the Data Harvester requires a specialised frontend, which allows the orchestration, configuration and execution of a specific harvesting process (pipe). This frontend is to be developed for the next release of the AEGIS platform.
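To give an impression of such an API definition, the fragment below sketches how an importer endpoint might be described in OpenAPI 3. The path, operation and parameter names are illustrative assumptions, not the actual AEGIS interface definitions.

```yaml
openapi: "3.0.0"
info:
  title: AEGIS Importer Service (illustrative)
  version: "0.1.0"
paths:
  /pipes/{pipeId}/import:
    post:
      summary: Trigger a harvesting run for a configured pipe
      parameters:
        - name: pipeId
          in: path
          required: true
          schema:
            type: string
      responses:
        "202":
          description: Import accepted and running asynchronously
```

Because every service publishes such a definition, clients and the planned orchestration frontend can be generated or validated against a machine-readable contract.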
The Data Annotator (Metadata) was integrated into the AEGIS platform core frontend by extending the AngularJS frontend. It allows the simple provision of detailed and semantically valuable metadata for projects, datasets and files in the AEGIS platform. In the following releases the Data Annotator will be extended to support more detailed metadata.
In the second release of the AEGIS platform, a two-fold approach is followed for the data cleansing tool: (a) an offline cleansing tool, residing where the data are located, that provides the necessary cleansing processes and functionalities with a variety of techniques for data validation, data cleansing and data completion prior to importing the data into the AEGIS platform, and (b) an online cleansing tool for data cleansing and manipulation, offering a set of functionalities during the data query creation process and addressing computationally intensive cleansing tasks by leveraging the computational power of the AEGIS platform. Concerning the online cleansing tool, the offered functionalities are incorporated into the Query Builder component (see the corresponding section below). The purpose of the offline cleansing tool is to provide an easily customisable tool, adaptable to the user's needs, that enhances the user experience and helps users accomplish the required cleansing tasks. The tool is implemented as a standalone application written in Python using the Flask microframework and a set of libraries such as Pandas and NumPy. In the first prototype version of the offline cleansing tool, the rules for data validation, data cleansing and missing data handling can be set according to the user's needs. Additionally, the user is able to review the results of the executed cleansing tasks through the user interface. The offline cleansing tool also offers a REST API with the appropriate endpoints for uploading the dataset that will be used in the cleansing process and for executing the cleansing tasks. Upcoming versions of the tool will contain several enhancements and updates in terms of cleansing functionalities and rules definition.
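The rule-based validation and missing-data handling described above can be sketched as follows. The actual tool is a Flask application built on Pandas/NumPy; this standard-library sketch only illustrates the idea of user-defined rules, and all names and sample data are hypothetical.

```python
def clean(records, rules):
    """Apply per-column validation and fill rules to a list of row dicts."""
    report = {"dropped": 0, "filled": 0}
    cleaned = []
    for rec in records:
        ok = True
        for col, rule in rules.items():
            value = rec.get(col)
            if value is None:
                if "fill" in rule:          # missing-data handling
                    rec[col] = rule["fill"]
                    report["filled"] += 1
                else:
                    ok = False              # missing value, no fill rule: drop row
            elif "valid" in rule and not rule["valid"](value):
                ok = False                  # validation failure: drop row
        if ok:
            cleaned.append(rec)
        else:
            report["dropped"] += 1
    return cleaned, report

rows = [{"temp": 21.5, "city": "Athens"},
        {"temp": None, "city": "Berlin"},
        {"temp": 999.0, "city": None}]
rules = {"temp": {"valid": lambda t: -50 <= t <= 60, "fill": 0.0},
         "city": {}}
cleaned, report = clean(rows, rules)  # keeps two rows, drops one
```

The execution report mirrors the tool's ability to let the user review the results of the cleansing tasks before importing the data into the platform.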
The anonymisation tool is an extensible, schema-agnostic plugin that allows efficient, real-time data anonymisation. The anonymisation tool has been used for offline, private purposes, but also offers the ability to output the anonymised data through a secured web API. The tool either syncs with private database servers or imports files from the filesystem, and executes anonymisation functions on datasets of various sizes with little or no overhead. The purpose of the anonymisation is to allow the exploitation of the raw data in the system while accounting for privacy concerns and legal limitations.
The tool allows the user to set up a new configuration or edit an existing, saved one. When creating a new configuration, the system prompts the user to connect to the private database backend or select a file to open, and to select the entities/tables to anonymise. The system then prompts the user to select which fields from the data source to expose in the anonymised set, as well as the anonymisation function to be applied. The user is able to execute queries and test the anonymised output through the tool's integrated console. The anonymised (output) data can be exposed securely to external parties through an API: users access the anonymised data in JSON format via the API provided by the anonymisation tool, using their private access keys.
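The field-level configuration described above, selecting which fields to expose and which anonymisation function to apply, might look like the following sketch. The function names and configuration format are hypothetical illustrations, not the tool's actual API.

```python
import hashlib

def pseudonymise(value, salt="aegis-demo"):
    # One-way salted hash: the same input always maps to the same token,
    # so joins across records remain possible without revealing the value.
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]

def generalise_age(age):
    # Coarsen exact ages into 10-year bands.
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Per-field configuration: None means "expose as-is";
# fields absent from the configuration are never exposed.
CONFIG = {"name": pseudonymise, "age": generalise_age, "city": None}

def anonymise(record, config=CONFIG):
    out = {}
    for field, fn in config.items():
        out[field] = fn(record[field]) if fn else record[field]
    return out

row = {"name": "Alice", "age": 34, "city": "Athens", "ssn": "123-45-6789"}
anon = anonymise(row)  # 'ssn' is not in the configuration, so it is dropped
```

Whitelisting exposed fields (rather than blacklisting sensitive ones) is the safer default when the output is served to external parties over the web API.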
The brokerage engine of AEGIS acts as a trusted way to record and keep a log of transactions on the AEGIS platform, which mostly concern the sharing of the different data assets. The current implementation of the brokerage engine is based on the Hyperledger Fabric open source distribution. The supported versions of the transaction models are based on the Data Policy Framework and have been optimised to reflect the main points of the Business Brokerage Framework. The Blockchain Engine is deployed on GFT servers and ships together with the overall AEGIS distribution, facilitating in this manner the transactions that happen on the platform. The connection with the platform is based on the REST API exposed by the Blockchain Engine, through which essential actions on the platform, such as user creation and dataset sharing, are also recorded in the Blockchain ledger. The brokerage engine, after its customisation, has been containerised with Docker in order to allow for easier and faster deployment. Moreover, this approach will allow multiple nodes of the brokerage engine to be deployed in the different AEGIS clusters that may be set up in the future. These nodes can then be interconnected into a single AEGIS distributed ledger, provided they are all connected to the same public network and specific peer and orderer keys are issued to facilitate interconnection and service discovery.
Hopsworks Integrated Services
The AEGIS platform provides data management and processing, user management, and service monitoring through the use of Hopsworks integrated services. Hopsworks introduces the notion of Project, Dataset, and User to enable multi-tenancy within the context of data management. Data processing includes data parallel processing services such as MapReduce, Spark, and Flink, as well as interactive analytics using notebooks such as Zeppelin and Jupyter. Full-text search capability is offered by the included ELK stack component (Elasticsearch). Real time analytics is enabled by the use of the included Kafka service.
Elasticsearch, one of the ELK stack components, provides full-text search for exploring projects and datasets within the AEGIS platform. The search space available to each user depends on the context from which the user searches. For example, within the context of the home page, the search space includes public datasets, projects, and private datasets. Inside a project, the scope of the search is reduced to the project's datasets, including datasets shared from other projects. Thus, the search function covers all projects in which a user is involved, all datasets owned by or shared with those projects, and all public datasets.
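The context-dependent scoping could be expressed as an Elasticsearch bool query along the following lines. The field names are illustrative assumptions and do not reflect the actual Hopsworks index schema.

```python
def build_search_body(term, user, project=None):
    """Build a hypothetical Elasticsearch query body scoped to the context."""
    if project:
        # Inside a project: restrict to that project's (own and shared) datasets.
        scope = [{"term": {"project_id": project}}]
    else:
        # Home page: public datasets plus everything the user is a member of.
        scope = [{"term": {"public": True}},
                 {"term": {"members": user}}]
    return {"query": {"bool": {
        "must": [{"match": {"name": term}}],
        "should": scope,
        "minimum_should_match": 1}}}

body = build_search_body("weather", user="alice", project="proj42")
```

The `minimum_should_match` clause ensures that at least one scope condition holds, so a match on the name alone can never leak results outside the user's search space.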
Metadata Service (formerly AEGIS Linked Data Store)
The Metadata Service is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation for the processing of the data within the AEGIS platform. It is based on the AEGIS ontology and vocabulary. For the second release, the component was further developed, enhanced and better integrated.
The foundation of the Metadata Service is the Apache Fuseki Triplestore. It can be directly accessed here:
It offers multiple standardised Linked Data interfaces, such as SPARQL and the Graph Store HTTP Protocol. These interfaces can be utilised by other components of the AEGIS platform, especially the Query Builder. Fuseki acts only as a storage layer and is not supposed to be accessed directly by users or any other component, with the exception of the SPARQL interface, which can be used for executing complex queries against the metadata of the AEGIS platform. In addition, the AEGIS ontologies are publicly available at http://aegis.fokus.fraunhofer.de/.
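A complex query against the SPARQL interface might be built as follows. The endpoint path is an assumption, and the DCAT/DCT terms are examples of the kind of vocabulary such dataset metadata typically uses, not a guaranteed excerpt of the AEGIS ontology.

```python
from urllib.parse import urlencode

# List datasets and their titles from the triplestore (illustrative query).
QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
} LIMIT 10
"""

# The HTTP GET request a client would send (not executed here);
# the '/sparql' path is a hypothetical endpoint location.
params = urlencode({"query": QUERY,
                    "format": "application/sparql-results+json"})
request_url = "http://aegis.fokus.fraunhofer.de/sparql?" + params
```

Requesting JSON-formatted results keeps the response easy to consume from components such as the Query Builder without an RDF parser.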
The Triplestore only offers Linked Data interfaces and no rich management functionality. Therefore, an additional service is required, providing further functionality for the management of the metadata, in particular the straightforward creation of metadata sets. A first prototype is available here:
It interacts with the Fuseki triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata. It maps the JSON input to the Linked Data structures defined by the AEGIS ontology. For the second release, a basic recommendation service was implemented. It suggests suitable and similar datasets based on an input dataset. To this end, several characteristics of the dataset are matched against the stored metadata, e.g. keywords or the semantic tabular information. For future releases, this feature will be extended and improved. The Metadata Service is developed in Java, based on the Play Framework.
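A creation request to such a JSON-based API might look like the sketch below. The endpoint and field names are hypothetical illustrations; the service, not the client, performs the mapping onto the Linked Data structures of the AEGIS ontology.

```python
import json

# Illustrative dataset metadata a client would submit, e.g. via
#   POST /metadata/datasets   (Content-Type: application/json)
# -- the endpoint path is an assumption, not the documented API.
metadata = {
    "title": "Athens weather measurements",
    "description": "Hourly temperature readings, June 2018",
    "keywords": ["weather", "temperature", "athens"],
    "distribution": {"format": "text/csv", "file": "weather.csv"},
}

payload = json.dumps(metadata)
```

Keeping the client-facing format as plain JSON means other AEGIS components can create metadata without knowing RDF, while the service translates each field (title, keywords, distribution) into triples in the Fuseki store.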
The Visualiser is the component enabling the advanced visualisation capabilities of the AEGIS platform. In accordance with the latest design and the specification of the component, the purpose of the Visualiser remains two-fold: (1) to provide visualisations of the results generated by the Algorithm Execution Container and (2) to provide visualisations of the results generated by the queries composed and executed by the Query Builder.
The Visualiser, through its intuitive and easy-to-use user interface, offers a variety of visualisation formats, spanning from simple static charts to interactive charts with multiple layers of information and several customisation options. The current implementation of the Visualiser component supports the following visualisation types:
- Scatter plot
- Pie chart
- Bar chart
- Line chart
- Box plot
- Time series
- Bubble chart
The Visualiser is implemented as a predefined Jupyter notebook. In addition to Jupyter, two open source Python libraries were utilised, namely the Folium and Highcharts libraries. Within the AEGIS platform, the Visualiser can be accessed through Jupyter, which is integrated as a service within the AEGIS Front End. The execution workflow of the Visualiser component proceeds in the following steps:
- At first, when the Visualiser is loaded, the user is presented with a list of options to define the dataset that will be used to create the visualisation. The user can navigate through the available datasets within the project's folders and select the desired one. Upon selection, a preview of the dataset in tabular format is presented to the user.
- In the next step, the user is presented with the list of available visualisation types. Once the desired visualisation type is selected, the user is presented with the list of parameters available for that type. These parameters range from the variables that will be used in the visualisation and the titles displayed on the visualisation axes to type-specific parameters such as the aggregation function or the class variable.
- Once the visualisation type has been selected and the corresponding parameters have been set, the user can trigger the visualisation creation.
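The three steps above can be sketched as the assembly of a single chart configuration. The type registry and parameter names are hypothetical; the actual Visualiser renders the result with Folium/Highcharts inside the notebook.

```python
# Illustrative registry: each visualisation type declares its required parameters.
VISUALISATION_TYPES = {
    "scatter": {"params": ["x", "y"]},
    "bar":     {"params": ["x", "y", "aggregation"]},
    "pie":     {"params": ["labels", "values"]},
}

def build_chart_config(dataset_path, vis_type, **params):
    """Validate the selected parameters and assemble the chart configuration."""
    spec = VISUALISATION_TYPES[vis_type]
    missing = [p for p in spec["params"] if p not in params]
    if missing:
        raise ValueError(f"missing parameters for {vis_type}: {missing}")
    return {"dataset": dataset_path, "type": vis_type, "params": params}

config = build_chart_config("project/weather.csv", "bar",
                            x="city", y="temp_c", aggregation="mean")
```

Validating the parameters before triggering the visualisation mirrors the guided flow of the notebook: the user cannot reach the creation step until every parameter required by the chosen type has been set.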
Algorithm Execution Container
The Algorithm Execution Container is an interface for data analysis that resides on top of Zeppelin, one of the most popular notebooks used by data analysts. The overall concept of this module is to accelerate analysis execution by simplifying the steps that data analysts perform, eliminating the need to author code directly in the notebook. The current implementation of the Algorithm Execution Container is based on AngularJS and Python. In terms of algorithms, the container exploits the MLlib machine learning library and exposes 16 different algorithms, grouped into 5 algorithmic families. Upon launch of the container, the Spark interpreter of the AEGIS platform is fired up to power the code that needs to be executed by the Zeppelin notebook. The container takes as input a dataset, which has to be formatted appropriately for the selected algorithm. A pointer to the URL of the input dataset is provided; this has to be a dataset stored in the AEGIS Data Store and accessible to the user performing the analysis. The user then selects the algorithmic family and the specific algorithm to run, and is presented with the parameters relevant to the analysis, so that preferences can be provided. Upon execution, the necessary Zeppelin paragraphs run in the background and the user is presented with the final result of the analysis. Simple, predefined visualisation options are also provided by the Zeppelin notebook. The final output of the analysis is automatically stored back in the AEGIS Data Store, and the model that was used for the analysis is stored alongside the analysis results.
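The family/algorithm/parameter selection described above can be sketched as a small registry with a validation step. The family and algorithm names below are examples in the spirit of MLlib's catalogue, not the container's exact list of 16 algorithms.

```python
# Hypothetical grouping of MLlib algorithms into families.
FAMILIES = {
    "classification": ["logistic_regression", "decision_tree", "random_forest"],
    "regression":     ["linear_regression", "gradient_boosted_trees"],
    "clustering":     ["kmeans", "gaussian_mixture"],
}

def prepare_run(dataset_url, family, algorithm, params=None):
    """Validate a run request before the hidden Zeppelin paragraphs execute it."""
    if algorithm not in FAMILIES.get(family, []):
        raise ValueError(f"{algorithm!r} is not in family {family!r}")
    # In the real container this request would drive the background Zeppelin
    # paragraphs that run the Spark/MLlib job against the AEGIS Data Store.
    return {"dataset": dataset_url, "family": family,
            "algorithm": algorithm, "params": params or {}}

run = prepare_run("hdfs:///projects/demo/weather.csv",
                  "clustering", "kmeans", {"k": 3})
```

Centralising the catalogue in one structure is what lets the container render only the parameters relevant to the selected algorithm instead of exposing raw notebook code.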
According to the project development plan, the next version of the Algorithm Execution Container is going to be deployed over the Jupyter notebook, thus covering the two most popular notebook implementations available, and will enable the creation of a unified analytics flow within the platform by interconnecting the container with the Query Builder and the Visualiser.
References
- https://www.aegis-bigdata.eu/
- https://vertx.io/
- https://www.docker.com/
- https://github.com/OAI/OpenAPI-Specification
- http://jupyter.org/
- http://folium.readthedocs.io/en/latest/
- https://www.highcharts.com/
Blog post authors: GFT