AEGIS Releases Platform v2


The second release of the AEGIS integrated prototype features a partially functional, high-fidelity software prototype connected to a deployed version of the platform, providing an initial interface that includes the basic UI/UX for the users of the platform.

The GUI/front-end has been substantially improved according to the initial feedback received, and is still based upon the AEGIS institutional web site [1]. When users access the platform for the first time, they can register for an account and then proceed with the login. Once logged into the platform, a main page shows two tabs: one for the Assets, the other for the Public Datasets. On the right side, the list of the user's projects is shown, together with a button for creating a new one. After a project has been selected, a new page shows the related activity history and the main menu on the left side, including the following items: Get Started, Assets, Project Datasets, Query Builder, Analytics, Visualiser, Jupyter, Zeppelin, Kafka, Jobs, Metadata Designer, Settings, Members, Project Metadata. In addition, a search functionality is available in the top-right corner of the page (see Figure 1).

Figure 1 – AEGIS integrated prototype GUI

An overview of the main components of the platform is provided below.

Data Harvester and Annotator

These are two tightly coupled components that facilitate the import of data and metadata from a variety of heterogeneous sources and also carry out the necessary transformations into the required data format and structure. After the initial testing, the StreamSets platform (see https://www.aegis-bigdata.eu/releasing-the-aegis-platform-v1/) proved not to be flexible enough for generic use within the current ecosystem. Thus, despite being a powerful tool, StreamSets was dropped in favour of extending the software previously written by Fraunhofer FOKUS.

Data Harvester Microservices

To tackle the aforementioned problems, a different approach was chosen. Instead of using third-party software, the Java-based EDP Metadata Transformer Service (EMTS) has been split up into its components, transforming the monolithic application into a microservice architecture consisting of the following four basic services:

  • Importer – implements all functionality for retrieving data from a specific data source
  • Transformer – converts the retrieved data from an importer into the target format of the AEGIS platform
  • Aggregator – collects converted data from a transformer over a configurable time interval
  • Exporter – uploads transformed and/or aggregated data to the AEGIS platform

To give a concrete example: weather data may be imported hourly from an external service as JSON, which is then transformed into CSV with the values converted into the metric system. The results are aggregated over 24 hours and the final file is exported to an analytics platform.
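
To make this division of responsibilities concrete, the following Python sketch mimics the weather example as four plain functions. It is purely illustrative: the actual services are separate Java/Vert.x microservices communicating over REST, and all names and data layouts below are invented.

```python
import csv
import io

def importer():
    """Retrieve raw weather records from an external service (stubbed here)."""
    return [{"station": "A1", "temp_f": 68.0, "hour": h} for h in range(24)]

def transformer(records):
    """Convert JSON records to CSV rows, mapping values to the metric system."""
    rows = []
    for r in records:
        temp_c = (r["temp_f"] - 32) * 5 / 9  # imperial -> metric
        rows.append([r["station"], r["hour"], round(temp_c, 2)])
    return rows

def aggregator(rows, window=24):
    """Collect transformed rows over a configurable interval (here: 24 samples)."""
    return rows[:window]

def exporter(rows):
    """Serialise the aggregated rows for upload to the target platform (stubbed)."""
    buf = io.StringIO()
    csv.writer(buf).writerows([["station", "hour", "temp_c"]] + rows)
    print(buf.getvalue())  # in practice: upload to the AEGIS platform

exporter(aggregator(transformer(importer())))
```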

Data Harvester Technology

In order to reuse as much code as possible, the microservices introduced in the previous section have also been written in Java. To aid the creation of each service, the Eclipse Vert.x [2] framework has been chosen. Apart from promoting an asynchronous programming paradigm that ensures high availability and throughput, Eclipse Vert.x also provides extensive built-in support for managing microservices. The final services are deployed via containerisation, in which each service is isolated from the host and from the other services' operating systems. The services are thereby packaged separately as images. The technology used for building these images is Docker [3]. Aside from enhancing security, this method also ensures platform independence, which means that these images (and consequently the services in question) will run on any infrastructure providing a Docker runtime. To enable frictionless interaction between the various microservices, their APIs have been defined using the OpenAPI 3 [4] specification, which allows the precise definition of RESTful interfaces in a commonly understood format. The new approach for implementing the Data Harvester requires a specialised frontend that allows the orchestration, configuration and execution of a specific harvesting process (pipe). This frontend is to be developed for the next release of the AEGIS platform.

Data Annotator

The Data Annotator (Metadata) was integrated into the AEGIS platform core frontend by extending the AngularJS frontend. It allows the simple provision of detailed and semantically valuable metadata for projects, datasets and files in the AEGIS platform. In the following releases, the Data Annotator will be extended to support more detailed metadata.

Cleansing Tool

In the second release of the AEGIS platform, a two-fold approach is followed for the data cleansing tool: (a) an offline cleansing tool, residing where the data are located, that provides the necessary cleansing processes and functionalities, with a variety of techniques enabling data validation, data cleansing and data completion prior to importing the data into the AEGIS platform, and (b) an online cleansing tool for data cleansing and manipulation, with a set of functionalities offered during the data query creation process, addressing certain simple cleansing tasks that are computationally intensive by leveraging the computational power of the AEGIS platform. Concerning the online cleansing tool, the offered functionalities are incorporated into the Query Builder component (see the corresponding section below). The purpose of the offline cleansing tool is to provide a tool that is easily customisable and adaptable to the user's needs, enhancing the user experience and helping users accomplish the required cleansing tasks. The tool is implemented as a standalone application written in Python, using the Flask microframework and a set of libraries such as Pandas and NumPy. In the first prototype version of the offline cleansing tool, the rules for data validation, data cleansing and missing data handling can be set according to the user's needs. Additionally, the user is able to review the results of the execution of the cleansing tasks through the user interface. The offline cleansing tool also offers a REST API with the appropriate endpoints for uploading the dataset that will be used in the cleansing process and for executing the cleansing tasks. The upcoming versions of the tool will contain several enhancements and updates in terms of cleansing functionalities and rule definitions.
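
As a rough illustration of how such an offline tool can be structured, the following is a minimal Flask + Pandas sketch exposing two REST endpoints, one for uploading a dataset and one for executing cleansing tasks. The endpoint names, rule format and in-memory storage are assumptions made for illustration and do not mirror the actual AEGIS tool's API.

```python
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
datasets = {}  # in-memory store: dataset id -> DataFrame (illustrative only)

@app.route("/datasets", methods=["POST"])
def upload_dataset():
    """Upload a CSV file to be used in the cleansing process."""
    df = pd.read_csv(request.files["file"])
    datasets["current"] = df
    return jsonify(rows=len(df), columns=list(df.columns))

@app.route("/cleanse", methods=["POST"])
def cleanse():
    """Execute simple cleansing tasks driven by user-defined rules."""
    rules = request.get_json()  # e.g. {"drop_duplicates": true, "fill_value": 0}
    df = datasets["current"]
    if rules.get("drop_duplicates"):
        df = df.drop_duplicates()
    if "fill_value" in rules:
        df = df.fillna(rules["fill_value"])  # simple missing-data handling
    datasets["current"] = df
    return jsonify(rows=len(df))

if __name__ == "__main__":
    app.run(port=5000)
```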

Anonymisation Tool

The anonymisation tool is an extensible, schema-agnostic plugin that allows efficient real-time data anonymisation. The tool has so far been used for offline, private usage, but also offers the ability to output the anonymised data through a secured web API. The tool either syncs with private database servers or imports files from the filesystem, and executes anonymisation functions on datasets of various sizes with little or no overhead. The purpose of the anonymisation is to allow the exploitation of the raw data in the system while accounting for privacy concerns and legal limitations.

The tool allows the user to set up a new configuration or edit an existing, saved one. When creating a new configuration, the system prompts the user to connect to the private database backend or select the file to open, and to select the entities/tables to anonymise. The system then prompts the user to select which fields from the data source to expose in the anonymised set, as well as the anonymisation function to be applied. The user is able to execute queries and test the anonymised output through the integrated console of the tool. The anonymised (output) data can be exposed to external parties in a secure way through an API, with access controlled via access keys. Users can access the anonymised data in JSON format through the API provided by the anonymisation tool, using their private access keys.
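
The following Python sketch illustrates the general idea of field-level anonymisation driven by a configuration that maps exposed fields to anonymisation functions. The function names, configuration format and sample data are hypothetical, not the tool's actual API.

```python
import hashlib
import pandas as pd

def pseudonymise(value):
    """Replace an identifier with a stable, irreversible hash (hypothetical function)."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]

def generalise_age(age):
    """Coarsen an exact age into a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Hypothetical configuration: which fields to transform, and how.
config = {"name": pseudonymise, "age": generalise_age}

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 57], "city": ["Rome", "Bonn"]})

# Only selected fields are exposed in the anonymised set.
exposed = df[list(config) + ["city"]].copy()
for field, fn in config.items():
    exposed[field] = exposed[field].map(fn)

print(exposed.to_json(orient="records"))  # anonymised output served as JSON
```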

Brokerage Engine

The brokerage engine of AEGIS acts as a trusted way to record and keep a log of transactions over the AEGIS platform, which mostly concern the sharing of the different data assets. The current implementation of the brokerage engine is based on the Hyperledger Fabric open source distribution. The versions of the transaction models supported are based on the Data Policy Framework and have been optimised to reflect the main points of the Business Brokerage Framework. The Blockchain Engine is deployed on GFT servers and ships together with the overall AEGIS distribution, thereby facilitating the transactions that happen over the platform. Connection with the platform is based on the REST API exposed by the Blockchain Engine, through which essential actions on the platform, such as user creation and dataset sharing, are also recorded in the Blockchain ledger. After its customisation, the Brokerage Engine has been containerised with the use of Docker, in order to allow for easier and faster deployment. Moreover, this approach will allow multiple nodes of the brokerage engine to be deployed in the different AEGIS clusters that may be set up in the future, and to interconnect them into a single AEGIS distributed ledger, provided they are all connected to the same public network and specific peer and orderer keys are issued to facilitate interconnection and service discovery.
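
As a hedged illustration, a platform component might record a dataset-sharing transaction through the Blockchain Engine's REST interface along the following lines. The endpoint URL, payload fields and authentication scheme below are assumptions made for illustration, not the documented API.

```python
import requests

# Hypothetical payload describing a dataset-sharing action to be logged.
payload = {
    "type": "DATASET_SHARED",
    "dataset": "weather-2018",
    "from_project": "projectA",
    "to_project": "projectB",
}

resp = requests.post(
    "https://aegis-broker.example.org/api/transactions",  # hypothetical URL
    json=payload,
    headers={"Authorization": "Bearer <access-token>"},   # placeholder token
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. the identifier of the ledger transaction
```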

Hopsworks Integrated Services

The AEGIS platform provides data management and processing, user management, and service monitoring through the use of Hopsworks integrated services. Hopsworks introduces the notions of Project, Dataset, and User to enable multi-tenancy within the context of data management. Data processing includes data-parallel processing services such as MapReduce, Spark, and Flink, as well as interactive analytics using notebooks such as Zeppelin and Jupyter. Full-text search capability is offered by the included Elasticsearch component of the ELK stack. Real-time analytics is enabled by the included Kafka service.

Full-Text Search

Elasticsearch, one of the ELK stack components, is used to provide full-text search capabilities for exploring projects and datasets within the AEGIS platform. The search space available to each user depends on the context from which the user searches. For example, within the context of the home page, the search space includes public datasets, projects, and private datasets. Inside a project, the scope of the search is reduced to the project's datasets, including the datasets shared from other projects. Thus, the search function covers all the projects in which a user is involved, all the datasets included in those projects (whether owned by the project or shared), and the public datasets.
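
The following sketch shows how such a context-scoped search could be expressed as an Elasticsearch query, narrowing the scope with an additional filter when the user searches from within a project. The index name and field names are invented; the actual AEGIS mapping may differ.

```python
import requests

def search(term, project_id=None):
    """Full-text search; restrict the scope to one project when given."""
    must = [{"match": {"name": term}}]
    if project_id is not None:
        must.append({"term": {"project_id": project_id}})  # scope reduction
    query = {"query": {"bool": {"must": must}}}
    r = requests.get("http://localhost:9200/datasets/_search", json=query, timeout=5)
    r.raise_for_status()
    return [hit["_source"] for hit in r.json()["hits"]["hits"]]

print(search("weather"))            # home-page context: all visible datasets
print(search("weather", "proj42"))  # project context: that project's datasets
```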

Metadata Service (formerly AEGIS Linked Data Store)

The Metadata Service is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation for the processing of the data within the AEGIS platform. It is based on the AEGIS ontology and vocabulary. For the second release, the component was further developed, enhanced and better integrated.

Triplestore

The foundation of the Metadata Service is the Apache Jena Fuseki triplestore. It can be accessed directly here:

http://aegis-store.fokus.fraunhofer.de.

It offers multiple standardised Linked Data interfaces, such as SPARQL or the Graph Store HTTP Protocol. These interfaces can be utilised by other components of the AEGIS platform, especially the Query Builder. Fuseki only acts as a storage layer and is not supposed to be accessed directly by users or any other component, with the exception of the SPARQL interface, which can be used for executing complex queries against the metadata of the AEGIS platform. To support this, the AEGIS ontologies are publicly available at http://aegis.fokus.fraunhofer.de/.
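
As an example of the kind of query a user could run against the public SPARQL interface, the following Python sketch (using the SPARQLWrapper library) lists dataset titles. The Fuseki dataset path ("/aegis") and the use of dct:title are assumptions; the actual AEGIS ontology defines the real vocabulary.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical dataset path under the Fuseki store.
sparql = SPARQLWrapper("http://aegis-store.fokus.fraunhofer.de/aegis/sparql")
sparql.setQuery("""
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?dataset ?title WHERE {
        ?dataset dct:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Print each dataset URI together with its title.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```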

Metadata Service

The triplestore only offers Linked Data interfaces and no rich management functionalities. Therefore, an additional service is required that provides further functionalities for the management of the metadata, particularly the straightforward creation of metadata sets. A first prototype is available here:

http://aegis-metadata.fokus.fraunhofer.de.

It interacts with the Fuseki triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata. It maps the JSON input to the Linked Data structures defined by the AEGIS ontology. For the second release, a basic recommendation service was implemented. It suggests suitable and similar datasets based on an input dataset. To this end, several characteristics of the dataset are matched against the stored metadata, e.g. keywords or the semantic tabular information. For future releases, this feature will be extended and improved. The Metadata Service is developed in Java, based on the Play Framework.
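
A hedged sketch of creating a metadata record through such a JSON-based REST API is shown below. The endpoint path and payload schema are invented for illustration; internally, the service maps this kind of JSON input onto the AEGIS ontology.

```python
import requests

# Hypothetical metadata payload for a dataset.
metadata = {
    "title": "Hourly weather observations",
    "keywords": ["weather", "temperature"],
    "columns": [{"name": "temp_c", "type": "number"}],  # semantic tabular info
}

resp = requests.post(
    "http://aegis-metadata.fokus.fraunhofer.de/metadata",  # hypothetical path
    json=metadata,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. the identifier of the new metadata record
```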

Query Builder

The Query Builder provides the capability to interactively define and execute queries on data available in the AEGIS system. It is primarily addressed to AEGIS users with a limited technical background, but is potentially useful for all, as it simplifies and accelerates the process of retrieving data and creating views on them. The Query Builder also offers some simple data cleansing functionalities. In its previous version, the Query Builder was implemented in the form of a note inside the Apache Zeppelin notebook, using mainly PySpark for the data manipulation and JavaScript and AngularJS for the user interface. In the new version, the tool has switched to the Jupyter Notebook as the more appropriate environment. The tool, potentially in slightly different flavours (to allow for more customisation of the data manipulation processes), will be directly accessible inside every newly created project through the Jupyter Notebook.

The Query Builder leverages the metadata available for each file in the AEGIS system in order to provide its enhanced data selection and manipulation capabilities, i.e. enabling/disabling certain data merging and filtering options according to the data schema, and allowing the user to perform more targeted dataset exploration and retrieval based on the available metadata. The new Query Builder version is fully integrated with the Metadata Service; hence, the available datasets and their corresponding information are retrieved from the service and presented to the user. The user can initially select a dataset of interest and then browse its files in order to choose which one to load. Once the selected file is opened, it is loaded as a Spark DataFrame, called the temp dataset (tempDF), and is available for further data manipulation.

At all times, the user can have up to two different datasets active in the Query Builder: the temporary (temp) and the master. The temporary dataset is the one currently being used and changed, whereas the master serves as a "storage point" for intermediate results while the user is processing data, and holds the final result once all data manipulation is over and the user is satisfied with the outcome. A number of filters and data processing methods are available for the user to select and apply to the temporary dataset through the "Controls" panel. Indicatively, the user can fill in null values, filter out entries based on values, rename columns, replace values, select/drop columns, etc. The user may also merge or append the temporary dataset with the master dataset. A list of selected data manipulation actions, either already applied or pending application, is always visible under the "Selected filters" panel. When the result of a series of data processing tasks on the temporary dataset is satisfactory, it can be saved as the master dataset. The user may continue processing the same temporary dataset or open a new one, or, when the query creation process is complete, save the master dataset as a new CSV file or export the query and directly modify the generated code. Finally, the result of the data manipulation, i.e. the master dataset, can be directly passed as input to higher-level AEGIS tools, such as the Visualiser and the Algorithm Execution Container.
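
The following minimal PySpark sketch reproduces the temp/master pattern described above in code form. The file names and columns are invented, and in the real Query Builder these operations are driven from the "Controls" panel rather than written by hand.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("query-builder-sketch").getOrCreate()

# Load a selected file as the temporary dataset (tempDF).
tempDF = spark.read.csv("observations.csv", header=True, inferSchema=True)

# Indicative manipulations: fill nulls, filter rows, rename a column.
tempDF = (tempDF
          .fillna({"temp_c": 0.0})
          .filter(col("temp_c") > -40)
          .withColumnRenamed("temp_c", "temperature"))

# Save the satisfactory intermediate result as the master dataset ...
masterDF = tempDF

# ... then open a new temporary dataset and append it to the master.
tempDF = (spark.read.csv("observations_2.csv", header=True, inferSchema=True)
               .withColumnRenamed("temp_c", "temperature"))
masterDF = masterDF.unionByName(tempDF)

# Finally, persist the master dataset as a new CSV file.
masterDF.write.mode("overwrite").csv("query_result", header=True)
```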

Visualiser

The Visualiser is the component enabling the advanced visualisation capabilities of the AEGIS platform. In accordance with the latest design and specification of the component, the purpose of the Visualiser remains two-fold: (1) to provide visualisations of the results generated by the Algorithm Execution Container and (2) to provide visualisations of the results generated by the queries composed and executed by the Query Builder.

Through its intuitive and easy-to-use user interface, the Visualiser offers a variety of visualisation formats, spanning from simple static charts to interactive charts with multiple layers of information and several customisation options. The current implementation of the Visualiser component supports the following visualisation types:

  • Scatter plot
  • Pie chart
  • Bar chart
  • Line chart
  • Box plot
  • Histogram
  • Time series
  • Heatmap
  • Bubble chart
  • Map

The Visualiser is implemented as a predefined Jupyter [5] notebook. In addition to Jupyter, two open source Python libraries were utilised, namely the Folium [6] and Highcharts [7] libraries. Within the AEGIS platform, the Visualiser can be accessed through Jupyter, which is integrated as a service within the AEGIS front end. The execution workflow of the Visualiser component proceeds in the following steps (a brief usage sketch follows the list):

  1. When the Visualiser is first loaded, the user is presented with a list of options in order to define the dataset that will be utilised for the visualisation. The user can navigate through the list of available datasets within the project's folders and select the desired dataset. Upon selection, a preview of the dataset in tabular format is presented to the user.
  2. Next, the user is presented with the list of available visualisation types. Once the desired visualisation type is selected, the user is presented with the list of available parameters for that type. The list of parameters spans from the variables to be used in the visualisation and the titles to be displayed on the axes, to type-specific parameters such as the aggregation function or the class variable.
  3. Once the visualisation type has been selected and the corresponding parameters have been set, the user can trigger the creation of the visualisation.
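
As a brief sketch of the kind of map visualisation the component can produce with Folium, consider the following snippet, assuming a dataset with latitude/longitude columns (the column names and data are invented for illustration).

```python
import folium
import pandas as pd

# Invented sample data standing in for a selected AEGIS dataset.
df = pd.DataFrame({
    "lat": [48.1374, 52.5200, 41.9028],
    "lon": [11.5755, 13.4050, 12.4964],
    "label": ["Munich", "Berlin", "Rome"],
})

# Centre the map on the data and add one marker per row.
m = folium.Map(location=[df["lat"].mean(), df["lon"].mean()], zoom_start=5)
for _, row in df.iterrows():
    folium.Marker([row["lat"], row["lon"]], popup=row["label"]).add_to(m)

m.save("map.html")  # in a Jupyter cell, evaluating `m` renders the map inline
```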

Algorithm Execution Container

The Algorithm Execution Container is an interface for data analysis that resides on top of Zeppelin, one of the most popular notebooks used by data analysts. The overall concept of this module is to accelerate analysis execution by simplifying the steps that data analysts perform, eliminating the need to author code directly in the notebook. The current implementation of the Algorithm Execution Container is based on AngularJS and Python. In terms of algorithms, the container exploits the MLlib machine learning library and exposes 16 different algorithms, grouped into 5 algorithmic families.

Upon launch of the container, the Spark interpreter of the AEGIS platform is fired up to power the code that needs to be executed by the Zeppelin notebook. The container takes as input a dataset, which has to be formatted appropriately for the selected algorithm. A pointer to the specific URL of the input dataset is provided; this has to be a dataset stored in the AEGIS Data Store that is accessible to the user performing the analysis. The user is then able to select the algorithmic family and the specific algorithm to run, and is presented with the parameters that are relevant to the analysis, so that they can provide their preferences. Upon execution, the necessary Zeppelin paragraphs are executed in the background and the user is presented with the final result of the analysis. Simple, predefined visualisation options are also provided by the Zeppelin notebook. The final output of the analysis is automatically stored back in the AEGIS Data Store, while the model that was used for the analysis is stored alongside the analysis results.

According to the project development plan, the next version of the Algorithm Execution Container is going to be deployed over the Jupyter notebook, thus covering the two most popular notebook implementations available, while allowing for the creation of a unified analytics flow within the platform by interconnecting the container with the Query Builder and the Visualiser.
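
As a hedged sketch of the steps the container automates, the following PySpark/MLlib snippet loads a prepared dataset, runs a selected algorithm with user-chosen parameters, and stores the fitted model together with the results. The paths, column names and parameter values are invented for illustration.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aec-sketch").getOrCreate()

# Dataset pointed to by URL in the AEGIS Data Store (stand-in path here).
df = spark.read.csv("hdfs:///Projects/demo/training.csv", header=True, inferSchema=True)

# Assemble the feature columns into the vector format MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = assembler.transform(df)

# Algorithm and parameters the user would pick in the container's UI.
lr = LogisticRegression(labelCol="label", maxIter=10, regParam=0.01)
model = lr.fit(data)

# Persist both the analysis results and the model, as the container does.
(model.transform(data)
      .select("label", "prediction")
      .write.mode("overwrite").csv("hdfs:///Projects/demo/results"))
model.save("hdfs:///Projects/demo/model")
```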

References

[1] https://www.aegis-bigdata.eu/

[2] https://vertx.io/

[3] https://www.docker.com/

[4] https://github.com/OAI/OpenAPI-Specification

[5] http://jupyter.org/

[6] http://folium.readthedocs.io/en/latest/

[7] https://www.highcharts.com/

 

Blog post authors: GFT