Today we are very happy to announce the release of the AEGIS platform v1.
The first integrated AEGIS prototype is a low-fidelity, functional mockup of the AEGIS platform’s backbone. It provides a meaningful subset of the functionality planned for the MVP and is delivered for early assessment by end users. Preliminary design work produced a set of mockups showing how a user will perform the main tasks through the AEGIS platform. The mockups were created with the Ninjamock web tool, a collaborative service for creating and sharing wireframes that lets users share projects through private URLs and export wireframes as PDF. GitHub was selected as the software development platform. It offers the distributed version control and source code management functionality of Git and adds its own features, such as access control and collaboration tools (bug tracking, feature requests, task management, and wikis). The AEGIS GitHub repository is available at the following URL: https://github.com/aegisbigdata.
A TESTBED virtual machine with 8 cores and 60GB has been provided to run the environment efficiently, and testing accounts on it have been created for the partners. An integration plan was prepared to guide the integration of the backbone infrastructure with the various services and components, focusing on a continuous integration process. At the same time, a strategy was defined for verifying and testing the platform and its components. The AEGIS partners responsible for each platform component develop on local machines, test the component [unit test] and release it on GitHub according to the commonly agreed release plan. The system integrator and all interested technical partners then integrate the web components from the AEGIS front-end locally and test the result [integration testing]. After that, the system integrator releases the updates on the AEGIS GitHub repository, and a selected partner deploys them on the TESTBED machine. For each workflow identified in an earlier task of the project, the responsible partners test their workflow on the TESTBED machine [system testing]. Finally, the TESTBED machine is made available to the demonstrators [acceptance testing].
Building on top of Hopsworks, a GUI/front-end has been released that follows the look and feel of the AEGIS institutional web site and uses HTML and AngularJS as its main technologies. A user accessing the platform for the first time can register an account and then log in. Once the user is logged in, the main page shows the list of currently available projects on the right side, together with a button for creating a new one. After a project has been selected, a new page shows the related activity history and, on the left side, the main menu with the following items: Data Sets, Queries, Metadata Designer, Visualizations, Analytics and Settings. In addition, a search function is available in the top-right corner of the page (see Figure 1).
An overview of the main components of the platform is provided below.
Data Harvester and Annotator
The Data Harvester and Data Annotator are loosely coupled with the central AEGIS platform and, for development reasons, are not yet fully integrated into the deployment process.
The basis of these components is the AEGIS metadata ontology and vocabulary. Its objective is to describe the available datasets in a detailed fashion in order to support any further processing, especially visualisations and query building. The ontology is based on the DCAT-AP specifications and extended accordingly. The core concepts of DCAT-AP were mapped to the corresponding concepts and data structures of the AEGIS platform in the following way:
- DCAT-AP Catalogues → AEGIS Projects
- DCAT-AP Datasets → AEGIS Data Sets
- DCAT-AP Distributions → AEGIS Files
In addition, the first prototype focuses on describing the semantics of tabular data simply and effectively. For this purpose, an AEGIS-specific set of Linked Data classes and properties was developed, with simplicity and reusability as the overall objectives. The first iteration of the ontology is available on GitHub as a Turtle-serialised RDF file.
The Data Harvester is the component that imports data from heterogeneous sources and transforms them into the data format required by the AEGIS platform. For the first prototype, the Java-based StreamSets Data Collector was employed to allow the straightforward composition of harvesting pipelines.
To publish data on the AEGIS platform, a plugin for StreamSets was developed that lets users choose AEGIS as the destination of a data pipeline. The prototype was successfully tested by retrieving data from CSV files and transferring them to the AEGIS platform: a first pipeline was implemented that harvests data from various CSV files. With the integrated processing tools of StreamSets, rich pre-processing and aggregation of several data sources can be performed, and complex transformation tasks can be handled by JavaScript or Python scripts within the StreamSets web application interface. StreamSets does not offer a convenient feature for scheduling harvesting pipelines, but it does provide a REST API for starting, stopping and monitoring them. To overcome this limitation and allow better integration into the AEGIS platform, the Java-based EDP Metadata Transformer Service was adopted to schedule and manage StreamSets pipelines.
The tool lets users add REST endpoints of StreamSets and configure a scheduling plan for their execution. It also offers a REST API of its own, allowing immediate integration of its functionality into the AEGIS platform.
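The scheduling idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the EDP Metadata Transformer Service itself: the `ScheduledPipeline` class, the interval-based plan and the example host are hypothetical, and the endpoint path is modelled on the StreamSets Data Collector REST API pattern (`/rest/v1/pipeline/<id>/start`), which should be checked against the deployed version.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScheduledPipeline:
    """Hypothetical scheduling entry for one StreamSets pipeline."""
    base_url: str                       # e.g. "http://sdc.example.org:18630" (assumed)
    pipeline_id: str
    interval: timedelta                 # run the pipeline once per interval
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # A pipeline that has never run is due immediately.
        return self.last_run is None or now - self.last_run >= self.interval

    def action_url(self, action: str) -> str:
        # Path pattern assumed from the Data Collector REST API.
        if action not in ("start", "stop", "status"):
            raise ValueError(f"unsupported action: {action}")
        base = self.base_url.rstrip("/")
        return f"{base}/rest/v1/pipeline/{self.pipeline_id}/{action}"


def due_start_urls(plan: List[ScheduledPipeline], now: datetime) -> List[str]:
    """Return the start URLs of all due pipelines and mark them as run."""
    urls = []
    for entry in plan:
        if entry.is_due(now):
            urls.append(entry.action_url("start"))
            entry.last_run = now
    return urls
```

A real service would send an HTTP POST to each returned URL and then poll the corresponding status URL. Keeping the "what is due" decision separate from the HTTP call makes the plan easy to test without a running StreamSets instance.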
The annotator is the component in the AEGIS platform that is responsible for interactively equipping input data with suitable metadata. It is not developed yet, but its functionality is already outlined in an interactive mockup. The annotator will be an interactive metadata editor with a graphical user interface. It will be based on DCAT-AP and the AEGIS ontology and tightly interact with all components of the AEGIS data value chain. Within the data value chain, the annotator will be available once a dataset is present in the AEGIS platform.
The services and tools described above will be integrated into the AEGIS platform. The annotator user interface and the scheduling functionality will be directly available from the AEGIS front-end, so they will be integrated into the AngularJS front-end application. This will include communication with the respective back-end counterparts, such as the AEGIS Linked Data Store.
The AEGIS Cleansing Tool follows a two-fold approach: (a) simple data cleansing transformations applied offline through existing mature tools, and (b) more complex data cleansing (e.g. outlier detection and removal) through dedicated cleansing processes that could be implemented through the Algorithm Execution Container. Furthermore, to provide a more intuitive user experience and to leverage the computational power of the AEGIS system, certain simple data cleansing functions were made available during query creation, i.e. at the point where the user should be most confident about the data manipulation needed before further analysis. Hence, the cleansing tool for the first AEGIS prototype is incorporated in the Query Builder.
The anonymization tool is an extensible, schema-agnostic plugin for efficient, real-time data anonymization. It has so far been used offline and privately, but it can also expose the anonymized data through a secured web API. With an emphasis on performance, the tool syncs with private database servers and executes anonymization functions on datasets of various sizes with little or no overhead. Its purpose is to unlock the potential value of raw data in the system while accounting for privacy concerns and legal limitations. The tool comes with a list of predefined anonymization functions that can be used directly (e.g. city from an exact address, range of values from an integer), as well as a list of aggregation functions (e.g. average). It can also be extended with custom anonymization functions that the user defines in a Python module. The user can execute queries and test the anonymized output through the integrated console of the tool. The anonymized output can be exposed to external parties securely via access keys: users retrieve the anonymized data in JSON format through the API of the tool using their private access keys.
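To give a feel for what such functions might look like, here is a minimal sketch of the two examples mentioned above plus an average aggregate. The function names, the assumed "street, city, country" address layout, the bucket width and the `anonymize` helper are all hypothetical and are not taken from the AEGIS tool.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable, List


def city_from_address(address: str) -> str:
    """Keep only the city, assuming a 'street, city, country' layout."""
    parts = [p.strip() for p in address.split(",")]
    return parts[1] if len(parts) >= 2 else ""


def to_range(value: int, width: int = 10) -> str:
    """Replace an exact integer with the range (bucket) that contains it."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


def average(rows: Iterable[Dict[str, Any]], column: str) -> float:
    """Aggregate a numeric column into its average."""
    return mean(row[column] for row in rows)


def anonymize(rows: Iterable[Dict[str, Any]],
              rules: Dict[str, Callable[[Any], Any]]) -> List[Dict[str, Any]]:
    """Apply one function per configured column.

    Columns without a rule are dropped (a design choice of this sketch),
    so identifying fields are not passed through by accident.
    """
    return [{col: fn(row[col]) for col, fn in rules.items()} for row in rows]
```

For example, `anonymize(rows, {"address": city_from_address, "age": to_range})` keeps only a city and an age range per row and omits every other column, such as a name.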
The brokerage engine of AEGIS acts as a trusted mechanism for logging transactions over the AEGIS platform, most of which concern the usage of data assets. It is based on Hyperledger Fabric, and the initial versions of the supported models are based on the Data Policy Framework. These models are expected to be updated in the next release of the platform, and the most important descriptors are expected to be exposed through the AEGIS metadata editor, so that each data asset carries the model attributes the brokerage engine needs. When a transaction linked to the distributed ledger is executed, a block is created, and the transaction is then validated and appears in the ledger. The current implementation exposes a REST API, which will be integrated into the AEGIS core platform so that the latter can call the brokerage engine to read from and write to the blockchain infrastructure.
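The reason a ledger of this kind can be trusted is that each block is cryptographically linked to its predecessor. The sketch below is a toy hash-linked log, not Hyperledger Fabric, which additionally handles endorsement, ordering and consensus across peers. All function names and the transaction fields are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, prev_hash: str, transaction: Dict[str, Any]) -> str:
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "tx": transaction}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_transaction(ledger: List[Dict[str, Any]],
                       transaction: Dict[str, Any]) -> Dict[str, Any]:
    """Create a new block for the transaction and append it to the ledger."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    index = len(ledger)
    block = {
        "index": index,
        "prev": prev_hash,
        "tx": transaction,
        "hash": block_hash(index, prev_hash, transaction),
    }
    ledger.append(block)
    return block


def is_valid(ledger: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited transaction breaks the chain."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(ledger):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, prev_hash, block["tx"]):
            return False
        prev_hash = block["hash"]
    return True
```

Changing a recorded transaction after the fact makes `is_valid` return `False`, which is the property that lets the log serve as evidence of how a data asset was used.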
Hopsworks Integrated Services
The AEGIS platform uses Hopsworks integrated services for data management and processing. Hopsworks provides a multi-tenant project platform for secure collaborative data management between users. It offers integrated support for data-parallel processing frameworks such as MapReduce, Spark, and Flink, for interactive notebooks using Zeppelin, and for ELK stack components such as Elasticsearch. Hopsworks introduces the notions of Project, Dataset, and User. A Dataset is a directory in the underlying file system (HopsFS) that can contain any number of files and directories and can be shared between projects. A Project is a collection of datasets, notebooks, users, and jobs (Spark, Flink). A User can create multiple projects, share projects with other users, create datasets, and upload, download, or preview data in their datasets. The main components provided by Hopsworks are described below.
A new user of the platform must first register with AEGIS by providing a first name, last name, email, password and a security question/answer. Once this step is completed, an AEGIS administrator can search for the new account and validate it with the appropriate roles. Once the account is active, the user can log in to the AEGIS platform with their credentials.
Once logged in, the user can start using the AEGIS platform by first creating a project, providing the project name, a description, and an optional list of members. The user can then navigate to the project ‘AEGIS’, where a list of existing datasets is provided. Inside the project, the left bar contains the tools that users can use to interact with and process their data, such as Queries, Visualizations, and Analytics. To start working with these tools, users first create their own datasets and then upload their data into them. Inside a dataset, users can create folders, upload files, and navigate through the resulting structure. The uploaded files can then be viewed on the platform or downloaded.
Hopsworks includes support for a number of the components of the ELK stack, such as Elasticsearch, Kibana and Logstash. Of these services, the first prototype of the AEGIS platform uses Elasticsearch to provide full-text search for exploring projects and datasets. The search space available to each user depends on the visibility of the datasets. Once the visibility of a dataset is set to public, it becomes part of the search space of every user of the AEGIS platform. Datasets can also be shared directly between two projects. Thus, the search covers all projects in which the user is involved, all datasets owned by or shared with those projects, and all public datasets.
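The visibility rule can be sketched as a simple filter. This is a hedged illustration only: the `Dataset` class and its fields are hypothetical, and the plain substring match merely stands in for the full-text search that Elasticsearch performs in the actual platform.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Dataset:
    """Hypothetical, simplified view of a dataset and its visibility."""
    name: str
    owner_project: str
    public: bool = False
    shared_with: Set[str] = field(default_factory=set)  # names of other projects


def search_space(datasets: Iterable[Dataset], user_projects: Set[str]) -> List[Dataset]:
    """Datasets a user may search: public, owned by, or shared with their projects."""
    return [
        d for d in datasets
        if d.public
        or d.owner_project in user_projects
        or bool(d.shared_with & user_projects)
    ]


def search(datasets: Iterable[Dataset], user_projects: Set[str], term: str) -> List[str]:
    """Case-insensitive substring search restricted to the user's search space."""
    term = term.lower()
    return [d.name for d in search_space(datasets, user_projects)
            if term in d.name.lower()]
```

A user who belongs only to a project named `insurance` would therefore find public datasets and datasets shared with `insurance`, but not private datasets of other projects.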
AEGIS Linked Data Store
The AEGIS Linked Data Store is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation for processing the data within the platform and is based on the AEGIS ontology and vocabulary. The store is composed of two components: a triplestore and a thin management layer for convenient access to the Linked Data. For the first prototype, Apache Fuseki was deployed as the triplestore. It offers multiple standardised Linked Data interfaces, such as SPARQL and the Graph Store HTTP Protocol, which can be used by other components of the AEGIS platform, especially the Query Builder. The triplestore only offers (complex) Linked Data interfaces and no rich management functionality. Therefore, an additional metadata service is required for managing the metadata, in particular for the straightforward creation of metadata records. It interacts with the Fuseki triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata, mapping the JSON input to the Linked Data structures defined by the AEGIS ontology. The AEGIS Linked Data Store requires tight integration into the AEGIS platform, since the metadata is present throughout the entire data value chain, from providing data to visualizing it. For creating the metadata, the Annotator will be integrated into the AEGIS AngularJS front-end and will communicate with the Linked Data Store via the metadata service. Since the metadata and the actual data are stored in different services, a synchronisation mechanism will be integrated to ensure that the respective data is available for each metadata record and vice versa. This will be done by implementing hooks in the AEGIS core platform that fire events to the metadata service whenever data is modified or deleted.
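The mapping from JSON input to Linked Data could look roughly like the sketch below, which turns a JSON-style record into N-Triples lines using standard DCAT and Dublin Core terms. The JSON field names (`id`, `title`, `description`, `files`), the URI layout and the function names are assumptions for illustration; the AEGIS-specific classes and properties are not reproduced here.

```python
from typing import Any, Dict, List

# Standard vocabulary namespaces (W3C DCAT and Dublin Core Terms).
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value: str) -> str:
    """Serialise a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return f'"{escaped}"'


def dataset_to_ntriples(record: Dict[str, Any], base_uri: str) -> List[str]:
    """Map a JSON-style dataset record to N-Triples statements.

    The URI scheme (<base>/dataset/<id>, <base>/file/<id>) is hypothetical.
    """
    subject = f"<{base_uri}/dataset/{record['id']}>"
    triples = [
        f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"{subject} <{DCT}title> {literal(record['title'])} .",
    ]
    if "description" in record:
        triples.append(f"{subject} <{DCT}description> {literal(record['description'])} .")
    # Each AEGIS file corresponds to a DCAT distribution of the dataset.
    for file_id in record.get("files", []):
        triples.append(f"{subject} <{DCAT}distribution> <{base_uri}/file/{file_id}> .")
    return triples
```

The resulting lines could then be sent to the triplestore through one of its standard interfaces, for instance as the body of a Graph Store HTTP Protocol request.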
Algorithm Execution Container
The Algorithm Execution Container of AEGIS is the central computation environment where data analysis takes place; data analysts use it to run the analyses they require. Its purpose is to provide data analysts with a visual UI for running analyses without having to write them in specific execution languages, such as PySpark, R or Scala. The data analyst can thus concentrate on the analysis to be executed rather than invest resources in programming it. The Algorithm Execution Container is built on top of Zeppelin as a notebook and is connected to the backbone of the AEGIS platform, using specific data analysis and machine learning algorithms to perform the analyses. It takes as input datasets coming out of the Query Builder, or directly from HDFS, and loads a UI on top of Zeppelin. The output of each analysis can then be stored locally on HDFS or, if an algorithm produces results that are meaningful to visualise, forwarded to the Visualiser. Moreover, after each execution, the Algorithm Execution Container displays algorithm-specific metrics that can be used to assess the performance of each algorithm.
Blog post authors: GFT