Today we are very happy to announce the release of the AEGIS platform v1.
The AEGIS first integrated prototype features a low-fidelity, functional mockup version of the AEGIS platform’s backbone, providing a meaningful subset of the functionalities characterizing the devised MVP. It is delivered for early assessment by the end users. Preliminary design work was performed by producing a set of mockups showing the main functionalities of the platform, in terms of how a user will perform tasks through the AEGIS platform. For this task, the Ninjamock web tool was adopted. Ninjamock is a collaborative service for creating and sharing wireframes, allowing users to share projects through private URLs and to export wireframes in PDF format.

GitHub has been selected as the software development platform, providing all of the distributed version control and source code management functionality of Git while adding its own features (e.g. access control and several collaboration features such as bug tracking, feature requests, task management, and wikis). The AEGIS GitHub repository is available at the following URL: https://github.com/aegisbigdata.
A TESTBED machine has been provided, assigning the virtual machine the 8 cores and 60 GB necessary to efficiently run the environment. Testing accounts on the TESTBED machine have been created for the partners. An integration plan was prepared to guide the integration of the backbone infrastructure with the various services and components, focusing on a continuous integration process. At the same time, a strategy was defined for the verification and testing of the platform and of its components: the AEGIS partners responsible for each platform component perform development activities on local machines, test the component [unit testing] and release it on GitHub according to the commonly agreed release plan. The system integrator and all interested technical partners integrate the web components from the AEGIS front-end locally, testing included [integration testing]. After that, the system integrator releases the updates on the AEGIS GitHub repository, and a selected partner deploys them on the TESTBED machine. For each workflow identified in an earlier task of the project, the responsible partners test their respective workflow on the TESTBED machine [system testing]. After that, the TESTBED machine is made available to the demonstrators [acceptance testing].
Building on top of Hopsworks, a GUI/front-end has been released, following the look and feel of the AEGIS institutional web site and using HTML and AngularJS as its main technologies. When users access the platform for the first time, they can register an account and then proceed with the login. Once the user is logged into the platform, a main page shows on the right side the list of the currently available projects, as well as the button to create a new one. After a project has been selected, a new page shows the related activity history and the main menu on the left side, including the following items: Data Sets, Queries, Metadata Designer, Visualizations, Analytics and Settings. In addition, a search functionality is available in the top-right corner of the page (see Figure 1).
An overview of the main components of the platform is provided below.
Data Harvester and Annotator
The Data Harvester and Data Annotator are loosely coupled with the central AEGIS platform and, for development reasons, are not yet fully integrated in the deployment process.
Metadata Ontology
The basis of these components is the AEGIS metadata ontology and vocabulary. Its objective is to describe the available datasets in detail in order to support further processing, especially visualisations and query building. The ontology is based on the DCAT-AP specification and extended accordingly. The core concepts of DCAT-AP were mapped to the corresponding concepts and data structures of the AEGIS platform in the following way:
- DCAT-AP Catalogues → AEGIS Projects
- DCAT-AP Datasets → AEGIS Data Sets
- DCAT-AP Distributions → AEGIS Files
In addition, the focus of the first prototype is to describe the semantics of tabular data simply and effectively. Therefore, an AEGIS-specific set of Linked Data classes and properties was developed, with simplicity and reusability as the overall objectives. The first iteration of the ontology is available on GitHub as a Turtle-serialised RDF file.
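To make the mapping concrete, the following minimal sketch builds such a metadata description in Python with rdflib; the AEGIS namespace URI, resource URIs and property choices are illustrative assumptions, not the published ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# DCAT and Dublin Core terms are standard namespaces; the AEGIS
# namespace URI below is a placeholder, not the published ontology.
DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
AEGIS = Namespace("http://example.org/aegis/ontology#")  # assumed

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)
g.bind("aegis", AEGIS)

# AEGIS Project -> DCAT-AP Catalogue
project = URIRef("http://example.org/aegis/projects/demo")
g.add((project, RDF.type, DCAT.Catalog))

# AEGIS Data Set -> DCAT-AP Dataset
dataset = URIRef("http://example.org/aegis/datasets/traffic")
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCT.title, Literal("Traffic measurements")))
g.add((project, DCAT.dataset, dataset))

# AEGIS File -> DCAT-AP Distribution (a tabular CSV file)
distribution = URIRef("http://example.org/aegis/files/traffic.csv")
g.add((distribution, RDF.type, DCAT.Distribution))
g.add((distribution, DCT.format, Literal("text/csv")))
g.add((dataset, DCAT.distribution, distribution))

print(g.serialize(format="turtle"))
```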
Data Harvester
The Data Harvester is the component enabling the import of data from heterogeneous sources and their transformation to the data format required by the AEGIS platform. For the first prototype, the Java-based solution StreamSets Data Collector was adopted, allowing the straightforward composition of harvesting pipelines.
In order to publish data on the AEGIS platform, a plugin for StreamSets was developed[3], which enables users to choose AEGIS as a destination for a data pipeline. The prototype was successfully tested by retrieving data from CSV files and transferring them to the AEGIS platform. With the integrated processing tools of StreamSets, rich pre-processing and aggregation of several data sources can be performed. In addition, complex transformation tasks can be done by applying JavaScript or Python scripts within the StreamSets web application interface. A first pipeline was implemented, harvesting the data from various CSV files to the AEGIS platform. StreamSets does not offer a convenient feature for scheduling harvesting pipelines, but it does offer a REST API for starting, stopping and monitoring them. To overcome this downside and allow better integration into the AEGIS platform, the Java-based EDP Metadata Transformer Service was adopted to allow the scheduling and management of StreamSets pipelines.
The tool lets users register REST endpoints of StreamSets and configure a scheduling plan for their execution. It offers a REST API as well, allowing the immediate integration of these functionalities into the AEGIS platform.
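As an illustration of this integration, the sketch below (Python with the requests library) starts a StreamSets pipeline and polls its status over the Data Collector REST API; the host, credentials and pipeline ID are placeholders to adapt to the actual deployment.

```python
import time
import requests

# Placeholders: adjust host, credentials and pipeline ID to your setup.
SDC_URL = "http://localhost:18630/rest/v1"
AUTH = ("admin", "admin")
PIPELINE_ID = "csv-to-aegis"
HEADERS = {"X-Requested-By": "aegis-scheduler"}  # SDC expects this on POSTs

# Start the harvesting pipeline.
resp = requests.post(f"{SDC_URL}/pipeline/{PIPELINE_ID}/start",
                     auth=AUTH, headers=HEADERS)
resp.raise_for_status()

# Poll the pipeline status until it is running (or has failed).
while True:
    status = requests.get(f"{SDC_URL}/pipeline/{PIPELINE_ID}/status",
                          auth=AUTH).json().get("status")
    print("pipeline status:", status)
    if status in ("RUNNING", "RUN_ERROR", "START_ERROR"):
        break
    time.sleep(2)
```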
Annotator
The Annotator is the component of the AEGIS platform responsible for interactively equipping input data with suitable metadata. It has not been developed yet, but its functionality is already outlined in an interactive mockup. The Annotator will be an interactive metadata editor with a graphical user interface. It will be based on DCAT-AP and the AEGIS ontology and will interact tightly with all components of the AEGIS data value chain, becoming available once a dataset is present in the AEGIS platform.
The services and tools described above will be integrated into the AEGIS platform. The Annotator user interface and the scheduling functionalities will be directly available from the AEGIS front-end, and will therefore be integrated into the AngularJS front-end application. This includes communication with the respective backend counterparts, such as the AEGIS Linked Data Store.
Cleansing Tool
The AEGIS Cleansing Tool has been developed according to a two-fold approach that includes (a) simple data cleansing transformations applied offline through existing mature tools and (b) more complex data cleansing (e.g. outlier detection and removal) through dedicated cleansing processes that could be implemented through the Algorithm Execution Container. Furthermore, in order to provide a more intuitive user experience and also leverage the computational power of the AEGIS system, it was decided to make certain simple data cleansing functionalities available to the user during the data query creation process, i.e. at the point where the user is most confident about the data manipulation needed before further analysis. Hence, the cleansing tool for the first AEGIS prototype is incorporated in the Query Builder.
Anonymisation Tool
The anonymisation tool is an extensible, schema-agnostic plugin that allows efficient real-time data anonymisation. The tool has been utilised for offline, private usage, but also offers the ability to output the anonymised data through a secured web API. With emphasis on performance, the tool syncs with private database servers and executes anonymisation functions on datasets of various sizes with little or no overhead. The purpose of the anonymisation is to unlock the potential value of raw data in the system while accounting for privacy concerns and legal limitations. The tool comes with a list of predefined anonymisation functions that can be used directly (e.g. deriving the city from an exact address, or a range of values from an integer), as well as a list of aggregation functions (e.g. average). Besides, the tool can be easily extended with any new, custom anonymisation function defined by the user in a Python module, as sketched below. The user is able to execute queries and test the anonymised output through the integrated console of the tool. The anonymised output data can be exposed through an API to external parties in a secure way through the provision of access keys; users can then access the anonymised data in JSON format using their private access keys.
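As an example of such an extension, a user-defined anonymisation module could look like the following sketch; the function names and the way the tool discovers them are illustrative assumptions, not the tool's actual API.

```python
# Illustrative sketch of a user-defined anonymisation module.
# The function names and the registration mechanism are assumptions;
# consult the tool's documentation for the actual extension API.

def generalise_age(age: int, bucket_size: int = 10) -> str:
    """Replace an exact age with a range, e.g. 37 -> '30-39'."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"

def mask_email(email: str) -> str:
    """Keep only the domain of an e-mail address, e.g. 'a@b.com' -> '***@b.com'."""
    _, _, domain = email.partition("@")
    return f"***@{domain}" if domain else "***"
```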
Brokerage Engine
The brokerage engine of AEGIS acts as a trusted way to log transactions over the AEGIS platform, mostly concerning the usage of the different data assets. The brokerage engine is based on the Hyperledger Fabric implementation, and the initial versions of the models supported are based on the Data Policy Framework. These models are expected to be updated in the next release of the platform, and the most important descriptors will be exposed through the AEGIS metadata editor, so that each data asset is accompanied by the model attributes necessary for the brokerage engine to perform. Upon execution of each transaction linked to the distributed ledger, a specific block is created, and the transaction is then validated and appears in the ledger. The current implementation of the brokerage engine exposes a REST API which will be integrated into the AEGIS core platform, so that the latter can call the brokerage engine and perform any necessary writes/reads to/from the blockchain infrastructure.
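A hedged sketch of how the core platform could call such a REST API is shown below; the endpoint path and payload fields are hypothetical, since the actual interface of the brokerage engine is not detailed here.

```python
import requests

# Hypothetical endpoint and payload: the actual brokerage engine API
# (paths, field names) is not specified in this post.
BROKER_URL = "http://broker.example.org/api/v1/transactions"

transaction = {
    "asset_id": "datasets/traffic/traffic.csv",  # data asset being used
    "actor": "user@example.org",                 # who performed the action
    "action": "read",                            # usage type to be logged
}

# Write the transaction; once validated, it appears as a block in the ledger.
resp = requests.post(BROKER_URL, json=transaction)
resp.raise_for_status()
print("ledger entry id:", resp.json().get("id"))
```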
Hopsworks Integrated Services
The AEGIS platform uses Hopsworks integrated services for data management and processing. Hopsworks provides a multi-tenant project platform for secure collaborative data management between users. It provides integrated support for different data-parallel processing frameworks such as MapReduce, Spark, and Flink, interactive notebooks using Zeppelin, as well as components of the ELK stack such as Elasticsearch. Hopsworks introduces the notions of Project, Dataset, and User. A Dataset is a directory in the underlying file system (HopsFS) that can contain any number of files and directories, and it can be shared between projects. A Project is a collection of datasets, notebooks, users, and jobs (Spark, Flink). A User in Hopsworks can create multiple projects, share projects with other users, create datasets, and upload/download/preview data in his/her datasets. Below we show the main components provided by Hopsworks.
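The relationship between these notions can be summarised with the small illustrative model below (plain Python, not actual Hopsworks code):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative model of the Hopsworks concepts, not Hopsworks code.

@dataclass
class Dataset:
    name: str                                              # a directory in HopsFS
    shared_with: List[str] = field(default_factory=list)   # other projects

@dataclass
class Project:
    name: str
    members: List[str] = field(default_factory=list)       # user e-mails
    datasets: List[Dataset] = field(default_factory=list)

# A user can own several projects and share datasets between them.
traffic = Dataset(name="traffic", shared_with=["SmartCity"])
project = Project(name="AEGIS", members=["analyst@example.org"],
                  datasets=[traffic])
```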
Users API
A new user of the platform is required to first register with AEGIS by providing: first name, last name, email, password and a security question/answer. Once this step is performed, an AEGIS administrator can search and validate the new account with the appropriate roles. Once the account is active, the user can log in with his/her credentials on the AEGIS platform.
Once logged in, the user can start using the AEGIS platform by first creating a project, providing the project name, a description, and an optional list of members. Then, the user can navigate to the project ‘AEGIS’, where a list of existing datasets is provided. Inside the project, the left bar offers tools that users can use to interact with and process their data, such as Queries, Visualizations, and Analytics. To start interacting with these tools, users first create their own datasets and then upload their data inside. Inside a dataset, users can create folders, upload files, and navigate through the created structure. The uploaded files can then be previewed on the platform or downloaded.
Hopsworks includes support for a number of components of the ELK stack, such as Elasticsearch, Kibana and Logstash. Of these services, the first prototype of the AEGIS platform utilizes the Elasticsearch component to provide full-text search capabilities for exploring projects and datasets. The search space available to each user depends on the visibility of the datasets. Once the visibility of a dataset is set to public, it becomes part of the search space of any user of the AEGIS platform. Datasets can also be shared directly between two projects. Thus, the search space of a user covers all projects in which the user is involved, all datasets owned by or shared with those projects, and all public datasets.
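As an illustration of the kind of query behind this search functionality, the sketch below talks to Elasticsearch directly; the index and field names are assumptions, as the actual Hopsworks mapping may differ.

```python
import requests

# Assumed index and field names; the actual Hopsworks/Elasticsearch
# mapping may differ.
ES_URL = "http://localhost:9200/datasets/_search"

# Full-text search over dataset names and descriptions.
query = {
    "query": {
        "multi_match": {
            "query": "traffic",
            "fields": ["name", "description"],
        }
    }
}

hits = requests.get(ES_URL, json=query).json()["hits"]["hits"]
for hit in hits:
    print(hit["_source"].get("name"))
```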
AEGIS Linked Data Store
The AEGIS Linked Data Store is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation of the processing of the data within the AEGIS platform and is based on the AEGIS ontology and vocabulary. The store is composed of two components: a Triplestore and a thin management layer for convenient access to the Linked Data. For the first prototype, Apache Fuseki was deployed as the Triplestore. It offers multiple standardised Linked Data interfaces, like SPARQL or the Graph Store HTTP Protocol. These interfaces can be utilised by other components of the AEGIS platform, especially the Query Builder. The Triplestore only offers (complex) Linked Data interfaces and no rich management functionalities. Therefore, an additional service is required, providing additional functionalities for the management of the metadata, particularly the straightforward creation of metadata sets. It interacts with the Fuseki Triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata, mapping the JSON input to the Linked Data structures defined by the AEGIS ontology.

The AEGIS Linked Data Store requires tight integration into the AEGIS platform, since the metadata is present throughout the entire data value chain, from providing to visualizing data. For creating the metadata, the Annotator will be integrated into the AEGIS AngularJS front-end and will communicate with the Linked Data Store via the metadata service. Since the metadata and the actual data are stored in different services, a synchronisation mechanism will be integrated, ensuring that for each metadata record the respective data is available and vice versa. This will be done by implementing hooks into the AEGIS core platform, which fire events to the metadata service whenever data is modified or deleted.
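For instance, other components can query the metadata over Fuseki's standard SPARQL interface, as in the sketch below; the endpoint URL is a placeholder that depends on the deployment.

```python
import requests

# Placeholder: Fuseki host and dataset name depend on the deployment.
FUSEKI_ENDPOINT = "http://localhost:3030/aegis/sparql"

# List all DCAT distributions (AEGIS files) with their formats.
query = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?file ?format WHERE {
  ?file a dcat:Distribution ;
        dct:format ?format .
}
"""

resp = requests.get(FUSEKI_ENDPOINT, params={"query": query},
                    headers={"Accept": "application/sparql-results+json"})
for binding in resp.json()["results"]["bindings"]:
    print(binding["file"]["value"], binding["format"]["value"])
```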
Query Builder
The Query Builder provides the capability to interactively define and execute queries on data available in the AEGIS system. It is primarily addressed to AEGIS users with a limited technical background, but is potentially useful for all, as it simplifies and accelerates the process of retrieving data and creating views on them. The Query Builder is implemented in the form of a note inside the Apache Zeppelin notebook, using mainly PySpark for the data manipulation and JavaScript and AngularJS for the user interface. The tool is directly accessible inside every newly created project. The Query Builder leverages the metadata available for each file in the AEGIS system in order to provide its enhanced data selection and manipulation capabilities, i.e. enabling/disabling certain data merging and filtering options according to the data schema, and also allowing the user to perform more targeted dataset exploration and retrieval based on the available metadata, through SPARQL queries performed in the background.

Upon opening the Query Builder, the user can select one of the available CSV files. The selected file is loaded, and a preview is directly available. At all times, the user can have up to two different datasets active in the Query Builder: the temporary (temp) dataset and the master dataset. The temporary dataset is the one currently being used and changed, whereas the master dataset serves as a “storage point” for intermediate results while the user is processing data, and as the final result once all data manipulation is over and the user is satisfied with the outcome. A number of filters and data processing methods are available for the user to select and apply on the temporary dataset. Indicatively, the user can fill in null values, filter out entries based on values, rename columns, replace values, select/drop columns etc. The user may also merge or append the temp dataset with the master dataset. A list of selected data manipulation actions, either already applied or pending application, is displayed, and a preview of the data processing result is always available. When the result of a series of data processing tasks on the temp dataset is satisfactory, it can be saved as the master dataset.

The user may continue processing the same temporary dataset or open a new one or, when the query creation process is complete, save the master dataset as a new CSV file or export the query and continue to directly change the generated code. This code can be leveraged (a) by the advanced user as an easily acquired starting point to further elaborate on for more complex queries and (b) by the less technically skilled user as a means to understand the underlying code and facilitate learning. Finally, the result of the data manipulation, i.e. the master dataset, can be directly passed as input to higher-level AEGIS tools, like the Visualiser and the Algorithm Execution Container.
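The code generated and exported by the Query Builder follows a pattern along the lines of the PySpark sketch below; file paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("query-builder-sketch").getOrCreate()

# Load the selected CSV file as the temporary (temp) dataset.
temp = spark.read.csv("hdfs:///Projects/AEGIS/traffic/traffic.csv",
                      header=True, inferSchema=True)

# Apply a few of the available manipulations: fill nulls, filter rows,
# rename a column, drop a column.
temp = (temp
        .fillna({"speed": 0})
        .filter(F.col("speed") > 0)
        .withColumnRenamed("ts", "timestamp")
        .drop("sensor_internal_id"))

# Save the intermediate result as the master dataset, then export it
# as a new CSV file.
master = temp
master.write.mode("overwrite").csv("hdfs:///Projects/AEGIS/traffic/result",
                                   header=True)
```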
Visualiser
The AEGIS Visualiser was designed and implemented with a two-fold purpose: (1) to provide visualisations for the analysis results generated by the Algorithm Execution Container, and (2) to provide visualisations for the results generated by the queries composed and executed with the Query Builder. The first prototype is based on Apache Zeppelin, offering the interactive web-based notebook functionality in the AEGIS platform. In addition to Zeppelin, two JavaScript libraries were used, namely plotly.js and Highcharts, two state-of-the-art open-source charting libraries offering a large variety of interactive charts and visualisations. The execution workflow of the Visualiser component is delivered as a Zeppelin notebook that will be accessible through the AEGIS front-end. In detail, the Visualiser prototype currently supports two main cases. In the first case, a predefined list of visualisations is available for a list of selected datasets: upon selecting the desired dataset, a preview of the dataset in tabular format is presented, and the user can select the preferred visualisation from the list of visualisations available for the chosen dataset. In the second case, the user is first presented with the list of available datasets of his/her project. Upon selecting the desired dataset, a preview in tabular format is presented, along with the list of available visualisation types. Once the desired visualisation type is selected, the user can set the parameters of the visualisation according to the type selected. Finally, the visualisation is generated and presented to the user.
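As an illustration of the kind of chart the Visualiser produces, the sketch below uses the plotly Python bindings (the prototype itself uses plotly.js and Highcharts from JavaScript); the data and parameters are placeholders standing in for the user's selections.

```python
import pandas as pd
import plotly.graph_objects as go

# Placeholder data standing in for a selected AEGIS dataset preview.
df = pd.DataFrame({"hour": range(24),
                   "vehicles": [100 + 5 * h for h in range(24)]})

# Chart type and axes correspond to the parameters the user sets
# once a visualisation type has been chosen.
fig = go.Figure(go.Scatter(x=df["hour"], y=df["vehicles"], mode="lines"))
fig.update_layout(title="Traffic volume per hour",
                  xaxis_title="hour", yaxis_title="vehicles")
fig.show()
```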
Algorithm Execution Container
The Algorithm Execution Container of AEGIS is the central computation environment where data analysis takes place; it is used by data analysts to run the required analyses. The scope of the Algorithm Execution Container is to provide data analysts with a visual UI that helps them run analyses without requiring them to write the analyses in specific execution languages, such as PySpark, R or Scala. As such, the data analyst can concentrate on the analysis to be executed and not invest resources in the functional programming of the analysis. The Algorithm Execution Container environment is built on top of Zeppelin as a notebook and is interconnected with the backbone of the AEGIS platform, utilising specific data analysis and machine learning algorithms to perform the analyses. In this context, it takes as input datasets coming out of the Query Builder, or directly from HDFS, and loads a UI on top of Zeppelin. The outputs of each analysis can then be stored locally on HDFS or, in case an algorithm produces results that are meaningful to visualise, forwarded to the Visualiser. Moreover, after each execution, the Algorithm Execution Container shows some algorithm-specific metrics which can be used to assess the performance of each algorithm.
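A minimal sketch of the kind of analysis the container runs on behalf of the user is shown below (PySpark MLlib, here with k-means clustering as an example); paths, columns and the chosen algorithm are illustrative.

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aec-sketch").getOrCreate()

# Input: a dataset produced by the Query Builder, or read directly from HDFS.
df = spark.read.csv("hdfs:///Projects/AEGIS/traffic/result",
                    header=True, inferSchema=True)

# Assemble the feature columns the user selected in the UI.
features = VectorAssembler(inputCols=["speed", "vehicles"],
                           outputCol="features").transform(df)

# Run the selected algorithm (here: k-means clustering).
predictions = KMeans(k=3, seed=1).fit(features).transform(features)

# An algorithm-specific metric shown after execution.
print("silhouette:", ClusteringEvaluator().evaluate(predictions))

# Output: stored on HDFS, or forwarded to the Visualiser.
predictions.drop("features").write.mode("overwrite").csv(
    "hdfs:///Projects/AEGIS/traffic/clusters", header=True)
```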
Blog post authors: GFT