The conceptual architecture of the AEGIS platform is the outcome of analysing the technical requirements collected by the consortium following the Agile methodology and mapping those requirements to functionalities of the platform's components. These components formed the holistic AEGIS architecture for the first version of the platform, ensuring that every requirement was addressed by a single component or by a set of components collaboratively. As the project evolved, the consortium further analysed the conceptual architecture along with the design of the various components and introduced the necessary updates and refinements, aiming to provide a more solid and concrete architecture that ensures the identified technical requirements are covered. The conceptual architecture is designed in a modular way and comprises a list of key components.
The following paragraphs contain the updated designs and technical specifications for each of the key components of the architecture.
The Data Harvester and the Annotator are two tightly coupled components composing the AEGIS Harvester, the component that facilitates the import of data and metadata from a variety of heterogeneous sources and also undertakes the necessary transformations towards the required data format and structure. Within the AEGIS Harvester, several logical subcomponents, which can also be seen as self-contained resources represented as API endpoints, are orchestrated into an end-to-end process of importing data and metadata into the AEGIS platform. In particular, the AEGIS Harvester is composed of the Repository, Annotation, Transformation, Harvester and Run resources, each of which undertakes a specific responsibility within the process. The foundation of the AEGIS Harvester is the EDP Metadata Transformer Service (EMTS), an open source solution for harvesting metadata from diverse Open Data sources, which is based on technologies such as Java EE, PicketLink, Quartz Scheduler, Apache Jena, JavaServer Faces, Bootstrap 3 and WildFly. In addition to the EMTS, the open source Java-based StreamSets Data Collector (SDC) is being explored in order to facilitate the building of data flows from arbitrary data sources to arbitrary destinations.
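The orchestration of the Harvester's resources can be illustrated with a minimal sketch. The resource names (Repository, Annotation, Transformation, Run) come from the architecture description above, but the function signatures, the sample payload and the metadata fields are illustrative assumptions, not the actual EMTS or SDC API:

```python
# Sketch of an end-to-end Harvester run: Repository -> Annotation -> Transformation.
# All data structures here are invented for illustration.

def repository(source_url):
    """Register the remote source and return a repository record."""
    return {"source": source_url, "datasets": [{"id": "d1", "raw": "1;2;3"}]}

def annotation(dataset):
    """Attach (hypothetical) DCAT-AP style metadata to a dataset."""
    dataset["metadata"] = {"dct:identifier": dataset["id"], "dct:format": "CSV"}
    return dataset

def transformation(dataset):
    """Transform the raw payload into the target structure."""
    dataset["records"] = [int(v) for v in dataset["raw"].split(";")]
    return dataset

def run(source_url):
    """The Run resource drives the whole import process for one source."""
    repo = repository(source_url)
    return [transformation(annotation(d)) for d in repo["datasets"]]

imported = run("https://example.org/opendata")
print(imported[0]["records"])  # -> [1, 2, 3]
```

The point of the sketch is the separation of responsibilities: each resource performs one step, and the Run resource composes them into a single import flow.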
The AEGIS Cleansing Tool is an optional component that resides where the data are located before they become available in the AEGIS platform. The Cleansing Tool follows a two-fold approach: a) simple data pre-processing, applying predefined transformation rules for data cleaning via a set of offline tools, such as TRIFACTA Wrangler, sampleclean and, for small-scale transformations, OpenRefine and StreamSets Data Collector, offered to the platform's users prior to importing their data to the platform; b) more advanced processes, such as the detection and elimination of outliers from a dataset or data imputation, through dedicated cleansing processes available within the Algorithm Execution Container, leveraging the AEGIS processing and analytical capabilities to code a fully customised data cleansing process via the integrated Jupyter and Apache Zeppelin notebook services.
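As an example of the "advanced" side of this approach, the following sketch shows outlier elimination as it could be coded in a Jupyter or Zeppelin notebook cell on the platform. The inter-quartile-range rule used here is a common generic choice, not a documented AEGIS default:

```python
# Outlier elimination via the inter-quartile range (IQR) rule:
# values further than k * IQR outside [Q1, Q3] are dropped.
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Return the values with IQR-rule outliers removed."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

readings = [10.1, 10.4, 9.9, 10.2, 97.0, 10.0]  # 97.0 is a spurious spike
print(remove_outliers_iqr(readings))
```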
The Anonymisation tool also resides where the data to be anonymised are located and is external to the core AEGIS platform. Its purpose is to ensure that potentially sensitive (personal or corporate) data are anonymised on the company premises before they are uploaded to the AEGIS platform, thus eliminating possible vulnerability risks and safeguarding that information which should not be disclosed to third parties remains safe. The Anonymisation tool provides connectivity to various local data sources, such as PostgreSQL, MySQL and CSV files, offers a variety of anonymisation alternatives (generalisation, k-anonymity, pseudonymity) according to the user's intended usage of the anonymised dataset and its data schema, and supports exporting the produced anonymised data as files and as RESTful services. The foundation of the Anonymisation tool will be the Anonymiser, an anonymisation and persona-building tool developed in the context of the European project CloudTeams. This tool is written in Python using the Django web framework and currently performs a type of generalisation towards the achievement of k-anonymity, as well as a pseudonymity functionality. The tool will receive the necessary updates and refinements in order to support the envisioned functionalities and requirements described above. Moreover, other open-source solutions, such as ARX, may be explored as complementary tools.
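The generalisation-for-k-anonymity behaviour can be sketched briefly: quasi-identifier values are coarsened (for instance, ages into bands and postcodes truncated) until every combination of quasi-identifiers occurs at least k times. The record layout, band width and truncation rule below are assumptions made for the example, not the Anonymiser's actual configuration:

```python
# Toy generalisation step plus a k-anonymity check over the result.
from collections import Counter

def generalise(record):
    """Coarsen the quasi-identifiers: age into 10-year bands, postcode prefix only."""
    band = (record["age"] // 10) * 10
    return {"age": f"{band}-{band + 9}", "postcode": record["postcode"][:2] + "**"}

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination appears at least k times."""
    counts = Counter((r["age"], r["postcode"]) for r in records)
    return all(c >= k for c in counts.values())

raw = [
    {"age": 31, "postcode": "1051"},
    {"age": 34, "postcode": "1062"},
    {"age": 38, "postcode": "1099"},
]
anonymised = [generalise(r) for r in raw]
print(anonymised, is_k_anonymous(anonymised, k=3))
```

After generalisation all three records share the combination ("30-39", "10**"), so the small sample is 3-anonymous; a real tool would iterate the coarsening until the target k is reached.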
The AEGIS Brokerage Engine is the component responsible for applying the read/execution permission policies for the various artefacts of the AEGIS platform, as described in the Data Policy Framework (DPF), and for recording the actions performed in a distributed ledger supported by a blockchain implementation. The recorded transactions include uploading a dataset or algorithm, executing an analysis or visualisation, downloading or copying/replicating a dataset, algorithm or result of an analysis or visualisation, and finally the usage of a dataset or algorithm for conducting an analysis. The Brokerage Engine utilises the data policy metadata provided by the Annotator in order to perform the policy checks. The Brokerage Engine consists of two brokers, the Artefact Policy Broker and the Artefact Business Broker. The Artefact Policy Broker monitors the transactions performed in the AEGIS Data Store, performs the necessary checks in accordance with the relevant data policies of each artefact and provides the information on the allowed activities for each artefact. If a transaction is allowed, the Artefact Business Broker takes over and records the transaction in the distributed ledger. The Brokerage Engine is based on a hybrid approach where Python is used to implement part of the DPF in conjunction with Hyperledger Fabric for the blockchain part of the engine. The implementation will be supported by the API methods provided by Hopsworks in order to monitor the related events and store them in the ledger along with the data policy metadata of each artefact.
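The two-step flow of the two brokers can be sketched in Python. The policy table, the field names and the simple hash-chained list standing in for the Hyperledger Fabric ledger are illustrative assumptions; they only model the check-then-record pattern:

```python
# Artefact Policy Broker: checks a transaction against policy metadata.
# Artefact Business Broker: records allowed transactions in an append-only,
# hash-chained log (a stand-in for the Hyperledger Fabric ledger).
import hashlib
import json

POLICIES = {"dataset-42": {"read": {"alice", "bob"}, "execute": {"alice"}}}
ledger = []  # each entry links to the previous one via its hash

def policy_broker_allows(user, action, artefact):
    """Check the action against the artefact's data policy metadata."""
    return user in POLICIES.get(artefact, {}).get(action, set())

def business_broker_record(user, action, artefact):
    """Append the transaction to the hash-chained log."""
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {"user": user, "action": action, "artefact": artefact, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    ledger.append(entry)

def broker(user, action, artefact):
    """Policy check first; only allowed transactions reach the ledger."""
    if not policy_broker_allows(user, action, artefact):
        return False
    business_broker_record(user, action, artefact)
    return True

print(broker("alice", "execute", "dataset-42"))  # True: allowed and recorded
print(broker("bob", "execute", "dataset-42"))    # False: denied, not recorded
```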
The AEGIS Data Store is the storage provider of the AEGIS platform, responsible for storing the data collected and curated by the AEGIS Harvester. Due to the Big Data nature of the platform, the optimal solution is a fast, reliable, scalable distributed file system, and consequently HopsFS was adopted. HopsFS is a new distribution of the Hadoop Distributed File System (HDFS) that overcomes the HDFS metadata limitation with a distributed metadata service built on a MySQL Cluster database; by adopting an architecture of multiple stateless namenodes, it enables clusters an order of magnitude larger and with higher throughput compared to HDFS. HopsFS clients provide better load balancing between the namenodes and an extended set of API methods. HopsFS provides a RESTful API, named WebHDFS, with a variety of methods for manipulating the namespace.
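The WebHDFS operations below (LISTSTATUS, MKDIRS, OPEN) are part of the standard WebHDFS REST API that HopsFS exposes; the host, port and project paths are placeholders. The sketch only constructs the request URLs, since issuing them requires a running cluster:

```python
# Building WebHDFS request URLs against a (placeholder) HopsFS namenode.
BASE = "http://namenode.example.org:50070/webhdfs/v1"

def webhdfs_url(path, op, **params):
    """Compose a WebHDFS URL: /webhdfs/v1/<path>?op=<OP>&<extra params>."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"{BASE}{path}?{query}"

# GET: list a directory in the Data Store
print(webhdfs_url("/Projects/aegis/demo", "LISTSTATUS"))
# PUT: create a directory with explicit permissions
print(webhdfs_url("/Projects/aegis/demo/raw", "MKDIRS", permission=755))
# GET: read a file's contents
print(webhdfs_url("/Projects/aegis/demo/raw/data.csv", "OPEN"))
```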
In addition to HopsFS, the AEGIS Linked Data Store provides the metadata storage capabilities for the datasets imported into the AEGIS platform. The AEGIS Linked Data Store is developed as an integrated service of the AEGIS Data Store. For each dataset, the corresponding metadata, containing detailed information about the semantics and the syntax of the data itself, is stored. This metadata is the foundation of the data processing functionalities of the AEGIS platform. The AEGIS ontology and vocabulary, which is based upon the DCAT-AP specifications, is utilised for the metadata of each dataset, and the metadata are stored as Linked Data. The basis for the AEGIS Linked Data Store is the Apache Jena Fuseki triplestore, and access to the SPARQL endpoints of Fuseki is offered through a RESTful API by a lightweight Java application based on Apache Jena.
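A client-side sketch of querying such a SPARQL endpoint for DCAT-AP dataset metadata is shown below. The Fuseki endpoint URL and dataset name are placeholders; the query itself uses the standard dcat/dct vocabularies that DCAT-AP builds on, and only the request URL is constructed here (executing it needs a live endpoint):

```python
# Composing a SPARQL query request against a (placeholder) Fuseki endpoint.
import urllib.parse

FUSEKI_ENDPOINT = "http://linkeddata.example.org:3030/aegis/sparql"

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title ?format WHERE {
  ?dataset a dcat:Dataset ;
           dct:title ?title .
  OPTIONAL { ?dataset dcat:distribution/dct:format ?format }
}
LIMIT 10
"""

# SPARQL endpoints accept the query as a 'query' parameter over GET
# (or as a form body over POST for long queries).
request_url = FUSEKI_ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
print(request_url[:80], "...")
```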
The AEGIS platform incorporates the Hopsworks Integrated Services, which encapsulate a set of services providing the resource management of the platform, as well as its data management and processing services. The resource management service, HopsYARN, undertakes the resource management of the cluster via a distributed stateless mechanism that safeguards against cluster downtime, in addition to efficient resource management with consistent operations, security and data governance tools. Running on top of HopsFS and HopsYARN, the multi-tenant data management platform Hopsworks offers an integrated view over the files and services, while also providing an intuitive design supporting the addition of extended attributes to files, directories and datasets that enables their search within the whole cluster, a project or a dataset. Hopsworks offers integrated support for different data-parallel processing services such as Spark, Flink and MapReduce, as well as a scalable messaging bus with Kafka and interactive notebooks with Zeppelin and Jupyter. Moreover, Hopsworks offers a user management service (UserManagement), a data transfer and search service (Dela) and a monitoring service (KMON).
The Visualiser is the component that provides the advanced visualisation capabilities of the AEGIS platform for the output of the analysis results originating from the Algorithm Execution Container or the query processing results originating from the Query Builder. The Visualiser is composed of several logical subcomponents implemented as separate services, namely the Recommender, the Selector, the Configurator, the Generator, the Dashboard Compiler and the Dashboard Publisher. Each service supports a separate step of the visualisation process, with the aim of providing advanced and intuitive visualisations to the user. The Visualiser is also developed as a widget inside the Apache Zeppelin notebook. It utilises the API provided by Zeppelin in order to supply the corresponding configuration to the Zeppelin core engine or interpreters, and also leverages the out-of-the-box visualisations offered by Zeppelin, such as tabular data, bar chart, area chart, line chart, pie chart and scatter plot. Moreover, the Visualiser offers a large variety of advanced interactive visualisations using the Python libraries matplotlib, ggplot and plotly.
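As a minimal example of the matplotlib side of this, the snippet below renders a bar chart of hypothetical analysis output headlessly to a PNG, much as a notebook widget would render a result; the labels and values are invented for the example:

```python
# Render a simple bar chart of (hypothetical) analysis results to a PNG.
import matplotlib
matplotlib.use("Agg")  # headless backend: no display required
import matplotlib.pyplot as plt

labels = ["Q1", "Q2", "Q3", "Q4"]
values = [12, 17, 9, 21]  # illustrative analysis output

fig, ax = plt.subplots()
ax.bar(labels, values)
ax.set_title("Analysis results by quarter")
ax.set_ylabel("Count")
fig.savefig("analysis_results.png")
plt.close(fig)
print("wrote analysis_results.png")
```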
The Algorithm Execution Container will enable the execution of analyses over selected data within the AEGIS platform. The Algorithm Execution Container is an upper layer over the algorithm execution methods and provides extra functionalities to both novice and non-expert users. It offers an algorithm selection template, supplemented with a basic presentation of information for each algorithm available in the platform. Moreover, it offers algorithm parameterisation with a list of specified parameters for each algorithm, enabling users to better understand and fine-tune their analyses. The results of the analyses will be presented to the user and will be offered as input to the Visualiser component. This component is also developed as a widget inside the Apache Zeppelin notebook and is based on Python. The Apache Zeppelin notebook will be extended with an easy-to-use algorithm library, containing a list of preconfigured algorithms, each with an exposed list of selected configurable parameters that will be provided to the users.
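The "preconfigured algorithms with exposed parameters" idea can be sketched as a small registry. The registry structure, the k-means-style entry and its defaults are illustrative assumptions; the actual library ships inside the extended Zeppelin notebook:

```python
# A registry of preconfigured algorithms: each entry carries defaults plus
# the subset of parameters exposed to the user for fine-tuning.
ALGORITHM_LIBRARY = {
    "kmeans": {
        "description": "Partition data points into k clusters.",
        "defaults": {"k": 3, "max_iter": 100, "tolerance": 1e-4},
        "exposed": ["k", "max_iter"],  # parameters the user may adjust
    },
}

def configure(name, **user_params):
    """Merge user parameters with defaults, rejecting non-exposed ones."""
    entry = ALGORITHM_LIBRARY[name]
    hidden = set(user_params) - set(entry["exposed"])
    if hidden:
        raise ValueError(f"parameters not exposed for {name}: {sorted(hidden)}")
    return {**entry["defaults"], **user_params}

config = configure("kmeans", k=5)
print(config)  # -> {'k': 5, 'max_iter': 100, 'tolerance': 0.0001}
```

Keeping an explicit "exposed" list is what lets the container show non-expert users a curated parameter form while hiding internals such as convergence tolerances.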
The AEGIS Orchestrator undertakes the responsibility of integrating the various services and components of the AEGIS platform, specifically the services included within the Hopsworks Integrated Services with the rest of the services of the platform. The AEGIS Orchestrator builds the automated and managed flow of information between the various components and services by leveraging their exposed APIs. It enables “data pipes” of information from different sources and manages the rules on how that information is transmitted and modified. The AEGIS Orchestrator also provides a list of basic conversions in order to facilitate the integration of the components and services. The AEGIS Orchestrator is based on Java 8 and the Spring Framework. Additionally, the Java template engine Thymeleaf is being explored.
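The "data pipe" pattern can be sketched as a chain of conversions between component payload formats. The two payload shapes below are invented for the example (the real Orchestrator is Java/Spring based), but the structure mirrors the description above: a message from one component's API passes through basic conversions into the format another component expects:

```python
# A data pipe: source payload -> conversion(s) -> target payload.

def from_harvester(msg):
    """Conversion: harvester-style payload -> neutral record (shapes assumed)."""
    return {"id": msg["dataset_id"], "rows": msg["payload"]}

def to_visualiser(record):
    """Conversion: neutral record -> visualiser-style payload (shape assumed)."""
    return {"series": record["rows"], "label": record["id"]}

def pipe(message, *conversions):
    """Apply each conversion in order, modelling one managed data pipe."""
    for convert in conversions:
        message = convert(message)
    return message

out = pipe({"dataset_id": "d1", "payload": [1, 2, 3]},
           from_harvester, to_visualiser)
print(out)  # -> {'series': [1, 2, 3], 'label': 'd1'}
```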
The AEGIS Front-End provides the AEGIS platform users with a user-friendly graphical interface to navigate easily through the platform while exploiting its services. It is the upper layer of the whole AEGIS architecture. The AEGIS Front-End accommodates the platform's web components that require interaction with the user (e.g. Harvester, Query Builder, Visualiser, Orchestrator) by providing a common user interface with direct links to the individual web components and their graphical interfaces. The AEGIS Front-End is built on top of Hopsworks and is based on the AngularJS framework, JavaServer Faces (JSF) and PrimeFaces, a component suite for JSF.
Blog post authors: UBITECH