AEGIS Releases Platform v2
https://www.aegis-bigdata.eu/aegis-releases-platform-v2/ – Thu, 30 Aug 2018

The second release of the AEGIS integrated prototype features a partially functional, high-fidelity software prototype connected to a deployed version of the platform, providing an initial interface that includes the basic UI/UX for the users of the platform.

The GUI/Front-end has been substantially improved according to the initial feedback received, and remains based upon the AEGIS institutional web site [1]. When users access the platform for the first time, they can register an account and then proceed with the login. Once the user is logged into the platform, a main page shows two tabs: one for the Assets and one for the Public Datasets. The right side shows the list of the user's projects, together with a button for creating a new one. After a project has been selected, a new page shows the related activity history and the main menu on the left side, including the following items: Get Started, Assets, Project Datasets, Query Builder, Analytics, Visualiser, Jupyter, Zeppelin, Kafka, Jobs, Metadata Designer, Settings, Members, Project Metadata. In addition, a search functionality is available in the top-right corner of the page (see Figure 1).

Figure 1 – AEGIS integrated prototype GUI

An overview of the main components of the platform is provided below.

Data Harvester and Annotator

These are two tightly coupled components that facilitate the import of data and metadata from a variety of heterogeneous sources, also undertaking the necessary transformation actions towards the required data format and structure. After the initial testing, the StreamSets platform (see https://www.aegis-bigdata.eu/releasing-the-aegis-platform-v1/) proved not to be flexible enough for generic use within the current ecosystem. Thus, despite being a powerful tool, StreamSets was dropped in favour of extending software previously written by Fraunhofer FOKUS.

Data Harvester Microservices

To tackle the problems mentioned above, a different approach was chosen. Instead of using third-party software, the Java-based EDP Metadata Transformer Service (EMTS) was split up into its components, transforming the monolithic application into a microservice architecture consisting of the following four basic services:

  • Importer – implements all functionality for retrieving data from a specific data source
  • Transformer – converts the retrieved data from an importer into the target format of the AEGIS platform
  • Aggregator – collects converted data from a transformer over a configurable time interval
  • Exporter – uploads transformed and/or aggregated data to the AEGIS platform

To name a concrete example: weather data may be imported hourly from an external service as JSON, which is then transformed into CSV, with values converted into the metric system. The results are then aggregated for 24 hours and the final file is exported to an analytics platform, as sketched below.
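
To make this flow tangible, the following is a minimal sketch of the four-stage pipeline logic in Python. The actual services are independent Java/Vert.x microservices communicating over REST; the URLs, field names and unit conversion below are purely illustrative assumptions.

    import csv, io, json, urllib.request

    def import_hourly(url):
        # Importer: retrieve raw weather data as JSON from an external service
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read())

    def transform(record):
        # Transformer: convert to the target format, here Fahrenheit -> Celsius
        return {"station": record["station"],
                "celsius": round((record["temp_f"] - 32) * 5 / 9, 1)}

    def aggregate(records):
        # Aggregator: collect converted records over a configurable interval
        # (e.g. the 24 hourly readings of one day) into a single CSV file
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["station", "celsius"])
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()

    def export(csv_data, target_url):
        # Exporter: upload the aggregated file to the analytics platform
        req = urllib.request.Request(target_url, data=csv_data.encode(),
                                     headers={"Content-Type": "text/csv"})
        urllib.request.urlopen(req)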

Data Harvester Technology

In order to reuse as much code as possible, the microservices introduced in the previous section have also been written in Java. To aid the creation of each service, the Eclipse Vert.x framework [2] has been chosen. Apart from propagating an asynchronous programming paradigm that ensures high availability and throughput, Eclipse Vert.x also provides a lot of built-in support for managing microservices. The final services are deployed via containerisation, in which each service is isolated from the host and from other services' operating systems. The services are thereby packaged separately as images. The technology used for building these images is called Docker [3]. Aside from enhancing security, this method also ensures platform independence, which means that these images (and consequently the services in question) will run on any infrastructure providing a Docker runtime. To handle the frictionless interaction between the various microservices, their APIs have been defined using the OpenAPI 3 [4] specification, which allows the precise definition of RESTful interfaces in a commonly understood format. The new approach for implementing the Data Harvester requires a specialised frontend, which allows the orchestration, configuration and execution of a specific harvesting process (pipe). This frontend is to be developed for the next release of the AEGIS platform.

Data Annotator

The Data Annotator (Metadata) was integrated into the AEGIS platform core frontend by extending the AngularJS frontend. It allows the simple provision of detailed and semantically valuable metadata for projects, datasets and files in the AEGIS platform. In the following releases the Data Annotator will be extended to support more detailed metadata.

Cleansing Tool

In the second release of the AEGIS platform, a two-fold approach is followed for the data cleansing tool: (a) an offline cleansing tool, residing where the data are located, that provides the necessary cleansing processes and functionalities with a variety of techniques for data validation, data cleansing and data completion prior to importing the data into the AEGIS platform, and (b) an online cleansing tool for data cleansing and manipulation, with a set of functionalities offered during the data query creation process, addressing certain simple but computationally intensive cleansing tasks by leveraging the computational power of the AEGIS platform. Concerning the online cleansing tool, the offered functionalities are incorporated inside the Query Builder component (see the corresponding section below). The purpose of the offline cleansing tool is to provide a tool that is easily customisable and adaptable to the user's needs, enhancing the user experience and helping users accomplish the required cleansing tasks. The tool is implemented as a standalone application written in Python using the Flask microframework and a set of libraries such as Pandas and NumPy. In the first prototype version of the offline cleansing tool, the rules for data validation, data cleansing and missing data handling can be set according to the user's needs. Additionally, the user is able to review the results of the execution of the cleansing tasks through the user interface. The offline cleansing tool also offers a REST API with the appropriate endpoints for uploading the dataset that will be used in the cleansing process and for executing the cleansing tasks. The upcoming versions of the tool will contain several enhancements and updates in terms of cleansing functionalities and rules definition.
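
As a rough illustration of how such a Flask-based service could expose its endpoints, consider the sketch below. The routes, rule format and in-memory storage are illustrative assumptions, not the tool's actual interface.

    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    datasets = {}  # uploaded datasets, kept in memory for simplicity

    @app.route("/datasets", methods=["POST"])
    def upload_dataset():
        # Upload a CSV file that will be used in the cleansing process
        name = request.files["file"].filename
        datasets[name] = pd.read_csv(request.files["file"])
        return jsonify({"dataset": name, "rows": len(datasets[name])}), 201

    @app.route("/datasets/<name>/clean", methods=["POST"])
    def clean_dataset(name):
        # Execute simple cleansing tasks according to user-defined rules
        rules = request.get_json()
        df = datasets[name]
        if "fill_missing" in rules:      # missing data handling
            df = df.fillna(rules["fill_missing"])
        if rules.get("drop_duplicates"): # data validation / cleansing
            df = df.drop_duplicates()
        datasets[name] = df
        return jsonify({"dataset": name, "rows": len(df)})

    if __name__ == "__main__":
        app.run()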

Anonymisation Tool

The anonymisation tool is an extensible, schema-agnostic plugin that allows efficient real-time data anonymisation. The anonymisation tool has been used for offline, private usage but offers the ability to output the anonymised data through a secured web API. The tool either syncs with private database servers or imports files from the filesystem, and executes anonymisation functions on datasets of various sizes with little or no overhead. The purpose of the anonymisation is to allow the exploitation of the raw data in the system while accounting for privacy concerns and legal limitations.

The tool allows the user to set up a new configuration or edit an existing, saved one. When creating a new configuration, the system prompts the user to connect to the private database backend or to select a file to open, and to choose the entities/tables to anonymise. The system then prompts the user to select which fields from the data source to expose to the anonymised set, as well as the anonymisation function to be performed on each. The user is able to execute queries and test the anonymised output through the tool's integrated console. The anonymised (output) data can be exposed to external parties in a secure way through an API, via the provision of access keys. Users can access the anonymised data in JSON format through the API provided by the anonymisation tool, using their private access keys. A conceptual sketch of such a configuration is given below.
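
The sketch below illustrates, under stated assumptions, what such a configuration and its execution could conceptually look like; the entity name, fields and anonymisation functions are hypothetical, not the tool's real schema.

    import pandas as pd

    # Hypothetical configuration: which fields to expose and how to anonymise them
    config = {
        "entity": "patients.csv",
        "fields": {
            "age":     lambda v: f"{(v // 10) * 10}-{(v // 10) * 10 + 9}",  # value range
            "city":    lambda v: v,                                          # pass through
            "address": lambda v: "<redacted>",                               # suppress
        },
    }

    def run_configuration(cfg):
        # Import the entity from the filesystem, keep only the exposed fields
        # and apply the configured anonymisation function to each of them
        df = pd.read_csv(cfg["entity"], usecols=list(cfg["fields"]))
        for column, fn in cfg["fields"].items():
            df[column] = df[column].map(fn)
        return df.to_dict(orient="records")  # JSON-ready output for the web API

    anonymised = run_configuration(config)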

Brokerage Engine

The brokerage engine of AEGIS acts as a trusted way to record and keep a log of transactions over the AEGIS platform, which mostly have to do with the sharing of the different data assets. The current implementation of the brokerage engine is based on the Hyperledger Fabric open source distribution. The versions of the transaction models supported are based on the Data Policy Framework and have been optimised to reflect the main points of the Business Brokerage Framework. The Blockchain Engine is deployed on GFT servers and ships together with the overall AEGIS distribution, facilitating in this manner the transactions that happen over the platform. Connection with the platform is based on the REST API interface exposed by the Blockchain Engine, through which essential actions on the platform, such as user creation and dataset sharing, are also recorded in the blockchain ledger. The brokerage engine, after its customisation, has been containerised with the use of Docker, in order to allow for easier and faster deployment. Moreover, this approach will allow multiple nodes of the brokerage engine to be deployed in the different AEGIS clusters that may be set up in the future, and to interconnect them into a single AEGIS distributed ledger, provided they are all connected to the same public network and specific peer and orderer keys are issued to facilitate interconnection and service discovery.
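
A client-side interaction with such a REST API might look roughly like the sketch below; the deployment URL, endpoint path and payload fields are assumptions for illustration only, not the Blockchain Engine's actual interface.

    import requests

    BROKER_URL = "https://broker.example.aegis/api"  # hypothetical deployment URL

    def record_dataset_sharing(dataset_id, owner, recipient):
        # Record a dataset-sharing transaction in the distributed ledger
        payload = {
            "type": "DatasetShared",   # transaction model per the Data Policy Framework
            "dataset": dataset_id,
            "owner": owner,
            "recipient": recipient,
        }
        response = requests.post(f"{BROKER_URL}/transactions", json=payload)
        response.raise_for_status()
        return response.json()  # validated transaction as it appears in the ledger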

Hopsworks Integrated Services

The AEGIS platform provides data management and processing, user management, and service monitoring through the use of Hopsworks integrated services. Hopsworks introduces the notions of Project, Dataset, and User to enable multi-tenancy within the context of data management. Data processing includes data parallel processing services such as MapReduce, Spark, and Flink, as well as interactive analytics using notebooks such as Zeppelin and Jupyter. Full-text search capability is offered by the included ELK stack component (Elasticsearch). Real-time analytics is enabled by the use of the included Kafka service.

Full-Text Search

Elasticsearch, one of the ELK stack components, is used to provide full-text search capabilities for exploring projects and datasets within the AEGIS platform. The search space available to each user depends on the context from which the user searches. For example, within the context of the home page, the search space includes public datasets, projects, and private datasets. When inside a project, the scope of the search is reduced to the project's datasets, including the datasets shared from other projects. Thus, the search function covers all projects in which a user is involved, all datasets included in those projects (whether owned or shared), and all public datasets.

Metadata Service (formerly AEGIS Linked Data Store)

The Metadata Service is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation of the processing of the data within the AEGIS platform. It is based on the AEGIS ontology and vocabulary. For the second release, the component was further developed, enhanced and better integrated.

Triplestore

The foundation of the Metadata Service is the Apache Fuseki Triplestore. It can be directly accessed here:

http://aegis-store.fokus.fraunhofer.de.

It offers multiple standardised Linked Data interfaces, like SPARQL or the Graph Store HTTP Protocol. These interfaces can be utilised by other components of the AEGIS platform, especially the Query Builder. Fuseki only acts as a storage layer and is not supposed to be accessed directly by the users or any other component, with the exception of the SPARQL interface, which can be used for executing complex queries against the metadata of the AEGIS platform. For this purpose, the AEGIS ontologies are publicly available at http://aegis.fokus.fraunhofer.de/.
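
For example, a query against the metadata could be issued from Python roughly as follows; the exact Fuseki dataset path under the public store is an assumption.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical Fuseki dataset path on the public store
    sparql = SPARQLWrapper("http://aegis-store.fokus.fraunhofer.de/aegis/sparql")
    sparql.setQuery("""
        PREFIX dcat: <http://www.w3.org/ns/dcat#>
        PREFIX dct:  <http://purl.org/dc/terms/>
        SELECT ?dataset ?title WHERE {
            ?dataset a dcat:Dataset ;
                     dct:title ?title .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    # Print the URI and title of each dataset found in the metadata store
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["dataset"]["value"], "-", row["title"]["value"])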

Metadata Service

The Triplestore only offers Linked Data interfaces and no rich management functionalities. Therefore, an additional service is required, providing additional functionalities for the management of the metadata. This particularly includes the straightforward creation of metadata sets. A first prototype is available here:

http://aegis-metadata.fokus.fraunhofer.de.

It interacts with the Fuseki triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata. It maps the JSON input to the Linked Data structures defined by the AEGIS ontology. For the second release, a basic recommendation service was implemented. It suggests suitable and similar datasets based on an input dataset. To this end, several characteristics of the dataset are matched against the stored metadata, e.g. keywords or the semantic tabular information. For future releases, this feature will be extended and improved. The metadata service is developed in Java, based on the Play Framework.
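
Creating a metadata record through such a JSON-based REST API might look like the following sketch; the endpoint path and JSON fields are illustrative assumptions rather than the service's actual schema.

    import requests

    METADATA_URL = "http://aegis-metadata.fokus.fraunhofer.de"

    # Hypothetical JSON payload; the service maps it to the Linked Data
    # structures defined by the AEGIS ontology (e.g. a dcat:Dataset)
    record = {
        "title": "Athens weather observations",
        "description": "Hourly weather readings, aggregated daily",
        "keywords": ["weather", "athens", "csv"],
    }

    response = requests.post(f"{METADATA_URL}/datasets", json=record)
    response.raise_for_status()
    print("Created:", response.json())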

Query Builder

Query Builder provides the capability to interactively define and execute queries on data available in the AEGIS system. It is primarily addressed to AEGIS users with limited technical background, but it is potentially useful for all, as it simplifies and accelerates the process of retrieving data and creating views on them. Query Builder also offers some simple data cleansing functionalities. In its previous version, Query Builder was implemented in the form of a note inside the Apache Zeppelin notebook, using mainly PySpark for the data manipulation and JavaScript and AngularJS for the user interface. In the new version, the tool has switched to the Jupyter Notebook as a more appropriate environment. The tool, potentially in slightly different flavours (to allow for more customisation of the data manipulation processes), will be directly accessible inside every newly created project through the Jupyter Notebook.

Query Builder leverages the metadata available for each file in the AEGIS system in order to provide its enhanced data selection and manipulation capabilities, i.e. enabling/disabling certain data merging and filtering options according to the data schema and also allowing the user to perform more targeted dataset exploration and retrieval based on the available metadata. The new Query Builder version is fully integrated with the Metadata Service; hence the available datasets and corresponding information are taken from the service and presented to the user. The user can initially select an interesting dataset and then browse its files in order to choose which one to load. Once the selected file is opened, it is loaded as a Spark DataFrame, called the temp dataset (tempDF), and is available for further data manipulation.

At all times, the user can have up to two different datasets active in Query Builder: the temporary (temp) and the master. The temporary is the one currently being used and changed, whereas the master is used as a “storage point” for intermediate results while the user is processing data, and as the final result once all data manipulation is over and the user is satisfied with the outcome. A number of filters and data processing methods are available for the user to select and apply on the temporary dataset through the “Controls” panel. Indicatively, the user can fill in null values, filter out entries based on values, rename columns, replace values, select/drop columns etc. The user may also merge or append the temporary dataset with the master dataset. A list of selected data manipulation actions, either already applied or pending application, is always visible under the “Selected filters” panel. When the result of a series of data processing tasks on the temporary dataset is satisfactory, it can be saved as the master dataset. The user may continue processing the same temporary dataset or open a new one or, when the query creation process is complete, can save the master dataset as a new CSV file or export the query and continue to directly change the generated code. Finally, the result of the data manipulation, i.e. the master dataset, can be directly passed as input to more high-level AEGIS tools, like the Visualiser and the Algorithm Execution Container.
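
Under the hood, these operations correspond to straightforward PySpark transformations. The following is a minimal sketch of the temp/master workflow; the file paths and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def load_and_clean(path):
        # Load a selected file as the temporary dataset and apply a few of
        # the offered filters and data processing methods
        df = spark.read.csv(path, header=True, inferSchema=True)
        df = df.na.fill({"no2": 0.0})                  # fill in null values
        df = df.filter(df["no2"] >= 0)                 # filter out entries by value
        df = df.withColumnRenamed("no2", "no2_ugm3")   # rename a column
        return df

    # The temporary (temp) dataset currently being worked on
    tempDF = load_and_clean("Resources/air_quality_2017.csv")

    # Save the satisfactory result as the master dataset ("storage point")
    masterDF = tempDF

    # Open a new temporary dataset with the same schema and append it
    tempDF = load_and_clean("Resources/air_quality_2018.csv")
    masterDF = masterDF.unionByName(tempDF)

    # Once the query creation process is complete, save the master as new CSV
    masterDF.write.csv("Resources/air_quality_clean", header=True)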

Visualiser

The Visualiser is the component enabling the advanced visualisation capabilities of the AEGIS platform. In accordance with the latest design and specification of the component, the purpose of the Visualiser remains two-fold: (1) to provide visualisations of the results generated by the Algorithm Execution Container and (2) to provide visualisations of the results generated by the queries composed and executed by the Query Builder.

Through its intuitive and easy-to-use user interface, the Visualiser offers a variety of visualisation formats, spanning from simple static charts to interactive charts with multiple layers of information and several customisation options. The current implementation of the Visualiser component supports the following visualisation types:

  • Scatter plot
  • Pie chart
  • Bar chart
  • Line chart
  • Box plot
  • Histogram
  • Time series
  • Heatmap
  • Bubble chart
  • Map

The Visualiser is implemented as a predefined Jupyter [5] notebook. In addition to Jupyter, two open source Python libraries were utilised, namely the Folium [6] and Highcharts [7] libraries. Within the AEGIS platform the Visualiser can be accessed through Jupyter, which is integrated as a service within the AEGIS Front End. The execution workflow of the Visualiser component is explained in the following steps:

  1. When the Visualiser is loaded, the user is presented with a list of options for defining the dataset that will be used to create the visualisation. The user is able to navigate through the list of available datasets within the project’s folders and select the desired dataset. Upon selecting the desired dataset, a preview of the dataset in tabular format is presented to the user.
  2. In the next step, the user is presented with the list of available visualisation types. Once the desired visualisation type is selected, the user is presented with the list of available parameters for that visualisation type. The list of parameters spans from the variables that will be used in the visualisation and the titles that will be displayed on the visualisation axes to type-specific parameters of the selected visualisation, such as the aggregation function or the class variable.
  3. Once the visualisation type has been selected and the corresponding parameters have been set, the user can trigger the visualisation creation, as sketched below for the map type.
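
For the map visualisation type, for instance, the notebook relies on Folium. A minimal sketch of what the generation step could do is shown below; the input file, coordinates and column names are illustrative assumptions.

    import folium
    import pandas as pd

    # Dataset selected in step 1 (hypothetical columns: lat, lon, label)
    df = pd.read_csv("incidents.csv")

    # Parameters set in step 2: map centre and zoom level
    m = folium.Map(location=[df["lat"].mean(), df["lon"].mean()], zoom_start=11)

    # Step 3: generate the visualisation by adding one marker per record
    for _, row in df.iterrows():
        folium.CircleMarker(location=[row["lat"], row["lon"]],
                            radius=4, popup=str(row["label"])).add_to(m)

    m  # in Jupyter, the map renders inline as the cell output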

Algorithm Execution Container

The Algorithm Execution Container is an interface for data analysis that resides on top of Zeppelin, one of the most popular notebooks used by data analysts. The overall concept of this module is to accelerate analysis execution by simplifying the steps that data analysts perform, eliminating the need to author code directly in the notebook. The current implementation of the Algorithm Execution Container is based on AngularJS and Python. In terms of algorithms, the container exploits the MLlib machine learning library and exposes 16 different algorithms, grouped into 5 algorithmic families.

Upon launch of the container, the Spark interpreter of the AEGIS platform is fired up to power the code that needs to be executed by the Zeppelin notebook. The container takes as input a dataset, which has to be formatted accordingly to be ready for use by the selected algorithm. A pointer to the specific URL of the input dataset is provided, which has to be a dataset stored in the AEGIS Data Store and accessible to the user performing the analysis. The user is then able to select the algorithmic family and the specific algorithm to run, and is presented with the parameters that are relevant to the analysis, so as to provide his or her preferences. Upon execution, the necessary Zeppelin paragraphs are executed in the background and the user is presented with the final result of the analysis. Simple, predefined visualisation options are also provided by the Zeppelin notebook. The final output of the analysis is automatically stored back in the AEGIS Data Store, while the model that was used for the analysis is also stored alongside the analysis results. According to the project development plan, the next version of the Algorithm Execution Container is going to be deployed over the Jupyter notebook, thus covering the two most popular notebook implementations available, while it will allow for the creation of a unified analytics flow within the platform by interconnecting the container to the Query Builder and the Visualiser.
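
As an illustration of the kind of code that could run behind the generated Zeppelin paragraphs, the sketch below trains one MLlib algorithm (k-means, from the clustering family) on an input dataset; the paths and column names are hypothetical.

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Input dataset referenced by its URL in the Data Store (hypothetical path)
    df = spark.read.csv("hdfs:///Projects/demo/Resources/readings.csv",
                        header=True, inferSchema=True)

    # Format the dataset as required by the selected algorithm
    assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                                outputCol="features")
    features = assembler.transform(df)

    # User-selected algorithm and parameters (algorithmic family: clustering)
    model = KMeans(k=3, featuresCol="features").fit(features)

    # Store the analysis results and the model back in the Data Store
    model.transform(features) \
         .select("temperature", "humidity", "prediction") \
         .write.csv("hdfs:///Projects/demo/Resources/result", header=True)
    model.save("hdfs:///Projects/demo/Resources/kmeans_model")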

References

[1] https://www.aegis-bigdata.eu/

[2] https://vertx.io/

[3] https://www.docker.com/

[4] https://github.com/OAI/OpenAPI-Specification

[5] http://jupyter.org/

[6] http://folium.readthedocs.io/en/latest/

[7] https://www.highcharts.com/

Blog post authors: GFT

Releasing the AEGIS Platform v1!
https://www.aegis-bigdata.eu/releasing-the-aegis-platform-v1/ – Thu, 01 Mar 2018

Today we are very happy to announce the release of the AEGIS platform v1.

The AEGIS first integrated prototype features a low-fidelity, functional mockup version of the AEGIS platform’s backbone, providing a meaningful subset of the functionalities characterizing the devised MVP. It is delivered for early assessment by the end users. Preliminary design work has been performed by realizing a set of mockups showing the main functionalities of the platform in terms of how a user will perform tasks through the AEGIS platform. To accomplish this task, the Ninjamock web tool has been adopted. Ninjamock is a collaborative service for creating and sharing wireframes, allowing users to share projects through private URLs and to export wireframes in PDF format. GitHub has been selected as the software development platform, offering all of the distributed version control and source code management functionality of Git and adding its own features (i.e. access control and several collaboration features such as bug tracking, feature requests, task management, and wikis). The AEGIS GitHub repository is available at the following URL: https://github.com/aegisbigdata.

A TESTBED machine has been provided, assigning to the virtual machine 8 cores and 60 GB of memory, necessary to run the environment efficiently. Testing accounts on the TESTBED machine have been created for the partners. An integration plan was prepared to guide the integration of the backbone infrastructure with the various services and components, focusing on a continuous integration process. At the same time, a strategy was defined for verification and testing of the platform and of its components: the AEGIS partners responsible for each platform component perform development activities on local machines, test the component [unit test] and release it on GitHub according to the commonly agreed release plan. The system integrator and all the interested technical partners integrate the web components from the AEGIS front-end locally, testing included [integration testing]. After that, the system integrator releases the updates on the AEGIS GitHub repository. Then a selected partner deploys the updates on the TESTBED machine. For each workflow identified in an earlier task of the project, the responsible partners test their respective workflow on the TESTBED machine [system testing]. After that, the TESTBED machine is made available to the demonstrators [acceptance testing].

Building on top of Hopsworks, a GUI/Front-end has been released, following the look and feel of the AEGIS institutional web site and using HTML and AngularJS as the main technologies. When users access the platform for the first time, they can register an account and then proceed with the login. Once the user is logged into the platform, a main page shows on the right side the list of the currently available projects, together with a button for creating a new one. After a project has been selected, a new page shows the related activity history and the main menu on the left side, including the following items: Data Sets, Queries, Metadata Designer, Visualizations, Analytics and Settings. In addition, a search functionality is available in the top-right corner of the page (see Figure 1).

Figure 1 – AEGIS first prototype GUI

An overview of the main components of the platform is provided below.

Data Harvester and Annotator

The Data Harvester and Data Annotator are loosely coupled with the central AEGIS platform and, for development reasons, not yet fully integrated into the deployment process.

Metadata Ontology

The basis of these components is the AEGIS metadata ontology and vocabulary. Its objective is to describe the available datasets in a detailed fashion in order to support any further processing, especially visualisations and query building. The ontology is based on the DCAT-AP specifications and extended accordingly. The core concepts of DCAT-AP were mapped to the corresponding concepts and data structures of the AEGIS platform in the following way:

  • DCAT-AP Catalogues → AEGIS Projects
  • DCAT-AP Datasets → AEGIS Data Sets
  • DCAT-AP Distributions → AEGIS Files

In addition, the focus of the first prototype is to simply and effectively describe the semantics of tabular data. To this end, an AEGIS-specific set of Linked Data classes and properties was developed, with simplicity and reusability as the overall objectives. The first iteration of the ontology is available on GitHub as a Turtle-serialised RDF file.
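
To illustrate the mapping, the sketch below builds a tiny Turtle-serialised description of one project, dataset and file using rdflib; the example URIs are assumptions, and the actual AEGIS ontology defines additional classes and properties.

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    DCT = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)

    project = URIRef("http://example.aegis/projects/demo")     # AEGIS Project
    dataset = URIRef("http://example.aegis/datasets/weather")  # AEGIS Data Set
    csvfile = URIRef("http://example.aegis/files/weather.csv") # AEGIS File

    g.add((project, RDF.type, DCAT.Catalog))        # DCAT-AP Catalogue
    g.add((project, DCAT.dataset, dataset))
    g.add((dataset, RDF.type, DCAT.Dataset))        # DCAT-AP Dataset
    g.add((dataset, DCT.title, Literal("Weather observations")))
    g.add((dataset, DCAT.distribution, csvfile))
    g.add((csvfile, RDF.type, DCAT.Distribution))   # DCAT-AP Distribution

    print(g.serialize(format="turtle"))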

Data Harvester

The Data Harvester is the component enabling the import of data from heterogeneous sources and their transformation to the required data format of the AEGIS platform. For the first prototype, the Java-based solution StreamSets Data Collector was employed and adapted to allow the straightforward composition of harvesting pipelines.

In order to publish data on the AEGIS platform, a plugin for StreamSets was developed which enables the users to choose AEGIS as a destination for a data pipeline. The prototype was successfully tested by retrieving data from CSV files and transferring them to the AEGIS platform. With the integrated processing tools of StreamSets, rich pre-processing and aggregation of several data sources can be performed. In addition, complex transformation tasks can be done by applying JavaScript or Python scripts within the StreamSets web application interface. A first pipeline was implemented, harvesting the data from various CSV files to the AEGIS platform. StreamSets does not offer a convenient feature for scheduling harvesting pipelines, but it does offer a REST API for starting, stopping and monitoring them. To overcome this downside and allow better integration into the AEGIS platform, the Java-based EDP Metadata Transformer Service was adopted to allow the scheduling and management of StreamSets pipelines.

The tool lets users add REST endpoints of StreamSets and configure a scheduling plan for their execution. It offers a REST API as well, allowing the immediate integration of these functionalities into the AEGIS platform.

Annotator

The annotator is the component in the AEGIS platform that is responsible for interactively equipping input data with suitable metadata. It has not yet been developed, but its functionality is already outlined in an interactive mockup. The annotator will be an interactive metadata editor with a graphical user interface. It will be based on DCAT-AP and the AEGIS ontology and will tightly interact with all components of the AEGIS data value chain. Within the data value chain, the annotator will be available once a dataset is present in the AEGIS platform.

The services and tools described above will be integrated into the AEGIS platform. The annotator user interface and the scheduling functionalities will be directly available from the AEGIS frontend. Hence, they will be integrated into the Angular.js frontend application. This will include communication with the respective backend counterparts, like the AEGIS Linked Data Store.

Cleansing Tool

The AEGIS Cleansing Tool has been developed according to a two-fold approach that includes (a) simple data cleansing transformations to be applied offline through existing mature tools and (b) more complex data cleansing (e.g. outlier detection and removal) through dedicated cleansing processes that could be implemented through the Algorithm Execution Container. Furthermore, in order to provide a more intuitive user experience and also leverage the computational power of the AEGIS system, it was decided to make certain simple data cleansing functionalities available to the user during the data query creation process, i.e. at the point where the user should be more confident about the data manipulation needed in order to use the data for further analysis. Hence, the cleansing tool for the first AEGIS prototype is incorporated in the Query Builder.

Anonymisation Tool

The anonymisation tool is an extensible, schema-agnostic plugin that allows efficient real-time data anonymisation. The anonymisation tool has been used for offline, private usage but offers the ability to output the anonymised data through a secured web API. With emphasis on performance, the tool syncs with private database servers and executes anonymisation functions on datasets of various sizes with little or no overhead. The purpose of the anonymisation is to unlock the potential value of raw data in the system while accounting for privacy concerns and legal limitations. The anonymisation system comes with a list of predefined anonymisation functions that can be used directly (e.g. deriving the city from an exact address, or a range of values from an integer), as well as a list of aggregation functions (e.g. average). In addition, the tool can easily be extended with any new, custom anonymisation function defined by the user in a Python module, as sketched below. The user is able to execute queries and test the anonymised output through the tool’s integrated console. The anonymised output data can be exposed to external parties in a secure way through an API, via the provision of access keys. Users can access the anonymised data in JSON format through the API provided by the anonymisation tool, using their private access keys.
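
A user-defined anonymisation function of this kind could be as simple as the following sketch; the function names and the assumed address format are illustrative.

    def int_to_range(value, width=10):
        """Custom anonymisation function: map an integer to a value range,
        e.g. 37 -> '30-39' for a width of 10."""
        low = (value // width) * width
        return f"{low}-{low + width - 1}"

    def city_from_address(address):
        """Generalise an exact address to its city, assuming a
        'street, city' format in the source data."""
        return address.rsplit(",", 1)[-1].strip()

    # Example usage, as one might test it in the tool's integrated console
    print(int_to_range(37))                         # '30-39'
    print(city_from_address("12 Main St, Athens"))  # 'Athens'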

Brokerage Engine

The brokerage engine of AEGIS acts as a trusted way to log transactions over the AEGIS platform, which mostly have to do with the usage of the different data assets. The brokerage engine is based on the Hyperledger Fabric implementation, and the initial versions of the models supported are based on the Data Policy Framework. These models are expected to be updated in the next release of the platform, and the most important descriptors will be exposed through the AEGIS metadata editor, so that each data asset is accompanied by the model attributes the brokerage engine needs in order to perform. Upon execution of each transaction that is linked to the distributed ledger, a specific block is created, and the transaction is then validated and appears in the ledger. The current implementation of the brokerage engine platform exposes a REST API, which will be integrated into the AEGIS core platform in order for the latter to call the brokerage engine and perform any necessary writes/reads to/from the blockchain infrastructure.

Hopsworks Integrated Services

The AEGIS platform uses Hopsworks integrated services for data management and processing. Hopsworks provides a multi-tenant project platform for secure collaborative data management between users. It provides integrated support for different data parallel processing frameworks such as MapReduce, Spark, and Flink, interactive notebooks using Zeppelin, as well as other ELK stack components such as Elasticsearch. Hopsworks introduces the notions of Project, Dataset, and User. A Dataset is a directory in the underlying file system (HopsFS) that can contain arbitrarily many files and directories, and it can be shared between projects. A Project is a collection of datasets, notebooks, users, and jobs (MapReduce, Spark, Flink). A User in Hopsworks can create multiple projects, share projects with other users, create datasets, and upload/download/preview data in his/her datasets. Below we show the main components provided by Hopsworks.

Users API

A new user of the platform is required to first register with AEGIS, by providing: first name, last name, email, password and a security question/answer. Once this step is performed, an AEGIS administrator can search and validate the new account with the appropriate roles. Once the account is active, the user can log in with her credentials on the AEGIS platform.

Projects and Datasets

Once the user is logged in, he/she can start using the AEGIS platform by first creating a project. The user provides the project name, a description, and a potential list of members. Then, the user can navigate to the project (e.g. ‘AEGIS’), where a list of existing datasets is provided. Inside the project, the left bar offers tools that users can use to interact with and process their data, such as Queries, Visualizations, and Analytics. To start interacting with these tools, users first create their own datasets and then upload their data inside. Inside a dataset, users can create folders, upload files, and then navigate through the created structure. The uploaded files can then be viewed on the platform or downloaded.

ELK Stack

Hopsworks includes support for a number of the components of the ELK stack, such as Elasticsearch, Kibana and Logstash. Of these services, the first prototype of the AEGIS platform utilizes the Elasticsearch component to provide full-text search capabilities for exploring projects and datasets. The search space available to each user depends on the visibility of the datasets. Once the visibility of a dataset is set to public, it becomes part of the search space of any user of the AEGIS platform. Datasets can also be shared directly between two projects. Thus, the search function covers all projects in which a user is involved, all datasets included in those projects (whether owned or shared), and all public datasets.

AEGIS Linked Data Store

The AEGIS Linked Data Store is responsible for storing the metadata associated with a particular dataset within the AEGIS platform. This metadata forms the foundation of the processing of the data within the AEGIS platform. It is based on the AEGIS ontology and vocabulary. It is composed of two components: a Triplestore and a thin management layer for convenient access to the Linked Data. For the first prototype, Apache Fuseki was deployed as the Triplestore. It offers multiple standardised Linked Data interfaces, like SPARQL or the Graph Store HTTP Protocol. These interfaces can be utilised by other components of the AEGIS platform, especially the Query Builder. The Triplestore only offers (complex) Linked Data interfaces and no rich management functionalities. Therefore, an additional service is required, providing additional functionalities for the management of the metadata. This particularly includes the straightforward creation of metadata sets. It interacts with the Fuseki triplestore and offers a simple JSON-based REST API for creating, deleting and updating metadata. It maps the JSON input to the Linked Data structures defined by the AEGIS ontology.

The AEGIS Linked Data Store requires tight integration into the AEGIS platform, since the metadata is present throughout the entire data value chain, from data provision to visualisation. For creating the metadata, the Annotator will be integrated into the AEGIS Angular.js frontend and will communicate with the Linked Data Store via the metadata service. Since the metadata and the actual data are stored in different services, a synchronisation mechanism will be integrated, ensuring that for each metadata record the respective data is available and vice versa. This will be done by implementing hooks into the AEGIS core platform, which fire events to the metadata service whenever data is modified or deleted.

Query Builder

Query Builder provides the capability to interactively define and execute queries on data available in the AEGIS system. It is primarily addressed to AEGIS users with limited technical background, but it is potentially useful for all, as it simplifies and accelerates the process of retrieving data and creating views on them. Query Builder is implemented in the form of a note inside the Apache Zeppelin notebook, using mainly PySpark for the data manipulation and JavaScript and AngularJS for the user interface. The tool is directly accessible inside every newly created project. Query Builder leverages the metadata available for each file in the AEGIS system in order to provide its enhanced data selection and manipulation capabilities, i.e. enabling/disabling certain data merging and filtering options according to the data schema and also allowing the user to perform more targeted dataset exploration and retrieval based on the available metadata, through SPARQL queries performed in the background.

Once the Query Builder is opened, the user can select one of the available CSV files. The selected file is loaded, and a preview is directly available. At all times, the user can have up to two different datasets active in Query Builder: the temporary (temp) and the master. The temporary is the one currently being used and changed, whereas the master is used as a “storage point” for intermediate results while the user is processing data, and as the final result once all data manipulation is over and the user is satisfied with the outcome. A number of filters and data processing methods are available for the user to select and apply on the temporary dataset. Indicatively, the user can fill in null values, filter out entries based on values, rename columns, replace values, select/drop columns etc. The user may also merge or append the temp dataset with the master dataset. A list of selected data manipulation actions, either already applied or pending application, is shown, and a preview of the data processing result is always available. When the result of a series of data processing tasks on the temp dataset is satisfactory, it can be saved as the master dataset. The user may continue processing the same temporary dataset or open a new one or, when the query creation process is complete, can save the master dataset as a new CSV file or export the query and continue to directly change the generated code. This code can be leveraged (a) by the advanced user as an easily acquired starting point to further elaborate on for more complex queries and (b) by the less technically skilled user as a means to understand the underlying code and facilitate learning. Finally, the result of the data manipulation, i.e. the master dataset, can be directly passed as input to more high-level AEGIS tools, like the Visualiser and the Algorithm Execution Container.

Visualiser

The AEGIS Visualiser was designed and implemented with a two-fold purpose: (1) to provide visualisations for the analysis results generated by the Algorithm Execution Container and (2) to provide visualisations for the results generated by the queries composed and executed by the Query Builder. The first prototype is based on Apache Zeppelin, which offers the interactive web-based notebook functionality in the AEGIS platform. In addition to Zeppelin, two JavaScript libraries were used, namely the plotly.js and Highcharts libraries, two state-of-the-art open-source charting libraries offering a large variety of interactive charts and visualisations. The execution workflow of the Visualiser component is delivered as a Zeppelin notebook that is accessible through the AEGIS front-end. In detail, the Visualiser prototype currently supports the following two main cases. In the first case, a predefined list of visualisations is available for a list of selected datasets: upon selecting the desired dataset, a preview of the dataset in tabular format is presented, and the user can choose the preferred visualisation from the list of available visualisations for the chosen dataset. In the second case, the user is first presented with the list of available datasets of his project. Upon selecting the desired dataset, a preview of the dataset in tabular format is shown, together with the list of available visualisation types. Once the desired visualisation type is selected, the user is able to set the parameters of the visualisation according to the type selected. Finally, the visualisation is generated and presented to the user.

Algorithm Execution Container

The Algorithm Execution Container engine of AEGIS is the central computation environment where data analysis takes place; it is used by data analysts to run the required analyses. The scope of the Algorithm Execution Container is to provide data analysts with a visual UI that helps them run analyses without requiring them to write the analyses in specific execution languages, such as PySpark, R or Scala. As such, the data analyst can concentrate on the analysis to be executed, and not invest resources in the functional programming of the analysis. The Algorithm Execution Container environment is built on top of Zeppelin as a notebook and is interconnected to the backbone of the AEGIS platform, utilising specific data analysis and machine learning algorithms to perform the analyses. In this context, it takes as input datasets coming out of the Query Builder, or directly from HDFS, and loads a UI on top of Zeppelin. The outputs of each analysis can then be stored locally on HDFS or, in case an algorithm produces results that are meaningful to visualise, forwarded to the Visualiser. Moreover, after each execution, the Algorithm Execution Container showcases some algorithm-specific metrics which can be used to assess the performance of each algorithm.

Blog post authors: GFT
