methodology – AEGIS Big Data

The AEGIS Architecture & Technical Requirements: Methodology & Produced Results

aegis-admin — Wed, 17 Jan 2018 10:16:19 +0000

The AEGIS platform promises a novel approach to data sharing and exchange: a safe environment in which stakeholders can sell and purchase datasets, data services, algorithms or intelligence reports. To reach this objective, AEGIS needs to understand the environment in which it operates and the needs of its stakeholders, so that it can offer a set of services that add value to this value chain.

To this end, AEGIS analysed the methodologies available for requirements engineering in order to select the most relevant and efficient one for the project. From this state-of-play analysis, the AEGIS consortium concluded that the Agile development methodology is the most suitable. The Agile methodology provides opportunities to assess the direction of a project throughout the development lifecycle. It is people-focused and communications-oriented, on the premise that requirements cannot be fully collected at the beginning of the software development cycle. Instead, every aspect of development (requirements, design, etc.) is continually revisited, so that there is still time to steer the project in another direction. Agile planning is adaptive, focused on quick responses to change through continuous development and improvement.

In the Agile methodology the whole product is broken into small increments that minimize the overall risk and allow the product to adapt to changes quickly. Iterations are short time frames (called timeboxes), each involving a cross-functional team working across all functions: planning, analysis, design, coding, unit testing, and acceptance testing. The requirements/tasks to be done in the timebox are defined at the beginning of the iteration and agreed upon by the team. Another key concept in Agile development is the close collaboration between the development team and the business stakeholders. Accordingly, requirements elicitation followed the principles of user stories, coming directly from the demonstrator partners.

Figure 1. Requirement Engineering

The Agile-compliant phases defined for the engineering of the AEGIS user requirements are illustrated in Figure 1. Once the AEGIS stakeholders have been identified and their high-level goals, objectives and needs properly defined, the user stories need to be elicited. Two of the most popular methods for eliciting user stories are workshops and structured and/or semi-structured interviews. Within AEGIS, the technical partners held such workshops and interviews with the pilot teams in order to obtain the relevant user stories.

User stories are a very high-level definition of requirements, describing a feature from the perspective of the end user who desires the new capability, usually a user or customer of the system. A user story is short, generally one sentence, but it contains enough information to describe the requirement, so that the developers can produce a reasonable estimate of the effort needed to implement it. A user story typically follows a simple template: As a <user-type>, I want to <goal> so that <reason>. User stories are written throughout the Agile project. Usually a story-writing workshop is held close to the start of the project, and new stories can be written and added in every iteration.
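To make the template concrete, the following minimal Python sketch renders and parses stories of the common form "As a <user-type>, I want to <goal> so that <reason>". The class name, field names and the example story are illustrative assumptions, not part of the AEGIS deliverables.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    """One user story; the field names are illustrative assumptions."""
    user_type: str  # who desires the capability
    goal: str       # what they want to do
    reason: str     # why they want it

    def render(self) -> str:
        return f"As a {self.user_type}, I want to {self.goal} so that {self.reason}."


# Accepts "As a ..." or "As an ..." and an optional trailing full stop.
_PATTERN = re.compile(
    r"^As an? (?P<user_type>.+?), I want to (?P<goal>.+?) so that (?P<reason>.+?)\.?$"
)


def parse_story(text: str) -> UserStory:
    """Split a one-sentence story into its three parts, or raise ValueError."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a user story in the expected form: {text!r}")
    return UserStory(**match.groupdict())


# Hypothetical example values, loosely inspired by requirement KTH1.
story = UserStory(
    user_type="data analyst",
    goal="choose a service to work on my data",
    reason="I can analyse it without writing code",
)
print(story.render())
```

Parsing the rendered sentence returns an equal `UserStory`, which is a cheap way to check that stories collected in a workshop actually follow the template before they are analysed further.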

The collected user stories were in turn analysed by the technical partners so that the user requirements could be extracted, aggregated and checked for consistency. To produce a proper listing, the requirements also need to be classified. Within AEGIS, the requirements were split according to two main attributes: 1) whether they belong to the core platform or are demonstrator-specific, and 2) whether they are functional or non-functional requirements.

Functional requirements are one of the best-known types of requirements. They define the required behaviour of the system to be built, as reported by a hypothetical observer envisioning the inputs that the future system will accept and the outputs it will produce in response to those inputs, i.e. they define the capabilities that a product must provide to its users. Functional requirements are based on system objectives and address the critical task of ensuring that the expected functionality is implemented correctly in the final software. One of the main tools to achieve this goal is system testing, i.e. a mechanism to verify that the system performs the behaviour expressed in its requirements.
Non-functional requirements specify additional properties of AEGIS, other than functionality. They can be subcategorized into categories such as performance, design constraints (which can also be categorized under external interface requirements), logical database requirements, and “characteristics” (termed “attributes” in IEEE Std. 830) that don’t fit neatly into any of the other categories. Non-functional requirements can also describe quality attributes and design and implementation constraints that the product must have; they are thus more qualitative and may require a different approach for their elicitation. To identify the non-functional requirements, the model proposed by ISO/IEC 25010:2011 was adopted. According to that model, eight quality characteristics contribute to software product quality. The ISO/IEC 25010:2011 Software Product Quality model is illustrated in Figure 2.
Core requirements refer to all requirements associated with, and addressed by, the AEGIS platform. They are not bound to a specific demonstrator, but cater for the current and future needs associated with the services AEGIS aims to offer to all stakeholders horizontally. All core processing tasks represented by core requirements will be available to all pilots and stakeholders.
Demonstrator requirements are the requirements of each specific demonstrator. Demonstrator requirements refer to actions performed by the users or to processes supported by the applications to be developed on top of AEGIS.

Figure 2. ISO/IEC 25010:2011 Software Product Quality model

A table was created to document the extracted requirements; an extract is presented in Table 1 below.

Table 1. Requirements Listing

Id	Requirement Type	Requirement Nature	Requirement Description
UBI1	Core	Functional	AEGIS should be able to process sensor data, including environmental (indoor and outdoor), occupancy sensing and Air Quality monitoring, from installed physical devices.
KTH1	Demonstrator	Functional	Users should be able to choose a service to work on their data
GFT2	Core	Functional	AEGIS needs to have a data/knowledge base to handle traceability and submitted issues status
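The two-attribute classification (core vs. demonstrator, functional vs. non-functional) can be sketched in a few lines of Python. The rows are copied from Table 1; the function name and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# (id, requirement type, requirement nature), copied from Table 1.
REQUIREMENTS = [
    ("UBI1", "Core", "Functional"),
    ("KTH1", "Demonstrator", "Functional"),
    ("GFT2", "Core", "Functional"),
]


def classify(requirements):
    """Group requirement ids by the pair (type, nature)."""
    groups = defaultdict(list)
    for req_id, req_type, nature in requirements:
        groups[(req_type, nature)].append(req_id)
    return dict(groups)


groups = classify(REQUIREMENTS)
for (req_type, nature), ids in sorted(groups.items()):
    print(f"{req_type} / {nature}: {', '.join(ids)}")
```

Keying the groups on both attributes at once keeps the four possible combinations separate, so a non-functional demonstrator requirement added later would simply appear under its own key.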

The final list of user requirements was produced after having been validated within the context of a dedicated workshop involving pilot and technical partners. An extract of these requirements follows and is presented in Table 2 below. The provided extract refers only to core functional requirements.

Table 2. Core Functional Requirements

Id	Description (Detailed description of the requirement)
Analytics
CFR22	AEGIS should be able to process daily routines as self-reported from users or automatically extracted by wearables
CFR4	AEGIS has to support many analysis types (e.g. estimation of correlations between variables, linear regression, predictive analysis, clustering algorithms, simulations)
CFR41	AEGIS should be able to process sensor data (including environmental (temperature/humidity/luminance), occupancy sensing and Air Quality monitoring) from installed physical devices
Correlation
CFR10	AEGIS has to correlate datasets to geospatial data with their description
CFR11	AEGIS should work simultaneously with public and private (customers) data
CFR28	AEGIS should be able to correlate positional information (and additional information from wearables) with Public Health Information data and announcements, taking into account also time
Security / Privacy
CFR6	AEGIS has to implement security mechanisms as well as proper handling of privacy issues (e.g. in case of using private datasets)
CFR7	AEGIS needs to display different levels of information depending on who is accessing the data
CFR8	AEGIS should allow the creation of different users/groups and access rights for authorized system users

The extraction and analysis of the requirements led to the design of the AEGIS architecture, which went through a number of iterations before reaching its final design. The stable, final version 1 design of the AEGIS architecture is illustrated in Figure 3. The functional and non-functional requirements were translated into technical requirements, each of which was mapped to functionalities of components, so that the set of components comprising the holistic AEGIS architecture covers the complete set of functional, non-functional and technical requirements.

Figure 3. AEGIS conceptual architecture

The core of the AEGIS platform is its Big Data Processing Cluster. For this cluster, the project consortium opted to use the Hops platform[1]. Hops is a next-generation distribution of Apache Hadoop[2] that supports, among other features and services, Hadoop as a Service, project-based multi-tenancy, secure sharing of datasets across projects, extensible metadata with free-text search using Elasticsearch[3], and YARN[4] quotas for projects. Hopsworks is the User Interface built around Hops, providing graphical access to integrated services such as Spark, YARN, Elasticsearch, Kafka and Apache Zeppelin. HopsYARN is responsible for the resource management of the cluster. It is a distributed stateless Resource Manager that enables Hops to have no down-time, while also providing efficient resource management with consistent operations, security and data governance tools.

The storage capabilities of the platform will be provided by the AEGIS Data Store, which will be based on HopsFS, a new implementation of the Hadoop Distributed File System (HDFS) provided by Hops. HopsFS enables more scalable clusters because it supports multiple stateless NameNodes, with the metadata stored in an in-memory distributed database, which increases performance. The AEGIS Data Store will also be extended with additional storage solutions, such as solutions for storing linked metadata.

The AEGIS Data Harvester consists of all applications enabling the import of the data and metadata of the original data set, along with any transformations required. Since the original data sets may be provided in multiple formats, it is essential that a wide spectrum of data formats is supported. The AEGIS Data Annotator component refines the output of the Data Harvester, its main purpose being the enrichment of the metadata or linked data using predefined ontologies and vocabularies. Through semantic annotation, the concepts included in the selected subset will be related to well-defined semantics. Semantic annotations provide information ‘about’ the data, for example what the data means and the semantic relationships available from the domain model in which the data is defined. The purpose of semantically annotating a dataset is to create a context for the content and functionality of the data, so that it can be easily interpreted, combined and reused by computers.

The AEGIS Brokerage Engine is responsible for applying the policies on read and execution permissions, as defined in the AEGIS Data Policy, to the artefacts of the AEGIS platform, such as datasets, services and algorithms. The Brokerage Engine is also responsible for maintaining records of any action performed on these artefacts.
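The two responsibilities described for the Brokerage Engine, checking read/execute permissions and recording every action, can be illustrated with a small in-memory Python sketch. This is not the AEGIS implementation: the class, its methods, the policy layout and the artefact and user names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BrokerageEngine:
    """Hypothetical in-memory stand-in for a policy check plus action log."""
    # policies[artefact_id][user] -> set of granted permissions ("read", "execute")
    policies: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def grant(self, artefact_id: str, user: str, permission: str) -> None:
        self.policies.setdefault(artefact_id, {}).setdefault(user, set()).add(permission)

    def request(self, artefact_id: str, user: str, permission: str) -> bool:
        """Check the policy and record the attempt, whether allowed or denied."""
        allowed = permission in self.policies.get(artefact_id, {}).get(user, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "artefact": artefact_id,
            "user": user,
            "permission": permission,
            "allowed": allowed,
        })
        return allowed


engine = BrokerageEngine()
engine.grant("dataset-42", "alice", "read")               # hypothetical names
can_read = engine.request("dataset-42", "alice", "read")
can_execute = engine.request("dataset-42", "alice", "execute")
print(can_read, can_execute, len(engine.audit_log))
```

Logging denied requests as well as granted ones is a design choice of this sketch; it keeps the record of actions complete even when a policy blocks the action.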

The AEGIS Query Builder is a graphical tool for creating simple or complex queries in a user-friendly way. Its simple, easy-to-use interface facilitates query building even for complicated queries, allowing the user to choose from multiple data sources and apply filters with less effort. The results of the execution will serve as input to the AEGIS Algorithm Execution Container and/or to the AEGIS Visualizer.

The AEGIS Algorithm Execution Container is the component where selected or requested algorithms are executed. It consists of two processes, the Algorithm Parameteriser and the Algorithm Executioner. The Algorithm Parameteriser is a small process responsible only for passing the parameter values of the algorithm to be executed, where applicable and as selected by the user, to the main process, the Algorithm Executioner. The Algorithm Executioner is responsible for initialising and monitoring the execution of the selected algorithm, which includes communicating with the Big Data Processing Cluster to initiate the execution and waiting for the results, which will later be provided to the Visualizer.
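The split between the Algorithm Parameteriser and the Algorithm Executioner can be sketched as two small Python functions. A local thread pool stands in for the Big Data Processing Cluster, and the function names, the example algorithm and its parameters are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parameterise(defaults: dict, user_selected: dict) -> dict:
    """Parameteriser role: resolve the values chosen by the user over the defaults."""
    unknown = set(user_selected) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **user_selected}


def execute(algorithm, params: dict, cluster):
    """Executioner role: start the run on the cluster and wait for its result."""
    future = cluster.submit(algorithm, **params)
    return future.result()  # blocks until the execution has finished


def moving_average(values, window):
    """Hypothetical example algorithm: simple moving average."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# The thread pool is only a local stand-in for a remote processing cluster.
with ThreadPoolExecutor(max_workers=1) as cluster:
    params = parameterise({"values": [1, 2, 3, 4], "window": 2}, {"window": 3})
    result = execute(moving_average, params, cluster)

print(result)
```

In a real deployment the result would be handed to a visualisation component rather than printed; the sketch only shows the order of the two steps.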

The AEGIS Visualizer provides visualization capabilities on top of the content provided by the Algorithm Execution Container or the results of the query composed and executed by the Query Builder. It provides a variety of bar, line and scatter plots, charts, tables, and maps. The AEGIS Visualizer will also allow the user to quickly create and share flexible, dynamic dashboards.

The AEGIS Orchestrator is responsible for interconnecting the various services of the AEGIS platform. It facilitates the flow of information between services and the execution of workflows involving several distinct services of the platform, especially those that combine an integrated Hopsworks service with the rest of the platform services. The AEGIS Orchestrator will thus act as a mediator between services as needed, using the interfaces that the services expose.

Finally, the AEGIS Front-End is the upper layer of the AEGIS platform providing an innovative User Interface for the AEGIS stakeholders. The AEGIS Front-End will provide a user-friendly interface, facilitating the navigation between the AEGIS platform functionalities in a flexible, easy-to-use and secure way.

Each AEGIS platform component was mapped to the set of functionalities it undertakes, either stand-alone or in combination with another component. Conversely, each technical requirement (stemming from corresponding functional and/or non-functional requirements) is allocated to, and undertaken by, one platform component (or more, in collaboration). This mapping between components and requirements was documented in the form of a table, an extract of which is provided in Table 3 below.

Table 3. Mapping between Requirements & Components

ID	Need	Priority	Component
TR7	Process data from structured sources	High	Data Harvester, Data Annotator, Algorithm Execution Container, AEGIS Data Store
TR8	Process data from semi-structured sources	High	Data Harvester, Data Annotator, Algorithm Execution Container, AEGIS Data Store
TR12	Produce RDF Triples	High	Data Harvester, Data Annotator
TR14	Handle big data scalability	High	AEGIS Data Store, Query Builder, Algorithm Execution Container, Big Data Processing Cluster
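The two-way mapping between requirements and components can be checked mechanically. The Python sketch below uses the rows of Table 3; the function names are illustrative assumptions. It inverts the table to list the requirements each component undertakes and reports any requirement left without a component.

```python
# Requirement id -> components, copied from Table 3.
MAPPING = {
    "TR7": ["Data Harvester", "Data Annotator",
            "Algorithm Execution Container", "AEGIS Data Store"],
    "TR8": ["Data Harvester", "Data Annotator",
            "Algorithm Execution Container", "AEGIS Data Store"],
    "TR12": ["Data Harvester", "Data Annotator"],
    "TR14": ["AEGIS Data Store", "Query Builder",
             "Algorithm Execution Container", "Big Data Processing Cluster"],
}


def by_component(mapping: dict) -> dict:
    """Invert the table: component -> requirement ids it undertakes."""
    inverted = {}
    for req_id, components in mapping.items():
        for component in components:
            inverted.setdefault(component, []).append(req_id)
    return inverted


def uncovered(mapping: dict) -> list:
    """Requirement ids that no component has been allocated to."""
    return [req_id for req_id, components in mapping.items() if not components]


components = by_component(MAPPING)
for name in sorted(components):
    print(f"{name}: {', '.join(components[name])}")
print("uncovered:", uncovered(MAPPING))
```

For this extract every requirement is covered, so the uncovered list is empty; on the full table the same check would flag any requirement that was never allocated.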

Blog post authors: UBITECH

[1] https://www.hops.io/

[2] http://hadoop.apache.org/

[3] https://www.elastic.co/products/elasticsearch

[4] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

AEGIS Data Value Chain & Methodology towards data-driven innovation in PSPS

aegis-admin — Mon, 25 Sep 2017 09:45:57 +0000

This blog post discusses the high-level methodology of AEGIS.

Step 1: Understanding the stakeholders and their data

One of the challenges we faced during the project’s early steps was to identify all involved parties in the wide set of domains we refer to as Public Safety and Personal Security (PSPS). At first glance, PSPS might look like a public sector responsibility; however, a plethora of private enterprises and organisations are directly or indirectly involved, forming a strong market. Indeed, we identified 11 broad stakeholder groups, each of which can in turn be broken down into more specific sub-groups, an indicative set of which is presented in Table 1.

Table 1: AEGIS stakeholders

STAKEHOLDER GROUP	TYPES
SG1 – Smart Insurance	Insurance Companies, Financial institutions, Insurance brokers
SG2 – Smart home	Electronics, Smart home technology providers, Safety and security, Energy and Utilities
SG3 – Smart Automotive	Car manufacturer, Car dealers, Electronics, GPS Navigation System Providers
SG4 – Health	Nursing homes, Hospitals, Doctors
SG5 – Public safety / law enforcement	Police, Emergency Medical Service, Fire Service, Search and Rescue, Military
SG6 – Research communities	Students, Professors, Research institutes
SG7 – Road Construction companies
SG8 – Public sector	Municipalities, Public Authorities
SG9 – IT Industry	IT software companies, Data scientists, Data Industries
SG10 – Smart City	Electronics, Smart City technology providers, Smart City planners
SG11 – End Users	Citizens

As a next step in outlining how AEGIS will connect these stakeholders through innovative data-driven solutions and enable the creation of new data value chains, we investigated the data produced and consumed by each group. We identified the data that each of these stakeholders produces or owns, and we investigated how they leverage it now, or why they do not. We also tried to get a glimpse of the needs and opportunities stemming from data that they would be interested in, but that is either currently unavailable or lacks an easy way to be harvested, processed and mined for insights. Detailed reports on this work can be found in our deliverables (see References). Stakeholders either play or could play a dual role, i.e. as data producers and data consumers, depending on the PSPS scenario to be implemented and the nature of the services to be provided.

More details about AEGIS cases and scenarios will be provided in upcoming blog posts, but an indicative value chain that emerges from simply looking at the available data is as follows. Insights on road conditions can be extracted by analysing real-time driving sensor data (coming from the Smart Automotive group, SG3), combined with data from road construction companies (SG7) and public data on car accidents provided by Traffic Police and Emergency Medical Services (SG5). These insights can be leveraged by (a) insurance companies (SG1), towards enhanced car insurance plans and personalised driver notifications on road hazards that also depend on current weather and other contextual information, and (b) smart city planners (SG10) collaborating with municipalities (SG8), towards implementing a smart lights network that aims to reduce car accidents caused by poor lighting conditions and road deficiencies inside the city.

The potential of establishing new data value chains among stakeholders is truly immense, and the integrated AEGIS value chain is potentially so diverse that it may even extend to the complete Big Data Ecosystem [1] (Figure 1).

Step 2: Refining the Big Data Value Chain towards the AEGIS needs

As shown in Figure 1, at the center of the Big Data Ecosystem, at the core of all stakeholder value chains, stands the Data Value Chain which controls the way data from one stakeholder are transformed to value distributed to other stakeholders. The Big Data Value Chain, as defined by Edward Curry [1], comprises five main steps, which are adopted at a high-level and customised to the AEGIS needs, as follows:

Figure 2: Big Data Value Chain

Data Acquisition. AEGIS builds upon a large number of diverse data sources, which include real-time streaming data from home/automotive/city/wearable sensors, as well as satellites, proprietary SQL and NoSQL databases, and free text data from social media and information sites, resulting in many technical requirements to be further investigated.
Data Analysis. In the scope of AEGIS, data analysis involves a variety of data mining methods, including but not limited to stream data mining and free text mining, which in turn entail time-series analysis, natural language processing, machine learning, etc. Each of these processes brings a number of challenges, such as time series breakout detection and stream frequent pattern mining (for sensor data), multilingualism and lack of structure (in free text), and the lack of agreed-upon schemas and data standards across almost all domains. Moreover, although PSPS applications require high accuracy levels, there are inherent data features, e.g. the presence of natural language text to be analysed, that render the required analysis not only more labour-intensive but also error-prone.

Another challenge AEGIS aims to tackle is that the criteria used for the analysis of big data often cannot (and, under some circumstances, should not) be known a priori, but only at analysis time, in order to ensure that the extracted value is not limited by early erroneous decisions. Hence, exploratory analysis is at the core of the data analysis step. Exploratory analysis builds on the fact that, when analysis starts, the questions to be answered are not (always) known. Questions only emerge a posteriori, together with the extracted answers, which is the case in many of the applications and services envisioned in AEGIS.

Data Curation. In AEGIS this is an umbrella term for various processes regarding data organisation, validation, quality evaluation, provenance and multiple-purpose annotation. More details on this can be found in our deliverable, so we mention here only three important points:
1. The importance of proper definition and measurement of data quality to avoid compromising the value of the final output.
2. The need to employ traceable and repeatable curation processes, so that data curation steps can be verified against new versions of the data and any newly required steps can be detected.
3. The need to avoid irreversible data restructuring. This follows from the previously explained need to enable exploratory analysis, which by definition forbids the application of lossy data transformations and compressions, since these may impede future analyses.
Data Storage, i.e. “the persistence and management of data in a scalable way that satisfies the needs of applications”, which will be discussed in future blogposts presenting the AEGIS architecture.
Data Usage. Inside AEGIS, this involves various data-driven business activities, the provision of smart decision support and analytics applications, visual analytics and real-time data exploration across all PSPS related fields, to be showcased through the three project demonstrators.
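The curation points listed under Data Curation above (traceable, repeatable, and not irreversible) can be illustrated with a short Python sketch. It is only one possible approach under stated assumptions: the function names, the record layout and the temperature example are hypothetical. Each step works on a copy, adds fields instead of overwriting them, and is logged with a fingerprint of its input and output.

```python
import hashlib
import json


def fingerprint(records: list) -> str:
    """Stable SHA-256 digest of a list of JSON-serialisable records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def curate(records: list, steps: list):
    """Apply named steps in order on copies; return curated data and a log."""
    current = [dict(r) for r in records]  # the raw input is never modified
    log = []
    for name, step in steps:
        before = fingerprint(current)
        current = [step(dict(r)) for r in current]
        log.append({"step": name, "input": before, "output": fingerprint(current)})
    return current, log


def add_celsius(record: dict) -> dict:
    """Additive step: keep the original field and add a derived one."""
    record["temp_c"] = round((record["temp_f"] - 32) * 5 / 9, 1)
    return record


STEPS = [("add_celsius", add_celsius)]
raw = [{"sensor": "s1", "temp_f": 68.0}]  # hypothetical sensor reading
curated, log = curate(raw, STEPS)
print(curated)
print(log[0]["step"])
```

Running the same steps on the same input yields the same log, which is what makes the process repeatable; a changed input fingerprint signals that a new version of the data has arrived and the steps should be re-verified.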

Step 3: Outlining the AEGIS methodology towards data-driven innovation

In the first two steps we identified the AEGIS stakeholders and their data, as well as the actual big data tasks that need to be performed and the challenges to be addressed in order to accomplish the provision of smarter, innovative data-driven services in the PSPS domains. To make this process more concrete, we outlined the expected interactions of the users with AEGIS, in various settings and for various purposes, and collected a set of features and functionalities enabling them.

Naturally, users will interact with the AEGIS system differently, depending on their reason for using its offerings and their background. As the project progresses, specific roles will be designed, but for now the AEGIS users have been grouped under the following high-level categories:

Data provider: The user’s main objective is to make her/his data available for processing or consuming in the AEGIS system.

Service provider: The user’s main objective is to create a service on top of PSPS data that is available through the AEGIS system, leveraging the set of data processing, analysis, visualisation etc tools provided by the system. In this context, a service may be data (to be consumed as-a-service), visualisations, reports, dashboards, RESTful API endpoints etc.

Service Consumer: The user’s main objective is to consume a service offered through AEGIS. In this context, this includes: accessing a visualisation through a link to AEGIS, downloading a report from AEGIS, performing requests to an AEGIS API endpoint etc.

Administrator: This user has advanced capabilities in the AEGIS system and may perform certain jobs that are not offered by the core platform (e.g. through advanced data curation tools that enable more fine-grained data manipulation and/or schema updates), that require extensions of the current system (e.g. adding a new custom algorithm) etc. This is an auxiliary role to highlight the need for non-automated functionalities in certain tasks.

Categories are not mutually exclusive, but are used to better separate and describe the various workflows enabled in the AEGIS system. In an end-to-end usage of the AEGIS system, a user may transparently move between the categories of service provider, service consumer and data provider.

Figure 3: Integrated AEGIS methodology (simplified)

Figure 3 presents the integrated AEGIS methodology diagram envisioning the high-level workflows of the above user categories and outlines the way AEGIS will materialize the big data value chain.

Remember to check the references for more details on the work performed so far and stay tuned for our upcoming blog posts!

References

[1] E. Curry, “The Big Data Value Chain: Definitions, Concepts, and Theoretical Approaches,” in New Horizons for a Data-Driven Economy, Springer, 2016, pp. 29–37.
AEGIS-D1.1 – Domain Landscape Review and Data Value Chain Definition
AEGIS-D1.2 – The AEGIS Methodology and High Level Usage Scenarios-v1.0

Blog post authors: NTUA