Algorithms for Basic Statistics and Knowledge Extraction

When talking about data analysis, it can be tempting to jump straight into the analysis itself. Real-world data, however, is typically noisy, enormous in volume, and drawn from heterogeneous sources, each with its own rules for representing data. Before attempting the different analysis tasks, it makes sense to pre-process the data so that it is more accessible to those later tasks. Some of the questions we might ask are: What types of attributes does our data have? What kind of values does each attribute have? Are they discrete or continuous? How are the values distributed? Are there outliers? How similar are they?
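To make a couple of these questions concrete, here is a minimal Python sketch that inspects the value types of one attribute and flags outliers with the common interquartile-range rule (Tukey's fences). The helper names `value_types` and `iqr_outliers`, the sample readings, and the choice of the IQR rule are our own assumptions for illustration; they are not taken from the AEGIS deliverable.

```python
from collections import Counter
from statistics import quantiles


def value_types(values):
    """Count the Python type of each non-missing value in an attribute."""
    return Counter(type(v).__name__ for v in values if v is not None)


def iqr_outliers(values, k=1.5):
    """Return the values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings: 48.0 is an injected anomaly, None is a gap.
readings = [12.1, 11.8, 12.4, None, 12.0, 11.9, 12.3, 48.0]
present = [v for v in readings if v is not None]

types = value_types(readings)      # which kinds of values does the attribute hold?
outliers = iqr_outliers(present)   # which values look suspicious?
print(types, outliers)
```

A check like this is only a first pass: whether a flagged value is an error or a genuine extreme still depends on the domain.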

In order to efficiently answer some of the above-mentioned questions, AEGIS proposes a set of algorithms that cover most of the complex aspects of the tasks at hand. These algorithms were selected with the following criteria in mind: a) they follow a general approach that covers the AEGIS platform requirements as well as a wider variety of problems, b) they have proven their ability and robustness in various domains over the years, and c) they have been implemented in a commonly used software framework or library.

In this effort, the consortium has focused on the following software implementations and libraries:

Out of those, we identified two main categories, Basic Statistics Algorithms and Knowledge Extraction Algorithms, and went on to briefly describe each algorithm, identify the libraries and software implementations that provide it, and state its main purpose.

Regarding Basic Statistics, an analysis has been done on the following algorithms:

  • Measuring central tendencies
  • Measuring dispersion of data
  • Correlation (Pearson’s and Spearman’s Correlation)
  • Stratified sampling
  • Hypothesis testing: Pearson’s chi-squared tests for goodness of fit
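As a rough illustration of what the items in this list compute, the sketch below works through each of them using only the Python standard library. All helper names (`pearson`, `ranks`, `spearman`, `stratified_sample`, `chi_squared_statistic`) and all sample data are hypothetical and chosen for readability; in practice one would normally rely on the implementations in an established statistics library.

```python
import random
from collections import defaultdict
from statistics import mean, median, mode, pstdev


def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of the spreads.
    Assumes neither input is constant (otherwise the divisor is zero)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction (at least one row) from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample


def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Central tendency and dispersion on made-up data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data), median(data), mode(data), pstdev(data))

# y grows monotonically but not linearly with x.
xs, ys = [1, 2, 3, 4, 5], [1, 4, 9, 16, 25]
print(pearson(xs, ys), spearman(xs, ys))

# Two strata of unequal size, sampled at 25% each.
rows = [("a", i) for i in range(8)] + [("b", i) for i in range(4)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.25)

# 60 hypothetical die rolls tested against a fair die (10 expected per face).
chi2 = chi_squared_statistic([8, 12, 9, 11, 10, 10], [10] * 6)
print(len(sample), chi2)
```

The correlation pair shows why both coefficients are on the list: Spearman's works on ranks, so it reports a perfect monotone relationship here, while Pearson's stays below 1 because the relationship is not linear. For the goodness-of-fit test, the statistic is then compared with a chi-squared critical value for the appropriate degrees of freedom (5 in the die example).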

In terms of Knowledge Extraction Algorithms, we have identified the following list of sub-categories, each of which includes various methods for knowledge extraction:

  • Feature extraction, dimensionality reduction, and natural language processing
  • Clustering methods
  • Classification-regression methods
  • Recommendation systems
  • Expert systems

For a more detailed view of the AEGIS work on the identification of algorithms, you can download our deliverable D2.2 ‑ AEGIS Data Value Chain Bus Definition and Data Analysis Methods.

Blog post authors: Suite5