Datasets and Time Series

Organizations are revamping their data science and engineering strategies to acquire the skills needed to deploy artificial intelligence (AI) and machine learning (ML) systems.

Companies are now hiring legions of data scientists and other data experts to build artificial intelligence, machine learning, and deep learning (DL) applications; analytics translators trained to bridge business and technical domains; and qualified front-line personnel who can use advanced technological applications effectively.

One role in particular, that of data scientist, has been especially difficult for leaders to fill as competition for this elusive expertise has grown.

[Figure: Data science process]

Data science and engineering is a transdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract causal knowledge from structured and unstructured data, and apply causal insights from data in a wide range of application areas.

The combination of data science and engineering can help design and build data/information processing systems in information technology, digital electronic computers, artificial intelligence, machine learning, deep learning and even data analysis systems.

Data science and engineering aims to unify computer and information science, mathematics and statistics, data analysis, and empirical science, all within frameworks of data typology (ontology), to understand and analyze real-world data.

[Figure: Evolution of data science]

Datasets in the context of data science and engineering

A dataset is a collection of one or more tables, diagrams, points and/or objects grouped together either because they are stored in the same location or because they are related to the same subject.

Several classic datasets have been widely used in the statistical literature:

  • Iris Flower Dataset – Multivariate dataset introduced by Ronald Fisher (1936).

  • MNIST Database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.

  • Categorical Data Analysis – Datasets used in the book, An Introduction to Categorical Data Analysis.

  • Robust statistics – Datasets used in robust regression and outlier detection (Rousseeuw and Leroy, 1986). Provided online at the University of Cologne.

  • Time Series – The data used in Chatfield’s book, The Analysis of Time Series, is provided online by StatLib.

  • Extreme Values – The data used in the book, An Introduction to Statistical Modeling of Extreme Values, is provided online by Stuart Coles, the author of the book.

  • Bayesian Data Analysis – The data used in the book is provided online by Andrew Gelman, one of the book’s authors.

  • BUPA liver data – Used in several papers in the machine learning (data mining) literature.

  • Anscombe's Quartet – Small dataset illustrating the importance of graphing data to avoid statistical fallacies.
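The point of the last dataset in the list can be checked in a few lines. The sketch below hard-codes the four x/y sets of Anscombe's quartet and computes their means and Pearson correlations with the standard library only; the names `QUARTET` and `pearson` are illustrative choices, not part of any library. All four sets share a mean of x of 9, a mean of y of about 7.50, and a correlation of about 0.82, even though their scatter plots look completely different.

```python
from statistics import mean

# Anscombe's quartet: sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Summary statistics are nearly identical across the four sets.
for name, (xs, ys) in QUARTET.items():
    print(name, mean(xs), round(mean(ys), 2), round(pearson(xs, ys), 2))
```

Only a plot reveals that set II is a curve, set III has a single outlier on an otherwise perfect line, and set IV owes its correlation to one extreme point.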

Time series in the context of data science and engineering

A time series is a series of data points indexed (or listed or graphed) in time order, often plotted as a run chart (a temporal line chart). Most commonly it is a sequence taken at successive, equally spaced points in time, that is, a sequence of discrete-time data. Examples include seasonal precipitation, ocean tide heights, sunspot counts, and the daily value of the Dow Jones Industrial Average; sequences of letters and words can be treated in the same way.
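As a minimal sketch of that definition, a time series can be represented as a list of (timestamp, value) pairs. The tide heights below are made-up illustrative numbers, and the helper names `is_time_ordered` and `is_equally_spaced` are hypothetical.

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 1)
step = timedelta(hours=6)
heights_m = [1.2, 0.4, 1.1, 0.5, 1.3, 0.6]  # made-up tide heights in meters

# One observation every six hours, indexed by its timestamp.
series = [(start + i * step, h) for i, h in enumerate(heights_m)]


def is_time_ordered(points):
    """True if the timestamps are strictly increasing."""
    return all(a[0] < b[0] for a, b in zip(points, points[1:]))


def is_equally_spaced(points):
    """True if every gap between consecutive timestamps is the same."""
    gaps = {later[0] - earlier[0] for earlier, later in zip(points, points[1:])}
    return len(gaps) <= 1


print(is_time_ordered(series), is_equally_spaced(series))
```

Dropping one observation leaves the series ordered but no longer equally spaced, which is why many methods require resampling or gap filling first.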

Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, and communications engineering, and more broadly in any area of applied science and engineering that involves temporal measurements.

In the context of pattern recognition and machine learning, time series analysis can be used for clustering, classification, query by content, and anomaly detection, as well as forecasting.
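Of these tasks, anomaly detection is perhaps the easiest to illustrate. The sketch below flags a point when it lies more than a chosen number of standard deviations from the mean of the preceding window. This is one simple heuristic among many, not a standard named in the text; the function name, the window size, the threshold, and the readings are all illustrative assumptions.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows, where the z-score is undefined.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up sensor readings with one obvious spike at index 7.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 9.8, 25.0, 10.1, 9.9, 10.2]
print(rolling_zscore_anomalies(readings))  # prints [7]
```

Note that the spike inflates the standard deviation of the windows that follow it, so the points right after an anomaly are harder to flag; robust statistics such as the median are a common remedy.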

A time series has a natural temporal ordering, and the data may be real-valued continuous data, discrete numeric data, or discrete symbolic data (sequences of characters, such as letters and words). The primary purpose of time series analysis is often prediction.

The Granger causality test, which is a statistical test of predictive power rather than of true causality, is widely applied in econometrics to determine whether one time series is useful in forecasting another. Rather than testing whether X causes Y, it tests whether X helps predict Y.

When time series X Granger-causes time series Y, the patterns in X are approximately repeated in Y after some time lag. Thus, past values of X can be used to predict future values of Y.
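The idea can be sketched as a comparison of two regressions: one predicting Y from its own past, and one that also uses the past of X. If adding X shrinks the residual error substantially, the F statistic is large. The code below is a simplified, standard-library-only illustration on synthetic data in which Y is built from the previous value of X. The helpers `ols_rss` and `granger_f` are hypothetical names, and a real analysis would normally use an established statistics package and report a p-value.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(x, y, lag=1):
    """F statistic for 'past x improves the prediction of y beyond past y'."""
    times = range(lag, len(y))
    targets = [y[t] for t in times]
    own_lags = [[1.0] + [y[t - j] for j in range(1, lag + 1)] for t in times]
    both_lags = [
        row + [x[t - j] for j in range(1, lag + 1)]
        for row, t in zip(own_lags, times)
    ]
    rss_restricted = ols_rss(own_lags, targets)
    rss_unrestricted = ols_rss(both_lags, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_restricted - rss_unrestricted) / lag) / (rss_unrestricted / dof)


# Synthetic data: y follows x with a one-step delay plus a little noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f"X -> Y: F = {f_x_to_y:.1f}")
print(f"Y -> X: F = {f_y_to_x:.1f}")
```

On this data the statistic is far larger in the X to Y direction than in the reverse, which matches how the series was constructed. It says nothing about whether X truly causes Y in any deeper sense.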

Time series data is one type of panel data. Panel data is the general class, a multidimensional dataset, whereas a time series dataset is a one-dimensional panel.
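A small sketch makes the relationship concrete. The panel below is keyed by (entity, period); fixing the entity yields a time series, and fixing the period yields a cross-section. The household names and income figures are invented purely for illustration, as are the helper names.

```python
# Hypothetical panel: income (arbitrary units) of two households over three years.
panel = {
    ("household_A", 2019): 40,
    ("household_A", 2020): 42,
    ("household_A", 2021): 45,
    ("household_B", 2019): 30,
    ("household_B", 2020): 31,
    ("household_B", 2021): 29,
}


def time_series(data, entity):
    """One entity followed over time: a one-dimensional slice of the panel."""
    return [(t, v) for (e, t), v in sorted(data.items()) if e == entity]


def cross_section(data, period):
    """All entities observed at a single point in time."""
    return {e: v for (e, t), v in data.items() if t == period}


print(time_series(panel, "household_A"))
print(cross_section(panel, 2020))
```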

The tools for studying time series data are too numerous to list in full; they include:

  • Principal component analysis (or empirical orthogonal function analysis)

  • Singular spectrum analysis

  • General state space models

  • Dynamic time warping

  • Cross-correlation

  • Dynamic Bayesian networks

  • Time-frequency analysis techniques

  • Machine learning

  • Artificial neural networks

  • Support vector machine

  • Fuzzy logic

  • Gaussian process

  • Hidden Markov model
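To give a flavor of one tool from the list, the following is a minimal sketch of dynamic time warping. It uses the classic dynamic-programming recurrence with absolute difference as the local cost and no window constraint; the function name `dtw_distance` is an illustrative choice.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b holds
                cost[i][j - 1],      # b advances, a holds
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same shape played at a different speed aligns perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # prints 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # prints 2.0
```

Because the alignment may stretch or compress time, two series with the same shape but different pacing receive a small distance, which a point-by-point comparison would not give.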

As a combination of time-series analysis and cross-sectional analysis, panel data are often more informative than either alone and are widely used in the social sciences to:

  1. understand the causality of events that occurred in the past and how they lead to outcomes seen in later waves of data, such as the effect of a new law on crime statistics, the effect of a natural disaster on births and deaths years later, or the reaction of stock prices to merger and earnings announcements;

  2. track trends and changes over time by surveying the same respondents in multiple waves, for example to measure poverty and income inequality by tracking individual households;

  3. monitor business profitability and risk, understand the effect of economic shocks, or determine the factors that most affect unemployment;

  4. perform regression analysis to determine how much specific factors, such as the price of a commodity, interest rates, or particular industries or sectors, influence the price movement of an asset;

  5. conduct event studies that analyze the impact of an event, such as Covid-19, on an industry, a sector, or the whole market.

Data science and engineering, which is about creating value from data, studies time series data to better understand relationships in the data, discover insights, predict future trends and behaviors, and make smart decisions.

Conclusion

Organizations are struggling to compete with startups and tech giants like Google, Meta, and Twitter to attract or retain top data scientists and the next generation of graduates.

Because of the Covid-19 pandemic, tech start-ups may struggle to survive, which could make it easier for incumbents to acquire those hard-to-find skills.

[Figure: How AutoML works]

There are also new tools that have the potential to fill the data science talent gap and increase the efficiency of analytics teams. Automated machine learning (ML) tools, commonly referred to as AutoML, are designed to automate many stages of machine learning model development.
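The core idea behind such tools can be sketched in miniature: try several candidate models, score each on held-out data, and keep the best. The toy example below does this for three elementary one-step forecasters on a synthetic trend. It is only an illustration of automated model selection, not a depiction of how any particular AutoML product works, and every name in it is hypothetical.

```python
def mean_forecast(history):
    """Predict the average of everything seen so far."""
    return sum(history) / len(history)


def naive_forecast(history):
    """Predict that the next value equals the last one."""
    return history[-1]


def drift_forecast(history):
    """Extend the straight line from the first to the last observation."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope


CANDIDATES = {
    "mean": mean_forecast,
    "naive": naive_forecast,
    "drift": drift_forecast,
}


def one_step_mae(model, series, n_train):
    """Mean absolute error of one-step-ahead forecasts after n_train points."""
    errors = [abs(model(series[:t]) - series[t]) for t in range(n_train, len(series))]
    return sum(errors) / len(errors)


def select_model(series, n_train):
    """Score every candidate and return the name of the best one."""
    scores = {name: one_step_mae(fn, series, n_train) for name, fn in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


trend = [2.0 * t + 1.0 for t in range(12)]  # synthetic linear trend
best, scores = select_model(trend, n_train=6)
print(best, scores)
```

On a pure linear trend the drift forecaster wins, as expected. Real tools extend the same loop to feature engineering, hyperparameter search, and much larger model families.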

Data science and engineering require deep thinking and multidisciplinary collaboration.

To stay competitive, organizations will be better served by not putting all of their resources into the fight for scarce technical talent, and instead focusing at least some of their attention on building their own corps of AutoML practitioners, who will make up a substantial proportion of the talent pool over the next decade.
