10 data science terms every analyst should know


One area that can be overwhelming for newcomers is data science. The term “data science” itself can be confusing as it is an umbrella term that covers many sub-fields: machine learning, artificial intelligence, natural language processing, data mining… the list goes on.

Within each of these subfields, we have a plethora of terminology and industry jargon that overwhelms newcomers and discourages them from pursuing a career in data science.

When I first joined the field, I had to juggle learning the techniques and keeping up with research and advances in the field, all while trying to understand the lingo. Here are ten basic terms every data scientist should know to build and develop a data science project.

10 data science terms you need to know

  1. Model
  2. Overfitting
  3. Underfitting
  4. Cross validation
  5. Regression
  6. Parameter
  7. Bias
  8. Correlation
  9. Hypothesis
  10. Outlier


1. Model

One of the most important data science terms you’ll hear quite often is “model”: training the model, improving the efficiency of the model, the behavior of the model, and so on. But what is a model?

Mathematically speaking, a model is a specification of a probabilistic relationship between different variables. Simply put, a model is a way to describe how two or more variables behave together.

Because the term “modeling” can be vague, “statistical modeling” is often used to describe modeling done specifically by data scientists.

2. Overfitting

Another way to describe models is how well they fit the data you apply them to.

Overfitting occurs when your model captures too much detail about the training data, including its noise. You end up with an overly complex model that fits the training data closely but is difficult to apply to any other data.


3. Underfitting

Underfitting (the opposite of overfitting) occurs when the model does not capture enough information about the data. Either way, you end up with an ill-fitting model.

One of the skills you will need to acquire as a data scientist is knowing how to strike the right balance between overfitting and underfitting.
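To make that balance concrete, here is a minimal sketch on toy data. It uses a k-nearest-neighbors average only because a single number, k, controls how complex the model is; the data, the alternating +2/-2 "noise" and the function names are all made up for illustration.

```python
def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_squared_error(train_x, train_y, xs, ys, k):
    """Average squared gap between the k-NN predictions and the true values."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Toy training data: the true relationship is y = x, plus a made-up
# alternating "noise" of +2 / -2 so the example is fully reproducible.
train_x = list(range(20))
train_y = [x + (2 if x % 2 == 0 else -2) for x in train_x]

# Unseen data that follows the true relationship, between the training points.
test_x = [x + 0.5 for x in range(19)]
test_y = test_x[:]

for k in (1, 4, 20):
    train_mse = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_mse = mean_squared_error(train_x, train_y, test_x, test_y, k)
    print(f"k={k:2d}  train MSE={train_mse:6.2f}  test MSE={test_mse:6.2f}")
```

With k=1 the model memorizes every training point, noise included, so its training error is zero but it does worse on the unseen points (overfitting). With k=20 it predicts the overall average everywhere and misses the trend entirely (underfitting). The middle setting does best on the unseen data.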

4. Cross validation

Cross-validation is a way to assess how a model behaves when you ask it to make predictions on data other than the training data you used to build it. This is a big concern for data scientists, as a model will often perform well on training data but turn out to be too noisy when applied to real data.

There are different ways to apply cross-validation to a model; the three main strategies are:

  1. The holdout method – the data is divided into two sections, one for building the model and one for testing it.

  2. K-fold validation – an improvement over the holdout method. Instead of dividing the data into two sections, you divide it into k sections; each section is used once for testing while the others are used for training, which gives a more reliable estimate.

  3. Leave-one-out cross-validation – the extreme case of k-fold validation. Here, k equals the number of data points in the dataset you are using.
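The three strategies above can be sketched with one small helper. This is a hedged illustration, not a library implementation: the function names and the toy targets are hypothetical, and the "model" is deliberately trivial (it always predicts the mean of its training targets).

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n points.

    In practice you would usually shuffle the data first; that step is
    left out here to keep the output easy to follow.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_val_mse(y, k):
    """Average test error over k folds for a model that predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)


targets = [2, 4, 6, 8, 10, 12]  # made-up target values

print("3-fold:", cross_val_mse(targets, 3))
print("leave-one-out:", cross_val_mse(targets, len(targets)))
```

A single pair from `k_fold_indices` is what the holdout method uses, and setting k to the number of data points turns the same function into leave-one-out cross-validation.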


5. Regression

Regression is one of the simplest and most basic supervised machine learning approaches. In regression problems, you have two kinds of values: a target value (also called the criterion variable) and the other values, called predictors, that are used to estimate it.

For example, we can look at the labor market. The ease or difficulty of obtaining a job (criterion variable) depends on the demand for the job and the supply of it (predictors).

There are different types of regression to match different applications; the simplest are linear and logistic regressions.
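As a rough sketch of the linear case, the snippet below fits a straight line to one predictor with ordinary least squares. The numbers are made up, and `fit_line` is a hypothetical helper written for this example rather than a function from any library.

```python
def fit_line(predictor, target):
    """Least-squares fit of target = slope * predictor + intercept."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(target) / n
    # Slope: how much the target moves, on average, per unit of the predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, target))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations of one predictor and one target.
predictor = [1, 2, 3, 4, 5]
target = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(predictor, target)
print(f"target is roughly {slope:.2f} * predictor + {intercept:.2f}")

# Use the fitted line to estimate the target for an unseen predictor value.
print("estimate at predictor=6:", slope * 6 + intercept)
```

Logistic regression follows the same idea, but the target is a category (for example, hired or not hired) rather than a number.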

6. Parameter

The term “parameter” can be confusing because it has slightly different meanings depending on the context in which you use it. In statistics, a parameter describes a property of a probability distribution (for example, its shape or its scale). In data science and machine learning, parameters usually refer to the values a model learns from the training data, such as the coefficients of a linear regression.

In machine learning, there are two types of models: parametric and non-parametric models.

  1. Parametric models have a fixed number of parameters that is not affected by the amount of training data. Linear regression is considered a parametric model.

  2. Non-parametric models do not have a fixed number of parameters, so the complexity of the model grows with the amount of training data. The best-known example of a non-parametric model is the k-nearest neighbors (KNN) algorithm.
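One way to see the difference is to count what each kind of model has to keep after training. The sketch below is an illustration under simple assumptions: `fit_line` and `fit_knn` are hypothetical helpers, and the data is generated from a made-up rule.

```python
def fit_line(xs, ys):
    """Parametric: the whole dataset is summarized as two numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_knn(xs, ys):
    """Non-parametric: k-nearest neighbors keeps every training point
    so it can look up the closest ones at prediction time."""
    return list(zip(xs, ys))


for n in (10, 100, 1000):
    xs = list(range(n))
    ys = [2 * x + 1 for x in xs]  # made-up relationship
    line_size = len(fit_line(xs, ys))
    knn_size = len(fit_knn(xs, ys))
    print(f"{n:5d} training points -> line keeps {line_size} values, KNN keeps {knn_size} points")
```

The fitted line always boils down to a slope and an intercept, while what KNN stores grows with every training point you add.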

7. Bias

In data science, we use bias to refer to a systematic error in the data. Bias appears in the data because of how we sample and estimate. When we choose data to analyze, we usually draw a sample from a larger pool. The sample you select could be biased, meaning it is an inaccurate representation of that pool.

Since the model we are training only knows the data we provide to it, the model will only learn what it can see. That’s why data scientists need to be careful to create unbiased models.
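Here is a small, hedged illustration of sampling bias. The "pool" is just the numbers 1 to 100, standing in for any measurement you care about, and the biased sample only ever sees the 20 largest values. The variable names and the seed are arbitrary choices for this sketch.

```python
import random


def mean(values):
    return sum(values) / len(values)


# Hypothetical pool of data: the values 1..100.
population = list(range(1, 101))
true_mean = mean(population)

# Biased sample: only the 20 largest values are ever observed.
biased_sample = sorted(population)[-20:]

# Random sample: every member of the pool has the same chance of being picked.
rng = random.Random(0)  # fixed seed so the run is repeatable
random_sample = rng.sample(population, 20)

print("pool mean:         ", true_mean)
print("biased sample mean:", mean(biased_sample))
print("random sample mean:", mean(random_sample))
```

Any model trained only on the biased sample would learn that typical values are far higher than they really are in the pool.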


8. Correlation

In general, we use correlation to denote the degree to which two or more events occur together. For example, if cases of depression increase in cold regions, there might be some correlation between cold and depression.

Events are often correlated to different degrees. For example, following a recipe and ending up with a delicious dish may be more strongly correlated than depression and cold. We measure this degree with the correlation coefficient.

The correlation coefficient ranges from -1 to 1. When it is close to one, the two events in question are strongly correlated, whereas if it is, say, 0.2, the events are weakly correlated. The coefficient can also be negative. In this case, there is an inverse relationship between the two events. For example, if you eat well, your chances of becoming obese will decrease. There is an inverse relationship between a balanced diet and obesity.
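For numeric data, the most common version is the Pearson correlation coefficient. The sketch below computes it by hand on made-up numbers; `pearson` is a helper written for this example, and the article does not say which coefficient it has in mind, so treat Pearson as an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


x = [1, 2, 3, 4, 5]

print(pearson(x, [2, 4, 6, 8, 10]))  # y rises in lockstep with x
print(pearson(x, [2, 1, 4, 3, 5]))   # y mostly rises with x
print(pearson(x, [10, 8, 6, 4, 2]))  # y falls as x rises
```

The first pair moves in perfect lockstep, the second only loosely, and the third shows the inverse relationship a negative coefficient describes.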

Finally, you should always remember the axiom of all data scientists: correlation does not imply causation.


9. Hypothesis

A hypothesis, in general, is a proposed explanation of an event. Hypotheses are often made based on past data and observations. A valid hypothesis is one that you can test, with a result that either supports it or rejects it.

In statistics, a hypothesis must be falsifiable. In other words, we should be able to test any hypothesis to determine whether it is valid or not. In machine learning, the term hypothesis refers to candidate models that we can use to map model inputs to the correct and valid output.
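As one hedged example of what testing a hypothesis can look like in statistics, take the hypothesis "this coin is fair". The sketch below computes an exact two-sided p-value for a number of heads; the function name is hypothetical, and the 0.05 cut-off is a common convention rather than something the article prescribes.

```python
from math import comb


def two_sided_binomial_p_value(heads, flips):
    """Probability, if the coin is fair, of a result at least as far
    from the expected flips / 2 heads as the one observed."""
    observed_distance = abs(heads - flips / 2)
    extreme_outcomes = 0
    for k in range(flips + 1):
        if abs(k - flips / 2) >= observed_distance:
            extreme_outcomes += comb(flips, k)
    return extreme_outcomes / 2 ** flips


p_value = two_sided_binomial_p_value(heads=9, flips=10)
print("p-value:", p_value)

# Conventional (assumed) significance level of 0.05.
if p_value < 0.05:
    print("The data casts doubt on the 'fair coin' hypothesis.")
else:
    print("The data is consistent with the 'fair coin' hypothesis.")
```

The hypothesis is falsifiable because some outcomes, such as nine heads in ten flips, would be very unlikely if it were true.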

10. Outlier

Outlier is a term used in data science and statistics to refer to an observation that is at an unusual distance from other values in the data set. The first thing every data scientist should do when given a dataset is decide what is considered a usual distance and what is unusual.
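One widely used rule of thumb for drawing that line is the interquartile range (IQR) rule: flag anything more than 1.5 IQRs below the first quartile or above the third. The sketch below applies it to made-up values; the 1.5 multiplier is a convention, not a law, and `find_outliers` is a hypothetical helper.

```python
import statistics


def find_outliers(values, multiplier=1.5):
    """Return the values lying outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


readings = [10, 12, 11, 13, 12, 11, 95]  # made-up measurements
print("flagged for investigation:", find_outliers(readings))
```

The function only flags values; deciding what to do with them is a separate step.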


An outlier can represent different things in the data; it could be noise that occurred during data collection or a way to spot rare events and unique patterns. Therefore outliers should not be removed right away. Instead, make sure you always investigate your outliers like the good data scientist that you are.

This article was originally published on Towards Data Science.

