This is an introductory article about the statistical aspects of data science. In this article, I will give a general overview of data science problems and the statistical techniques required to solve them. Some of the terminologies may sound new to you. But don’t worry, in subsequent articles, I will elaborate (with examples) on each of these concepts in detail.
Data science starts with statistics and ends with statistics. Statistics is the backbone of data science. The primary goal of data science is to understand a random process and predict its future observations. We denote the output of this process by a random variable Y. Y is random because we do not know its exact value in a particular instance of the process. We try to predict Y with the help of other causal variables X = (X1, X2, …, Xp) observed simultaneously in the process. For example, suppose we want to predict the price (Y) of a flat based on its floor size (X1) and roof height (X2). Here Y is random because two different flats with the same floor size and the same roof height may have different prices. A data set is a tabulated record of past values of (X, Y). With the help of this data set, we try to build a mathematical relationship between X and Y as

µ(Y) = f(X; β).   (1)
Here β = (β1, β2, …, βp) is an unknown parameter vector (each βi is associated with an Xi), and µ and f are mathematical functions. Depending on the nature of the problem, we place suitable functional assumptions on µ and f. Using the given data, we then estimate β by β̂ so that the prediction error of Y on the data is minimized. Now, for any future X, we predict µ(Y) by µ̂(Y), and consequently obtain Ŷ as a prediction of Y, using the estimated relation
µ̂(Y) = f(X; β̂).   (2)
In our house-price prediction problem, we may be interested in building a model like
E(Y | X) = β1X1 + β2X2,   (3)
where µ(Y) = E(Y | X), the expected value of Y given X = (X1, X2), and f(X; β) = β1X1 + β2X2.
Using past data, suppose we estimate
Ê(Y | X) = 8000X1 + 1000X2,   (4)
where β̂1 = 8000 and β̂2 = 1000. Now, if you are looking for a flat of 1200 square feet with a 10-foot roof, you may expect its price to be 8000 × 1200 + 1000 × 10 = 9,610,000. This whole process is called model building; in data science, the result is termed a Machine Learning (ML) model. It is built in three major steps:
a) Data pre-processing
b) Structural model assumption
c) Model building.
In each of these steps we use statistics extensively. We will now discuss them separately, along with the statistical techniques each step requires.
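Before walking through the steps, it may help to see the house-price example end to end. The sketch below fits the no-intercept model (3) by least squares on a small, made-up data set (the prices are constructed so that the fit recovers the coefficients in (4)) and then predicts the price of a 1200 sq ft flat with a 10 ft roof:

```python
import numpy as np

# Hypothetical past data: columns are floor size (sq ft) and roof height (ft).
# The prices are made up for illustration, following price = 8000*X1 + 1000*X2.
X = np.array([
    [1000.0, 10.0],
    [1500.0,  9.0],
    [ 800.0, 11.0],
    [2000.0, 10.0],
])
y = np.array([8_010_000.0, 12_009_000.0, 6_411_000.0, 16_010_000.0])

# Estimate beta by least squares: minimize ||y - X @ beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the price of a 1200 sq ft flat with a 10 ft roof, as in the text.
x_new = np.array([1200.0, 10.0])
y_pred = x_new @ beta_hat  # 8000*1200 + 1000*10 = 9,610,000
```

This is only a minimal sketch: a real model would include an intercept, many more observations, and the pre-processing and validation steps discussed below.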
- Data pre-processing: Every ML model tries to uncover a hidden pattern between X and Y in the data, so it requires a well-organized data set. Unfortunately, raw data is ill-organized in most cases.
Transforming the original raw data into a well-organized data set that an ML model can work with is called data pre-processing. Its major steps include removing missing values, outliers, and unnecessary rows and columns, or replacing them with proper values. To do so, we need a clear positional and distributional understanding of each variable in the data set, and descriptive statistics give us this understanding. For the positional aspect of the data, we use various measures of central tendency and dispersion; for the distributional aspect, we use various graphs and charts.
Topics to learn: minimum value, maximum value, range, percentile, mean, median, mode, variance, standard deviation, absolute deviation, skewness, kurtosis, pie chart, bar chart, histogram, box plot, etc.
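As a small illustration of these descriptive measures in pre-processing, the sketch below computes positional summaries of a hypothetical list of flat prices and flags outliers with the common 1.5 × IQR rule (the data and the rule's threshold are illustrative choices, not part of any standard pipeline):

```python
import statistics

# Hypothetical flat prices (in thousands); 9_999 is a deliberate outlier.
prices = [95, 102, 98, 110, 105, 99, 101, 9_999]

# Positional summaries: central tendency and dispersion.
mean = statistics.mean(prices)
median = statistics.median(prices)
stdev = statistics.stdev(prices)

# Rule of thumb: flag points more than 1.5*IQR outside the quartiles.
q1, q2, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
outliers = [p for p in prices if p < q1 - 1.5 * iqr or p > q3 + 1.5 * iqr]

# A cleaned data set with the outliers removed.
cleaned = [p for p in prices if p not in outliers]
```

Note how the single outlier drags the mean far from the median; that gap is often the first hint that the raw data needs cleaning.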
- Structural model assumption: This step specifies the form of µ(Y). In data science, we generally place probabilistic assumptions on µ(Y); probability distributions and expectations play a major role here. If Y is numeric, we take µ(Y) = E(Y|X), the expected value of Y given X. If Y has binary or multiple categorical outputs, we take µ(Y) = P(Y|X), the conditional probability of Y given X. For numeric Y, we typically assume Y|X follows a normal distribution; for categorical Y, we assume Y|X follows a binomial or multinomial distribution. It can be shown mathematically that when Y is numeric, E(Y|X) is the best predictor of a future Y, and when Y is categorical, the best prediction is Y = K, where K is the class for which P(Y = K|X) is highest.
Topics to learn: univariate and multivariate probability distributions, conditional probability distributions, expectation, conditional expectation, normal distribution, binomial distribution, multinomial distribution, etc.
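For categorical Y, the "best predictor" rule above reduces to picking the class with the highest conditional probability. A minimal sketch, using hypothetical model outputs for a three-class price band problem (the labels and probabilities are invented for illustration):

```python
def predict_class(cond_probs):
    """Return the class K maximizing P(Y = K | X)."""
    return max(cond_probs, key=cond_probs.get)

# Hypothetical conditional probabilities P(Y = k | X) for one observation.
probs = {"cheap": 0.2, "mid": 0.5, "expensive": 0.3}
best = predict_class(probs)  # "mid", the most probable class
```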
- Model building: This is the final step of building the ML model. Here we decide the form of f(X; β) and estimate the unknown β; statistical inference plays a major role. We first estimate β from the data using a suitable point-estimation technique. Each estimated βi is then tested for its significance in the model, and the insignificant βi's (along with the corresponding Xi's) are discarded. We also test the overall validity of the model. Interval estimation and various hypothesis-testing techniques help us choose the best model for the data set.
Topics to learn: point estimation (maximum likelihood estimation, Bayesian estimation, least-squares estimation), interval estimation, hypothesis testing (z-test, t-test, chi-square test), p-value, etc.
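To make the significance-testing idea concrete, here is a rough sketch for a one-predictor, no-intercept model: estimate the slope by least squares, compute its standard error from the residuals, and form a t-statistic for H0: β = 0. The data is simulated with a true slope of 2, and the "|t| > 2" cutoff is a common informal shorthand for significance at roughly the 5% level, not an exact critical value:

```python
import math
import random

random.seed(0)

# Simulated data: y = 2*x + Gaussian noise (true slope is 2).
n = 30
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Least-squares slope for the no-intercept model y = beta * x.
sxx = sum(xi * xi for xi in x)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx

# Residual variance and the standard error of beta_hat.
residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]
sigma2 = sum(r * r for r in residuals) / (n - 1)
se = math.sqrt(sigma2 / sxx)

# t-statistic for H0: beta = 0; large |t| means the slope is significant.
t_stat = beta_hat / se
significant = abs(t_stat) > 2.0
```

In a real analysis you would compare t_stat against the t-distribution with the appropriate degrees of freedom (or read off a p-value) rather than use a fixed cutoff; insignificant coefficients are then dropped and the model refit.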
I hope this article gave you a fair idea of the statistical techniques that underpin data science.
Any questions, feedback, suggestions for improvement are most welcome. 🙂