An overview of statistics and linear algebra for data science
Statistics: What is it?
Statistics is a “branch of mathematics dealing with the collecting, classification, analysis, and interpretation of numerical data for making inferences based on their measurable likelihood,” according to Business Dictionary. Or, to put it another way, statistics is the process of gathering, examining, and drawing conclusions from data.
Significance in Data Science
There are several ways that statistics come into play in data science practices. These techniques can be quite beneficial for deciphering data and yielding intriguing outcomes.
- Experimental Design: Whether you realize it or not, answering a question usually means running some kind of experiment. That includes planning the experiment and handling sample sizes, control groups, and other design issues.
- Frequentist Statistics: Statistical practices such as confidence intervals and hypothesis tests let you determine how much a result or data point matters. Being able to judge significance and extract other key facts from data will make you a better data scientist.
- Modeling: Techniques such as regression and clustering are used often in data science for modeling work. Whether you’re trying to predict something, find the bigger picture in data, or understand the reasoning behind it, chances are you will end up using some sort of predictive modeling.
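To make the frequentist ideas above concrete, here is a minimal sketch of a 95% confidence interval for a sample mean, using only Python's standard library. The sample values are hypothetical, and the z-value of 1.96 is the normal approximation; a t-critical value would be more precise for a sample this small.

```python
import math
import statistics

# Hypothetical measurements, for illustration only.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

mean = statistics.mean(sample)
# Standard error of the mean: sample stdev / sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% confidence interval via the normal approximation (z = 1.96).
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.3f}, 95% CI ≈ ({lo:.3f}, {hi:.3f})")
```

If a hypothesized mean falls outside this interval, a hypothesis test at the 5% level would reject it, which is exactly the kind of "does this result matter?" judgment described above.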
How Important is it?
This is where things start to become a little hazy and opinions start to differ. To address this question, I suggest that we divide statistics into two categories: new and old.
Old statistics, such as regression and hypothesis tests, are simple by nature. While they can be useful, many distinguished data scientists expect them to be used less and less, arguing that these concepts will likely lose importance as statistical techniques evolve. On the other hand, new statistics, like decision trees and predictive power, are very useful and are used often by data scientists.
All this being said, I still recommend aspiring data scientists work through general statistical theory and practice. Even if you won’t use these ideas in everyday work, they are helpful stepping stones to the more advanced concepts you will use regularly, and they train analytical thinking along the way.
Linear algebra: What is it?
The area of mathematics known as linear algebra deals with vector spaces and linear mappings between them. This is a fairly solid start, but I think we can make that definition a little clearer and less academic.
Linear algebra, put simply, is arithmetic that deals with straight things in space.
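As a tiny illustration of that definition, a linear map is just a matrix, and applying it is matrix-vector multiplication. The map and vector below are made up for the example (NumPy assumed):

```python
import numpy as np

# A linear map in the plane: stretch x by 2, leave y unchanged.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

v = np.array([1.0, 3.0])

# Applying the map is matrix-vector multiplication; lines map to
# lines -- the "straight things" stay straight.
w = A @ v
print(w)  # [2. 3.]
```

Linearity also means the map respects scaling and addition: `A @ (2 * v)` equals `2 * (A @ v)`.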
Significance in Data Science
The following are some of the most well-known applications of linear algebra in data science right now:
- Machine learning: Many machine learning techniques incorporate elements of linear algebra; principal component analysis, eigenvalues, and regression, to name a few. This is particularly true when working with high-dimensional data, which is frequently represented as matrices.
- Modeling: If you want to model behavior in any way, you will likely use a matrix to break samples into subgroups so that your results are accurate. Doing so requires general matrix mathematics, such as inversion, derivation, and more.
- Optimization: Any data scientist can benefit greatly from understanding least squares methods. They can be applied to clustering, dimensionality reduction, and other tasks, all of which help improve networks or projections.
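To show least squares in action, here is a minimal sketch that fits a line to a few hypothetical data points with NumPy. The values are invented so that they lie near y = 2x + 1:

```python
import numpy as np

# Hypothetical data lying near the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: a column for the slope, a column of ones for
# the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimize ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, intercept = beta
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")
```

The same machinery, a design matrix plus a norm-minimizing solve, underlies regression, projections, and many of the optimization tasks mentioned above.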
As some of you may have noticed, the repeated use of the words “matrix” and “matrices” is no coincidence. The theory of general linear algebra makes extensive use of matrices, and the techniques developed for matrices apply equally well to tables and data frames, two important data structures in data science.
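Tying the pieces together, principal component analysis is exactly where matrices and eigenvalues meet: the first principal component is the top eigenvector of the data's covariance matrix. A minimal sketch with synthetic, made-up data:

```python
import numpy as np

# Synthetic 2-D data, strongly correlated along the direction (1, 2).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Center the data, then eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the first principal
# component: the direction of greatest variance.
pc1 = eigvecs[:, np.argmax(eigvals)]
print(pc1)
```

Because the data were generated along (1, 2), the recovered component points (up to sign) in roughly that direction, which is the dimensionality-reduction idea mentioned earlier.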