2. End-to-End ML Project
I summarized the important steps and libraries used in the ML project rather than focusing on the code. The notebook for this chapter is available at github.com/ageron/handson-ml2.
The main steps the author goes through in this chapter are:
Look at the big picture.
Get the data.
Discover and visualize the data to gain insights.
Prepare the data for Machine Learning algorithms.
Select a model and train it.
Fine-tune your model.
Present your solution.
Launch, monitor, and maintain your system.
In this chapter, the author goes through a project in which the goal is to predict the median housing price in any district using the California Housing Prices dataset.
You are pretending to be a recently hired Data Scientist at a real estate company. Following the steps above, you are expected to create a machine learning model whose output will be fed to another ML system, along with many other signals (in ML jargon, a signal is a piece of information fed to a system). Below is an example visualization of an ML pipeline for real estate investments.
Pipelines: A sequence of data processing components is called a data pipeline. Pipelines are the de facto standard in Machine Learning systems, since there is a lot of data to preprocess and many transformations to apply (see the sketch below).
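As a minimal sketch of the idea (not the book's code), here is how a short preprocessing pipeline might look in sklearn; the step names and strategies are illustrative choices:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A minimal preprocessing pipeline: each component's output
# is fed to the next component in the sequence.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill in missing values
    ("scaler", StandardScaler()),                   # standardize features
])

# Assuming housing_num is a DataFrame of numerical features:
# housing_prepared = num_pipeline.fit_transform(housing_num)
```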
Below are the definitions of two concepts that are often mistaken for each other:
Multiple Regression: A regression model that uses multiple features is called a multiple regression model (and univariate if there is a single target variable):

$\hat{y} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \mathbf{x}^\top \boldsymbol{\theta}$ (multiple, univariate regression)

Multivariate Regression: A regression model in which the number of target variables is more than one is called a multivariate regression model:

$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{W}$ (multivariate, multiple regression, where $\mathbf{X}$ is the feature matrix)
Notice that the feature matrix $\mathbf{X}$ is $(\text{num. of samples}) \times (\text{num. of features})$; it can be written as a stack of the feature vectors $\mathbf{x}^{(i)}$ (which are column vectors) laid out as rows, $\mathbf{X} = [\mathbf{x}^{(1)} \cdots \mathbf{x}^{(m)}]^\top$. One can also prepend a column of $1$'s to $\mathbf{X}$ and include the bias term $\theta_0$ inside the weight vector $\boldsymbol{\theta}$. This trick is often used in linear regression settings; separate bias terms, on the other hand, are used frequently in the neural network setting. Keeping both notations in mind will make adjusting to DL notation easier.
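A small numpy sketch of this bias-column trick (the names m, n, X_b, and theta are illustrative, not from the book):

```python
import numpy as np

m, n = 5, 3                      # num. of samples, num. of features
X = np.random.randn(m, n)        # feature matrix, shape (m, n)

# Prepend a column of 1's so the bias theta_0 can live inside theta:
# y_hat = X_b @ theta, with theta of shape (n + 1,)
X_b = np.c_[np.ones((m, 1)), X]  # shape (m, n + 1)

theta = np.random.randn(n + 1)   # [theta_0, theta_1, ..., theta_n]
y_hat = X_b @ theta              # predictions, shape (m,)
```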
In the book, the prediction for sample $i$ is represented by $\hat{y}^{(i)}$; we will write $\hat{y}_i$ here. Some of the most common loss functions used in regression are (I included more than the book does):
Mean Squared Error: $\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$

This is simply a rescaled version of the squared $\ell_2$ (Euclidean) norm of the error vector.

Root Mean Square Error: $\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$

Again, this is another rescaled version of the $\ell_2$ (Euclidean) norm of the error vector.

Mean Absolute Error: $\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}|\hat{y}_i - y_i|$

This error is a rescaled version of the $\ell_1$ (Manhattan) norm of the error vector.

Root Mean Absolute Error: $\mathrm{RMAE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}|\hat{y}_i - y_i|}$

This is again the square root of a rescaled $\ell_1$ (Manhattan) norm of the error vector.
In general, the $\ell_p$ vector norm is given by: $\|\mathbf{v}\|_p = \left(\sum_{i} |v_i|^p\right)^{1/p}$
Some notes about vector norms:
RMSE corresponds to the Euclidean norm, whereas MAE corresponds to the Manhattan norm.

$\|\mathbf{v}\|_0$ gives the number of nonzero elements of $\mathbf{v}$, and $\|\mathbf{v}\|_\infty$ gives the maximum of the absolute values of the components of $\mathbf{v}$.

The higher the norm index $p$, the more the norm focuses on large values, and the more negligible the effect of small ones becomes. Therefore, RMSE is more sensitive to outliers than MAE. But when outliers are exponentially rare (as in a bell-shaped curve), RMSE performs very well and is generally preferred.
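The following numpy sketch, with made-up numbers, checks that the losses above really are rescaled norms of the error vector:

```python
import numpy as np

def lp_norm(v, p):
    """General p-norm: (sum_i |v_i|**p) ** (1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

y     = np.array([3.0, -0.5, 2.0, 7.0])   # targets (made-up values)
y_hat = np.array([2.5,  0.0, 2.0, 8.0])   # predictions
err   = y_hat - y                          # error vector
m     = len(err)

mse  = np.mean(err ** 2)                   # rescaled squared l2 norm
rmse = np.sqrt(mse)                        # rescaled l2 norm
mae  = np.mean(np.abs(err))                # rescaled l1 norm

# RMSE and MAE agree with the corresponding rescaled vector norms:
assert np.isclose(rmse, lp_norm(err, 2) / np.sqrt(m))
assert np.isclose(mae,  lp_norm(err, 1) / m)
```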
Quantile: They are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
Percentile: The value below which k percent of the observations fall is called the k-th percentile.
Quartile: A quartile is a type of quantile which divides the number of data points into four more or less equal parts or quarters.
Example: The 50th percentile is the 2nd quartile, and it is commonly known as the median.
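A quick numpy illustration of the quartile/percentile/median relationship (the data values are made up):

```python
import numpy as np

data = np.array([1, 2, 4, 4, 5, 7, 9, 10])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # the three quartiles
assert q2 == np.median(data)  # the 2nd quartile is exactly the median
```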
Note: The standard deviation of the population is denoted by $\sigma$ (the sample std by $s$), and it is the square root of the variance, which is the average of the squared deviations from the mean. When a feature has a bell-shaped (Gaussian, or normal) distribution, the 68-95-99.7 rule applies: about 68% of the values fall within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$. This is very important, as it is commonly used when finding confidence intervals.
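A quick empirical check of the 68-95-99.7 rule on synthetic normal data (a sketch, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # bell-shaped sample

mu, sigma = x.mean(), x.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} sigma: {within:.3f}")  # ~0.683, ~0.954, ~0.997
```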
The easiest way to split given data into train and test sets is by using sklearn's train_test_split method. The random_state parameter is used to reproduce results when working with random generators; otherwise, each time you use a random generator you will get a different (random) answer. This way of splitting the data is fine if the dataset is large enough (compared to the number of features), but if it is not, it may produce significant sampling bias. Some of the common bias types in statistics are explained here.
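A minimal example; the CSV path below assumes the dataset layout of the book's repository:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the California Housing data (path as in the handson-ml2 repo).
housing = pd.read_csv("datasets/housing/housing.csv")

# random_state fixes the seed so the split is reproducible.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```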
Sampling bias: In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others.
In order to avoid sampling bias, we can utilize stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. Stratified sampling (into 5 strata) can be achieved by using pandas's cut method together with sklearn's StratifiedShuffleSplit object, as given below.
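The snippet below follows the book's approach; the bin edges are the book's 5 income categories, and `housing` is the DataFrame loaded above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median_income into 5 income categories (the strata) with pd.cut.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# Sample from each stratum so both sets mirror the category proportions.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
```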
The main idea of StratifiedShuffleSplit is that the train and test sets will follow an almost identical distribution of categories. Check the figure below.
The test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is skewed. Below are the results of the sampling experiment:
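A sketch of such a comparison, reusing the variables from the snippets above (the helper name income_cat_proportions is illustrative):

```python
def income_cat_proportions(data):
    """Share of each income category in a dataset."""
    return data["income_cat"].value_counts() / len(data)

# Re-split randomly now that income_cat exists, for a fair comparison.
_, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "Overall":    income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random":     income_cat_proportions(random_test_set),
}).sort_index()
print(comparison)  # the Stratified column tracks Overall closely
```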
Finally, by dropping the income_cat column, you can revert the train and test sets to their original state.
You can plot some portion of the data using either seaborn's scatterplot or pairplot method. The example below uses pairplot.
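A sketch using pairplot; the column subset and sample size are illustrative choices, and strat_train_set comes from the stratified split above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots for a few columns of the housing data;
# a random sample keeps the plot readable and fast to render.
cols = ["median_house_value", "median_income", "total_rooms",
        "housing_median_age"]
sns.pairplot(strat_train_set[cols].sample(500, random_state=42))
plt.show()
```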