Prerequisites: https://www.geeksforgeeks.org/decision-tree/
Random Forest Algorithm
The random forest algorithm can be used for both classification and regression problems. Random Forest is a supervised learning algorithm that builds a "forest" of decision trees to produce its results. Each tree is trained on ‘K’ data points picked at random from the dataset, and the trees' outputs are merged to get a more accurate and stable prediction.
Each decision tree built from its ‘K’ data points gives its own prediction. For regression we take the average of all the predictions, and for classification we take the majority vote.
Random forest is an ensemble learning algorithm. Ensemble learning is the process by which multiple models are combined to predict one result.
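To make the idea of combining many trees concrete, here is a minimal sketch for the regression case. The synthetic dataset and the parameter values are assumptions chosen only for illustration. It checks that averaging the predictions of the individual trees by hand gives the same result as the forest's own predict method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption made for illustration only)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Prediction of every individual tree: shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging the trees by hand should match the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])

print("Match:", np.allclose(manual_average, forest_prediction))
```

For classification, scikit-learn combines the trees by averaging their predicted class probabilities, which usually agrees with a plain majority vote.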
Package used
1. Regressor
from sklearn.ensemble import RandomForestRegressor
2. Classifier
from sklearn.ensemble import RandomForestClassifier
Parameters (defaults as of recent scikit-learn releases):
n_estimators : int, optional (default=100; it was 10 before version 0.22)
criterion : string, optional (default="gini" for the classifier)
max_features : int, float, string or None, optional (default="sqrt" for the classifier and 1.0 for the regressor; the older "auto" value has been removed)
max_depth : int or None, optional (default=None)
max_leaf_nodes : int or None, optional (default=None)
bootstrap : bool, optional (default=True)
oob_score : bool (default=False)
n_jobs : int or None, optional (default=None, which means 1 job)
random_state : int, RandomState instance or None, optional (default=None)
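As a rough illustration of how these parameters are passed, the sketch below builds a classifier on scikit-learn's bundled iris data. The specific values (200 trees, "sqrt" features, and so on) are assumptions for the example, not recommendations. Setting oob_score=True asks the forest to estimate accuracy from the samples each tree did not see in its bootstrap sample.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    criterion="gini",      # split quality measure
    max_features="sqrt",   # features considered at each split
    max_depth=None,        # grow trees until leaves are pure
    max_leaf_nodes=None,   # no cap on the number of leaves
    bootstrap=True,        # train each tree on a bootstrap sample
    oob_score=True,        # compute the out-of-bag accuracy estimate
    n_jobs=-1,             # use all available CPU cores
    random_state=42,       # make the run reproducible
)
clf.fit(X, y)

oob_accuracy = clf.oob_score_
print(f"Out-of-bag accuracy: {oob_accuracy:.3f}")
```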
Algorithm of Random Forest Regression or Classification
1. For b = 1 to B (the number of trees):
(a) Draw a bootstrap sample Z of size N from the training data.
(b) Grow a random-forest tree T_b on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {T_b}, b = 1, ..., B.
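The steps above can be sketched by hand with scikit-learn's DecisionTreeClassifier as the base learner. This is a simplified illustration, not how RandomForestClassifier is implemented internally. The helper names grow_forest and predict_majority are hypothetical, max_features stands in for m, min_samples_leaf stands in for the minimum node size, and the labels are assumed to be non-negative integers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=25, m=2, n_min=1, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # (a) bootstrap sample of size N, drawn with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # (b) grow a tree; max_features=m makes every split consider
        #     only m randomly chosen variables
        tree = DecisionTreeClassifier(
            max_features=m,
            min_samples_leaf=n_min,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    """Combine the trees by majority vote (assumes integer class labels)."""
    votes = np.array([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
trees = grow_forest(X, y)
train_accuracy = np.mean(predict_majority(trees, X) == y)
print(f"Training accuracy of the hand-built forest: {train_accuracy:.3f}")
```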
Bootstrapping is any test or metric that relies on random sampling with replacement. It allows measures of accuracy (bias, variance, confidence intervals, prediction error or a similar measure) to be assigned to sample estimates. In a random forest, each tree is trained on its own bootstrap sample of the training data.
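A small sketch of bootstrapping on its own, separate from any tree model: the code resamples a dataset with replacement many times and uses the spread of the resampled means to form a confidence interval. The synthetic sample and the choice of 2000 resamples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (assumption: 100 draws from a normal distribution)
sample = rng.normal(loc=5.0, scale=2.0, size=100)

n_resamples = 2000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a resample of the same size, with replacement
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile bootstrap 95% confidence interval for the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample.mean():.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```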
Criterion Parameter:
1. For Classification: Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.
2. For Regressor: Supported criteria are "squared_error" (called "mse" in older scikit-learn versions) for the mean squared error, which is equal to variance reduction as feature selection criterion, and "absolute_error" (formerly "mae") for the mean absolute error.
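To show what the two classification criteria measure, here is a minimal sketch that computes Gini impurity and entropy for the labels in a single node. The function names and the toy label lists are hypothetical and only serve the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

balanced = ["a", "a", "b", "b"]   # two classes, evenly mixed
pure = ["a", "a", "a", "a"]       # a single class

print("balanced:", gini(balanced), entropy(balanced))
print("pure:    ", gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a split is judged good when it lowers them in the daughter nodes.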
Advantages of Random Forest Algorithm:
1. Random Forest is considered very handy and easy to use because its default hyperparameters often produce good predictions.
2. In classification problems, averaging over many trees makes Random Forest much less prone to overfitting than a single decision tree.
3. We can use it for both classification and regression.
4. Lower variance than a single decision tree.
Disadvantage:
1. A large number of trees makes prediction slower, which can make Random Forest unsuitable for real-time predictions.
Overall, Random Forest is a (mostly) fast, simple and flexible tool, with good accuracy and little overfitting (if a suitable number of trees is selected).
Implementation Code (Random Forest Classifier)
Step-Wise Explanation
1. Import the libraries.
2. Read the CSV file and store it in the variable dataset.
3. Identify the dependent and independent variables.
4. Split the dataset using train_test_split with an appropriate test size.
5. Import RandomForestClassifier.
6. Set the number of trees and the split criterion (entropy or gini).
7. Fit the model.
8. y_pred stores the predicted values.
9. Use the confusion matrix to check the predictions and assess the accuracy of the trained model.
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline   (uncomment when running in a Jupyter notebook)

# Read the CSV file
dataset = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')

# Overview of the dataset
print(dataset.head())
dataset.info()

# Choose X and y after checking the dataset.
# Assumption: the last column holds the class label (species)
# and the remaining columns are the numeric features.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Import the classifier
from sklearn.ensemble import RandomForestClassifier

# Create the model object
classifier = RandomForestClassifier(n_estimators=100, random_state=0, criterion='entropy')

# Fit the model
classifier.fit(X_train, y_train)

# Prediction
y_pred = classifier.predict(X_test)

# Confusion matrix and per-class report
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
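Step 9 reads the accuracy off the confusion matrix. The sketch below shows one way to do that: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. It uses scikit-learn's bundled copy of the iris data instead of the CSV so that it runs offline, and the fixed random_state and stratified split are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

classifier = RandomForestClassifier(
    n_estimators=100, criterion="entropy", random_state=0
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

# Correct predictions lie on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(f"Accuracy from confusion matrix: {accuracy_from_cm:.3f}")
print(f"accuracy_score:                 {accuracy_score(y_test, y_pred):.3f}")
```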