Random Forest Algorithm | Machine Learning

Prerequisites: https://www.geeksforgeeks.org/decision-tree/

Random Forest Algorithm

Random Forest is a supervised learning algorithm that can be used for both classification and regression problems. It builds a "forest" of decision trees, each trained on a random sample of ‘K’ data points drawn from the dataset, and merges their outputs to get a more accurate and stable prediction than a single tree gives.

Each decision tree, trained on its own ‘K’ data points, produces a prediction. For regression we take the average of all the trees' predictions; for classification we take the majority vote.

Random Forest is an ensemble learning algorithm. Ensemble learning is the process of combining multiple models to produce a single prediction.
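The two ways of combining tree outputs can be illustrated with a minimal sketch. The prediction arrays below are made-up numbers, not output from a real model, and majority_vote is a hypothetical helper that assumes integer class labels starting at 0.

```python
import numpy as np

# Hypothetical predictions from three trees for four samples (illustrative values).
regression_preds = np.array([
    [2.0, 3.5, 1.0, 4.0],   # tree 1
    [2.4, 3.1, 1.2, 4.4],   # tree 2
    [1.6, 3.3, 0.8, 3.6],   # tree 3
])

# Regression: the forest prediction is the mean over trees, per sample.
forest_regression = regression_preds.mean(axis=0)

class_preds = np.array([
    [0, 1, 2, 1],   # tree 1
    [0, 1, 1, 1],   # tree 2
    [0, 2, 2, 1],   # tree 3
])


def majority_vote(votes):
    """Return the most frequent label in each column (labels must be 0..K-1)."""
    n_classes = votes.max() + 1
    # counts has shape (n_classes, n_samples): votes per class for each sample
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


forest_classification = majority_vote(class_preds)
print(forest_regression)
print(forest_classification)
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so this sketch shows the idea and not the library's exact procedure.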

Package used

1. Regressor

from sklearn.ensemble import RandomForestRegressor

2. Classifier

from sklearn.ensemble import RandomForestClassifier
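As a quick illustration of the regressor import, here is a minimal sketch on synthetic data. The noisy sine curve, the sample size and the parameter values are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic one-feature regression problem (illustrative data, not from the article).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Fit a forest of 50 trees; random_state makes the run repeatable.
regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X, y)

# Predict for one new point; the output is the average of the trees' predictions.
pred = regressor.predict([[2.5]])
print(pred)
```

The classifier is used the same way, with class labels instead of continuous targets.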

Parameters:

n_estimators : integer, optional (default=100 since scikit-learn 0.22; 10 in earlier versions)

The number of trees in the forest.

criterion : string, optional (default="gini")

The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

Note: this parameter is tree-specific.

max_features : int, float, string or None, optional (default="auto" in older releases; recent releases default to "sqrt" for the classifier and 1.0 for the regressor)

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

If "auto", then max_features=sqrt(n_features) for the classifier. This option has been removed from recent scikit-learn releases; use "sqrt" instead.

If "sqrt", then max_features=sqrt(n_features) (same as "auto").

If "log2", then max_features=log2(n_features).

If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.

max_depth : integer or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

max_leaf_nodes : int or None, optional (default=None)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool (default=False)

Whether to use out-of-bag samples to estimate the generalization accuracy.

n_jobs : integer or None, optional (default=1 in older versions; None in recent versions, which normally also means a single job)

The number of jobs to run in parallel for both fit and predict.

If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator.

If RandomState instance, random_state is the random number generator.

If None, the random number generator is the RandomState instance used by np.random.
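To see several of these parameters together, the following sketch fits a classifier on the iris data bundled with scikit-learn and reads the out-of-bag estimate. The specific values (200 trees, seed 42) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Iris data shipped with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    criterion="gini",      # split quality measure
    max_features="sqrt",   # features considered at each split
    max_depth=None,        # grow each tree until its leaves are pure
    bootstrap=True,        # each tree sees a bootstrap sample
    oob_score=True,        # score each sample using trees that did not see it
    n_jobs=1,              # set to -1 to use all available cores
    random_state=42,       # fixed seed for repeatable results
)
model.fit(X, y)

# Out-of-bag accuracy: an estimate of generalization accuracy without a test set.
print(model.oob_score_)
```

oob_score=True only works together with bootstrap=True, because the out-of-bag samples are exactly the rows left out of each tree's bootstrap sample.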

Algorithm of Random Forest Regression or Classification

1. For b = 1 to B (the number of trees):

(a) Draw a bootstrap sample Z of size N from the training data.

(b) Grow a random-forest tree Tb on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached.

i. Select m variables at random from the p variables.

ii. Pick the best variable/split-point among the m.

iii. Split the node into two daughter nodes.

2. Output the ensemble of trees {Tb}, b = 1, ..., B.
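The steps above can be sketched in a few lines by delegating the tree growing to scikit-learn's DecisionTreeClassifier. This is a simplified illustration, not the library's own implementation: fit_forest and predict_forest are hypothetical helper names, min_samples_leaf stands in for the minimum node size nmin, and the vote assumes integer labels 0..K-1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, m="sqrt", n_min=1, seed=0):
    """Step 1: grow n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # (a) bootstrap sample of size N, drawn with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # (b) the tree considers m randomly chosen variables at every split
        tree = DecisionTreeClassifier(
            max_features=m,
            min_samples_leaf=n_min,  # stand-in for the minimum node size
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    # Step 2: the ensemble of trees
    return trees


def predict_forest(trees, X):
    """Majority vote over the trees (assumes integer labels 0..K-1)."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    n_classes = votes.max() + 1
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)


X, y = load_iris(return_X_y=True)
forest = fit_forest(X, y)
train_accuracy = (predict_forest(forest, X) == y).mean()
print(train_accuracy)
```

For regression, the same loop would use DecisionTreeRegressor and replace the vote with a mean over the trees' predictions.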

Bootstrapping is any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates.
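A small NumPy sketch makes both ideas concrete: sampling with replacement, and using the resamples to put an interval around an estimate. The data here are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Part 1: one bootstrap sample of row indices, the same size as the data.
n = 10_000
data = np.arange(n)
sample = rng.choice(data, size=n, replace=True)

# Some rows are drawn several times and others not at all.
unique_fraction = np.unique(sample).size / n   # close to 1 - 1/e, about 0.632
out_of_bag = np.setdiff1d(data, sample)        # rows this sample never picked

# Part 2: a bootstrap confidence interval for a mean.
values = rng.normal(loc=5.0, scale=2.0, size=500)
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(1000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(unique_fraction, out_of_bag.size)
print(ci_low, values.mean(), ci_high)
```

The rows left out of a tree's bootstrap sample are its out-of-bag samples, which is what the oob_score parameter uses to estimate generalization accuracy.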

Criterion Parameter:

1. For Classification: Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.

2. For Regressor: Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. Since scikit-learn 1.0 these are named "squared_error" and "absolute_error", and the old names have been removed in later releases.
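To make the criteria concrete, here is a minimal sketch of the impurity measures computed for the samples in a single node. The function names are chosen for this example and are not scikit-learn's internal functions.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def mse(values):
    """Mean squared error around the node mean (the node's variance)."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)


def mae(values):
    """Mean absolute error around the node median."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - np.median(values)))


print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))   # evenly mixed node
print(gini([1, 1, 1]), entropy([1, 1, 1]))         # pure node
print(mse([1, 3]), mae([1, 2, 6]))
```

A split is chosen to make the weighted impurity of the two child nodes as small as possible, so a pure node (impurity 0) needs no further splitting.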

Advantages of Random Forest Algorithm:

1. Random Forest is considered handy and easy to use because its default hyperparameters often produce a good prediction result.

2. In classification problems, Random Forest is much less prone to overfitting than a single decision tree.

3. We can use it for both classification and regression.

4. Averaging many trees gives lower variance than a single decision tree.

Disadvantage:

1. A forest with a large number of trees is slow to evaluate, which can make it unsuitable for real-time predictions.

Overall, Random Forest is a (mostly) fast, simple and flexible tool, with good accuracy and a low tendency to overfit when enough trees are used.

Implementation Code (Random Forest Classifier)

Step-Wise Explanation

1. Import libraries.

2. Read the CSV file and store it in the variable dataset.

3. Identify the dependent variable (y) and the independent variables (X).

4. Split the dataset using train_test_split with an appropriate test size.

5. Import RandomForestClassifier.

6. Set the number of trees and the split criterion (entropy or gini).

7. Fit the model on the training data.

8. Predict on the test set; y_pred stores the predicted values.

9. Build the confusion matrix to check the predictions and assess the accuracy of the trained model.


#Import Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

#Jupyter-only magic; remove this line when running as a plain Python script

%matplotlib inline

#Read csv file

dataset = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')

#Overview of dataset

dataset.head()

dataset.info()

#Choose X and y after checking dataset: the first four columns are the measurements, the last column (species) is the class label

X = dataset.iloc[:, :4].values

y = dataset.iloc[:, 4].values

#Train test split (random_state fixed so the split is reproducible)

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)

#Import Classifier

from sklearn.ensemble import RandomForestClassifier

#Create Object

classifier = RandomForestClassifier(n_estimators=100,random_state=0,criterion='entropy')

#Model Fitting

classifier.fit(X_train,y_train)

#Prediction

y_pred=classifier.predict(X_test)

#Confusion Matrix Result

from sklearn.metrics import confusion_matrix,classification_report

print(confusion_matrix(y_test,y_pred))

#Per-class precision, recall and F1-score

print(classification_report(y_test,y_pred))
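Reading accuracy off a confusion matrix can be sketched as follows. The matrix below is a hypothetical 3-class example with made-up counts, not the output of the model above; it follows scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [15,  0,  0],
    [ 0, 13,  2],
    [ 0,  1, 14],
])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the true count of each class.
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: correct predictions divided by the predicted count of each class.
precision = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall)
print(precision)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.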

