Step-by-Step R-CNN Implementation From Scratch In Python

Classification and object detection are the main parts of computer vision. Classification is finding what is in an image and object detection and localisation is finding where is that object in that image. Detection is a more complex problem to solve as we need to find the coordinates of the object in an image.

To Solve this problem R-CNN was introduced by Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik in 2014. R-CNN stands for Regions with CNN. In R-CNN instead of running classification on huge number of regions we pass the image through selective search and select first 2000 region proposal from the result and run classification on that. In this way instead of classifying huge number of regions we need to just classify first 2000 regions. This makes this algorithm fast compared to previous techniques of object detection. There are 4 steps in R-CNN. They are as follows :-

  1. Pass the image through selective search and generate region proposal.
  2. Calculate IOU (intersection over union) on proposed region with ground truth data and add label to the proposed regions.
  3. Do transfer learning using the proposed regions with the labels.
  4. Pass the test image to selective search and then pass the first 2000 proposed regions from the trained model and predict the class of those regions.

I am going to implement full R-CNN from scratch in Keras using Airplane data-set from . To get the annotated data-set you can download it from link below. The code for implemented RCNN can also be found in the below mentioned repository.

Once you have downloaded the dataset you can proceed with the steps written below.

import os,cv2,keras
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

First step is to import all the libraries which will be needed to implement R-CNN. We need cv2 to perform selective search on the images. To use selective search we need to download opencv-contrib-python. To download that just run pip install opencv-contrib-python in the terminal and install it from pypi.

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()

After downloading opencv-contrib we need to initialise selective search. For that we have added the above step.

def get_iou(bb1, bb2):
assert bb1['x1'] < bb1['x2']
assert bb1['y1'] < bb1['y2']
assert bb2['x1'] < bb2['x2']
assert bb2['y1'] < bb2['y2'] x_left = max(bb1['x1'], bb2['x1'])
y_top = max(bb1['y1'], bb2['y1'])
x_right = min(bb1['x2'], bb2['x2'])
y_bottom = min(bb1['y2'], bb2['y2']) if x_right < x_left or y_bottom < y_top:
return 0.0 intersection_area = (x_right - x_left) * (y_bottom - y_top) bb1_area = (bb1['x2'] - bb1['x1']) * (bb1['y2'] - bb1['y1'])
bb2_area = (bb2['x2'] - bb2['x1']) * (bb2['y2'] - bb2['y1']) iou = intersection_area / float(bb1_area + bb2_area - intersection_area)
assert iou >= 0.0
assert iou <= 1.0
return iou

Now we are initialising the function to calculate IOU (Intersection Over Union) of the ground truth box from the box computed by selective search. To understand more about calculating IOU you can refer to the link below.

for e,i in enumerate(os.listdir(annot)):
if i.startswith("airplane"):
filename = i.split(".")[0]+".jpg"
image = cv2.imread(os.path.join(path,filename))
df = pd.read_csv(os.path.join(annot,i))
for row in df.iterrows():
x1 = int(row[1][0].split(" ")[0])
y1 = int(row[1][0].split(" ")[1])
x2 = int(row[1][0].split(" ")[2])
y2 = int(row[1][0].split(" ")[3])
ssresults = ss.process()
imout = image.copy()
counter = 0
falsecounter = 0
flag = 0
fflag = 0
bflag = 0
for e,result in enumerate(ssresults):
if e < 2000 and flag == 0:
for gtval in gtvalues:
x,y,w,h = result
iou = get_iou(gtval,{"x1":x,"x2":x+w,"y1":y,"y2":y+h})
if counter < 30:
if iou > 0.70:
timage = imout[y:y+h,x:x+w]
resized = cv2.resize(timage, (224,224), interpolation = cv2.INTER_AREA)
counter += 1
else :
fflag =1
if falsecounter <30:
if iou < 0.3:
timage = imout[y:y+h,x:x+w]
resized = cv2.resize(timage, (224,224), interpolation = cv2.INTER_AREA)
falsecounter += 1
else :
bflag = 1
if fflag == 1 and bflag == 1:
flag = 1
except Exception as e:
print("error in "+filename)
Running selective search on an image and getting proposed regions

The above code is pre-processing and creating the data-set to pass to the model. As in this case we can have 2 classes. These classes are that whether the proposed region can be a foreground (i.e. Airplane) or a background. So we will set the label of foreground (i.e. Airplane) as 1 and the label of background as 0. The following steps are being performed in the above code block.

  1. Loop over the image folder and set each image one by one as the base for selective search using code ss.setBaseImage(image)
  2. Initialising fast selective search and getting proposed regions using using code ss.switchToSelectiveSearchFast() and ssresults = ss.process()
  3. Iterating over all the first 2000 results passed by selective search and calculating IOU of the proposed region and annotated region using the get_iou() function created above.
  4. Now as one image can many negative sample (i.e. background) and just some positive sample (i.e. airplane) so we need to make sure that we have good proportion of both positive and negative sample to train our model. Therefore we have set that we will collect maximum of 30 negative sample (i.e. background) and positive sample (i.e. airplane) from one image.

After running the above code snippet our training data will be ready. List train_images=[] will contain all the images and train_labels=[] will contain all the labels marking airplane images as 1 and non airplane images (i.e. background images) as 0.

Positive sample on right, Negative sample on left
X_new = np.array(train_images)
y_new = np.array(train_labels)

After completing the process of creating the dataset we will convert the array to numpy array so that we can traverse it easily and pass the datatset to the model in an efficient way.

from keras.layers import Dense
from keras import Model
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adamfrom keras.applications.vgg16 import VGG16
vggmodel = VGG16(weights='imagenet', include_top=True)

Now we will do transfer learning on the imagenet weight. We will import VGG16 model and also put the imagenet weight in the model. To learn more about transfer learning you can refer to the article on link below.

for layers in (vggmodel.layers)[:15]:
layers.trainable = FalseX= vggmodel.layers[-2].output
predictions = Dense(2, activation="softmax")(X)
model_final = Model(input = vggmodel.input, output = predictions)
opt = Adam(lr=0.0001)
model_final.compile(loss = keras.losses.categorical_crossentropy, optimizer = opt, metrics=["accuracy"])

In this part in the loop we are freezing the first 15 layers of the model. After that we are taking out the second last layer of the model and then adding a 2 unit softmax dense layer as we have just 2 classes to predict i.e. foreground or background. After that we are compiling the model using Adam optimizer with learning rate of 0.001. We are using categorical_crossentropy as loss since the output of the model is categorical. Finally the summary of the model will is printed using model_final.summary(). The image of summary is attached below.

Model Summary
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizerclass MyLabelBinarizer(LabelBinarizer):
def transform(self, y):
Y = super().transform(y)
if self.y_type_ == 'binary':
return np.hstack((Y, 1-Y))
return Y
def inverse_transform(self, Y, threshold=None):
if self.y_type_ == 'binary':
return super().inverse_transform(Y[:, 0], threshold)
return super().inverse_transform(Y, threshold)lenc = MyLabelBinarizer()
Y = lenc.fit_transform(y_new)X_train, X_test , y_train, y_test = train_test_split(X_new,Y,test_size=0.10)

After creating the model now we need to split the dataset into train and test set. Before that we need to one-hot encode the label. For that we are using MyLabelBinarizer() and encoding the dataset. Then we are splitting the dataset using train_test_split from sklearn. We are keeping 10% of the dataset as test set and 90% as training set.

trdata = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
traindata = trdata.flow(x=X_train, y=y_train)
tsdata = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
testdata = tsdata.flow(x=X_test, y=y_test)

Now we will use Keras ImageDataGenerator to pass the dataset to the model. We will do some augmentation on the dataset like horizontal flip, vertical flip and rotation to increase the dataset.

from keras.callbacks import ModelCheckpoint, EarlyStoppingcheckpoint = ModelCheckpoint("ieeercnn_vgg16_1.h5", monitor='val_loss', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)early = EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=1, mode='auto')hist = model_final.fit_generator(generator= traindata, steps_per_epoch= 10, epochs= 1000, validation_data= testdata, validation_steps=2, callbacks=[checkpoint,early])

Now we start the training of the model using fit_generator.

for e,i in enumerate(os.listdir(path)):
if i.startswith("4"):
z += 1
img = cv2.imread(os.path.join(path,i))
ssresults = ss.process()
imout = img.copy()
for e,result in enumerate(ssresults):
if e < 2000:
x,y,w,h = result
timage = imout[y:y+h,x:x+w]
resized = cv2.resize(timage, (224,224), interpolation = cv2.INTER_AREA)
img = np.expand_dims(resized, axis=0)
out= model_final.predict(img)
if out[0][0] > 0.70:
cv2.rectangle(imout, (x, y), (x+w, y+h), (0, 255, 0), 1, cv2.LINE_AA)

Now once we have created the model. We need to do prediction on that model. For that we need to follow the steps mentioned below :

  1. pass the image from selective search.
  2. pass all the result of the selective search to the model as input using model_final.predict(img).
  3. If the output of the model says the region to be a foreground image (i.e. airplane image) and if the confidence is above the defined threshold then create bounding box on the original image on the coordinate of the proposed region.
Image for post
Output of the model

As you can see above we created box on the proposed region in which the accuracy of the model was above 0.70. In this way we can do localisation on an image and perform object detection using R-CNN. This is how we implement an R-CNN architecture from scratch using keras.

You can get the fully implemented R-CNN from the link provided below.

No Comments

Leave A Comment