INVESTIGATION OF CONVOLUTIONAL NEURAL NETWORKS FOR VISUAL TRACKING OF PEDESTRIANS.

Human detection in images and video sequences remains an active research topic: despite extensive work, accurate and fast detection is still an open problem. This paper aims to provide additional insights into existing solutions for pedestrian detection. The proposed method uses only a part of the video frames for object detection. Experiments showed that running detection on every second frame increases processing speed by 88 % without loss of accuracy. However, skipping more frames introduces latency in tracking the approximated location of a pedestrian.

It seems natural to focus on images captured by one's eyes or cameras because they provide a lot of information about the surrounding world, whether goods on a shelf or people walking down the street. Our brain processes such images without much mental effort, but transferring this task to computers is much more difficult. In recent years, solutions to this problem have been greatly improved by Convolutional Neural Networks (CNN). Every year new state-of-the-art results are achieved in object detection tasks, but they come at a price: far more calculations have to be performed than in classical object detection algorithms.
In recent years, excellent results have been achieved by quantizing the weights of a CNN or representing them with low-precision numbers (Park et al. 2017; Hubara et al. 2016), as well as by changing the number of layers to obtain a smaller CNN model, such as SqueezeNet (Iandola et al. 2016), that can run even on mobile devices while being only marginally less effective. Our proposed method is to use only a part of the video frames for object detection in order to increase object tracking speed. Pedestrians are chosen as the objects to track in the experiments.
An overview of the tools, data and methods used, an analysis of the results of the modified system, and the resulting conclusions are also presented.

Materials and Methods:-
The speed of the object tracking algorithm is measured as the number of processed frames per second (FPS).
Objects in an image can be marked with rectangles called bounding boxes. A rectangle that defines an object accurately is called the ground truth (GT). An estimate (E) is a bounding box produced by an object detection algorithm. Two parameters that describe the recognition quality independently of the shape of the ground truth (GT) and the estimate (E) are recall and precision (Smith et al. 2005). Recall describes how much of the ground truth area is covered by the estimate. Despite a high recall value, object tracking results may be unsatisfactory: the estimate could cover the whole ground truth rectangle while only a part of the estimate covers the object (Fig. 1). The formula for this coefficient is:
$$\text{recall} = \frac{|A_E \cap A_{GT}|}{|A_{GT}|}$$
here $A_E$ is the area covered by the estimate and $A_{GT}$ is the area covered by the ground truth. Precision describes how much of the estimate covers the ground truth rectangle. A high precision likewise does not guarantee high tracking quality (Fig. 1): although the entire estimate might lie on the object, it may not cover the entire object. This coefficient can be calculated as:
$$\text{precision} = \frac{|A_E \cap A_{GT}|}{|A_E|}$$
Developers of MOTChallenge (Milan et al. 2016) distinguish two requirements for object tracking in a video sequence. The first is that each detection must be classified as a true positive (TP) or a false positive (FP); an object that is present in the image but was not detected must also be counted as a false negative (FN). The second requirement is that when an object is detected again after a video frame in which it was not detected, the tracking algorithm should assign it the same unique identifier that was used before. The loss of an identifier would increase the number of false positives (FP) and false negatives (FN).
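The recall and precision coefficients described above can be sketched for a single pair of boxes as follows. This is a minimal illustration, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; the function names are hypothetical and not taken from the original implementation.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area shared by two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def recall_precision(estimate, ground_truth):
    """Recall: covered share of the ground truth. Precision: useful share of the estimate."""
    inter = intersection_area(estimate, ground_truth)
    return inter / area(ground_truth), inter / area(estimate)


ground_truth = (10, 10, 20, 20)
# An oversized estimate covers the whole object: full recall, poor precision.
print(recall_precision((0, 0, 40, 40), ground_truth))
# A small estimate lies entirely on the object: full precision, poor recall.
print(recall_precision((12, 12, 18, 18), ground_truth))
```

The two calls correspond to the two failure cases described in the text: either coefficient can be high on its own while the box still localizes the object poorly.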
The aforementioned parameters (TP, FP, and FN) can be used to calculate the mean average precision (mAP). An object detection is considered successful when the intersection over union between the estimate and the ground truth is at least 0.5.
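A minimal sketch of how TP, FP and FN could be counted in one frame with the 0.5 intersection-over-union threshold is given below. The greedy matching order is an assumption made for brevity (benchmark tools normally use an optimal assignment), and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(estimates, ground_truths, threshold=0.5):
    """Greedily match estimates to ground truths; return (TP, FP, FN)."""
    unmatched = list(ground_truths)
    tp = fp = 0
    for est in estimates:
        # Pick the still-unmatched ground truth that overlaps this estimate most.
        best = max(unmatched, key=lambda gt: iou(est, gt), default=None)
        if best is not None and iou(est, best) >= threshold:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
    # Ground truths left without an estimate are missed objects.
    return tp, fp, len(unmatched)


ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
estimates = [(1, 0, 11, 10), (50, 50, 60, 60)]
print(count_detections(estimates, ground_truths))
```

In this example the first estimate overlaps a ground truth well enough to count as a true positive, the second estimate matches nothing, and one ground truth remains undetected.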
Object tracking quality is described by multiple object tracking precision (MOTP) (Milan et al. 2016; Bernardin et al. 2006). It is evaluated as the average localization accuracy over all correctly detected objects:
$$\text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_t}$$
here $c_t$ is the number of correctly detected objects in frame $t$ and $d_{t,i}$ is the intersection (overlap) between the bounding box estimated for object $i$ and the real bounding box given in the data.
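The errors made by the tracker are summarized by multiple object tracking accuracy (MOTA) (Milan et al. 2016):
$$\text{MOTA} = 1 - \frac{\sum_{t} (FN_t + FP_t + IDSW_t)}{\sum_{t} GT_t}$$
here $t$ is the frame number, $FN_t$ and $FP_t$ are the numbers of false negatives and false positives, $GT_t$ is the number of all objects to be detected, and $IDSW_t$ is the number of tracking identifier switches. The error count can be larger than the number of objects to be recognized, so this parameter lies in the range (-∞, 100] when expressed as a percentage.
Multi-Object Detection Benchmark (2D MOT-15) video files are used for testing (Milan et al. 2016). These video files differ in frame size, so experiments are performed on images of different sizes.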
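The two tracking metrics can be sketched from per-frame counts as shown below. The sketch assumes the MOTChallenge definitions, MOTA = 1 - Σ(FN + FP + IDSW) / ΣGT and MOTP = Σd / Σc; the dictionary layout and the numbers are made up for illustration only.

```python
def mota(frames):
    """MOTA = 1 - sum(FN + FP + IDSW) / sum(GT) over all frames."""
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    total_gt = sum(f["gt"] for f in frames)
    return 1.0 - errors / total_gt


def motp(frames):
    """MOTP = summed overlap of matched pairs / number of matched pairs."""
    total_overlap = sum(sum(f["overlaps"]) for f in frames)
    matches = sum(len(f["overlaps"]) for f in frames)
    return total_overlap / matches


# Illustrative counts for two frames (not measured data).
frames = [
    {"fn": 1, "fp": 0, "idsw": 0, "gt": 4, "overlaps": [0.8, 0.7, 0.6]},
    {"fn": 0, "fp": 1, "idsw": 1, "gt": 4, "overlaps": [0.75, 0.75, 0.5, 0.5]},
]
print(f"MOTA = {100 * mota(frames):.1f} %, MOTP = {100 * motp(frames):.1f} %")

# More errors than objects gives a negative MOTA.
print(mota([{"fn": 2, "fp": 3, "idsw": 0, "gt": 2, "overlaps": []}]))
```

The last call shows why the metric has no lower bound: five errors against two objects to be detected give a MOTA of -150 %.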
A typical object tracking system consists of three parts: object detection, filtering that dynamically updates the tracking coordinates, and assignment of the tracking coordinates to the relevant objects. Each of these parts can be implemented with different algorithms, depending on the result the whole system has to achieve. The system optimized here uses one particular set. Detection is based on the YOLOv2 machine learning model (Redmon & Farhadi 2016; Redmon et al. 2015), a Kalman filter is used for object tracking, and the identifier assignment problem is solved by the Kuhn-Munkres algorithm (computational complexity $O(n^3)$), also known as the Hungarian algorithm (Munkres 1957). Thus the detector provides the coordinates of a detected object, the Kalman filter makes the object position estimated between frames more resistant to high-frequency noise, and the Hungarian algorithm tries to ensure that a particular object retains its originally assigned identifier.
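The identifier assignment step can be illustrated with a small cost matrix. The sketch below is not the Kuhn-Munkres algorithm: it finds the same minimum-cost assignment by exhaustive search, which is only practical for a handful of objects, whereas a real tracker would use an $O(n^3)$ implementation. The cost values and the function name are hypothetical; a cost could be, for example, one minus the overlap between a predicted track box and a detection.

```python
from itertools import permutations


def assign_tracks(cost):
    """Minimum-cost one-to-one assignment of tracks (rows) to detections (columns).

    Exhaustive stand-in for the Hungarian algorithm; requires rows <= columns.
    """
    n_tracks, n_detections = len(cost), len(cost[0])
    best_total, best_columns = float("inf"), None
    for columns in permutations(range(n_detections), n_tracks):
        total = sum(cost[row][col] for row, col in enumerate(columns))
        if total < best_total:
            best_total, best_columns = total, columns
    return list(enumerate(best_columns)), best_total


# cost[i][j]: cost of giving detection j the identifier of track i.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
pairs, total = assign_tracks(cost)
print(pairs, total)
```

Note that track 1 does not receive its individually cheapest detection (column 1), because giving that detection to track 0 lowers the total cost; this global view is what keeps identifiers stable when pedestrians pass close to each other.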
A combination of a Kalman filter and the Hungarian algorithm called SORT (Simple Online and Realtime Tracking) was recently published and showed promising results (Bewley et al. 2016). The advantage of this combination is high speed: the tracker is not burdened with attempts to handle extreme cases, which would slow the algorithm down even though they would make it more resistant to errors. This problem is considered to be self-solving because the quality and speed of recognition algorithms are steadily improving. To make object detection more resistant to noise, tracking results are stored for some time before they are deleted.
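The role of the Kalman filter can be sketched for a single box coordinate with a constant-velocity model. This is a simplified illustration rather than the SORT state model; the class name and the noise parameters `p0`, `q` and `r` are assumed values. The prediction step alone can be applied on frames for which no detection is available.

```python
class ConstantVelocityKalman:
    """Kalman filter tracking the position and velocity of one coordinate."""

    def __init__(self, position, p0=100.0, q=0.01, r=1.0):
        self.pos, self.vel = float(position), 0.0
        # Covariance matrix entries P = [[p00, p01], [p10, p11]].
        self.p00, self.p01, self.p10, self.p11 = p0, 0.0, 0.0, p0
        self.q, self.r = q, r  # process and measurement noise (assumed)

    def predict(self, dt=1.0):
        """Advance the state by one frame: x = F x, P = F P F^T + Q."""
        self.pos += self.vel * dt
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measurement):
        """Correct the predicted state with a detected coordinate."""
        s = self.p00 + self.r                # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        y = measurement - self.pos           # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos


# An object moving 2 px per frame, detected on every second frame only.
kf = ConstantVelocityKalman(position=0.0)
for t in range(1, 50):
    kf.predict()
    if t % 2 == 0:
        kf.update(2.0 * t)
print(round(kf.vel, 2), round(kf.pos, 1))
```

Between detections the filter extrapolates the position from the estimated velocity, which is what allows the tracker to keep reporting coordinates on frames where the detector is not run.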
The experimental research uses a computer with an eight-core AMD Ryzen 7 1700 3.7 GHz processor, 32 GB of RAM, and an ASUS ROG Strix GeForce GTX1080Ti GAMING graphics processor with 11 GB of RAM. The code is written in the Python programming language with the OpenCV library, and some parts are written in C and C++ to achieve higher algorithm performance.
Training is terminated after 45 thousand iterations. The best result on the validation data is then taken. The training is carried out on 416 × 416 images, and then the image resolution is increased to 608 × 608. This allows better detection of small objects and a higher detection result. Training is done five times; 6000 images from the same database are used for validation. The best achieved average detection accuracy is 77.81%. The initial parameters for pedestrian tracking are given in Table 1. Most of the calculation time is spent on object detection, which takes several tens of times longer than the other parts. It should also be noted that relatively high tracking quality is achieved for small images (640 × 480). Lower quality is obtained when tracking subjects in large video frames or where one side of the video frame is longer than the other (KITTI-17). It can also be seen from the MOTP that the accuracy of object localization is about 70%, but the MOTA parameter reveals a rather high number of errors.

Results of the Experimental Investigation:-
Experiments start by using only every second, third or fourth frame for pedestrian detection. Metrics of multiple object detection and their standard deviations are presented in Table 2. Fig. 2-3 show how the speed of the YOLOv2 object tracking algorithm depends on using only a part of the frames. These results show that using only one of several frames for object detection greatly decreases the computational load. However, every skipped frame has its cost: the pedestrian tracking algorithm becomes less robust, localization is slightly worse (lower recall and precision), and more detection and identification mistakes are made (lower MOTA metric).
A different approach to skipping frames for detection can also be used: skipping only a part of the frames. Experiments skipping every third and every fourth frame have been carried out, and their results are presented in Table 3 and Fig. 4-5. These results show that skipping some frames speeds up the object tracking algorithm. The decrease in computational load introduces virtually no penalty on multiple object tracking metrics.
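The two frame-selection schemes used in the experiments can be sketched as a simple per-frame decision. Which frame inside each group of n is chosen is an assumption here, and the mode names are hypothetical.

```python
def runs_detector(frame_index, n, mode):
    """Decide whether the detector runs on a frame (0-based index)."""
    if mode == "use_every_nth":
        # Detection only on every n-th frame; the tracker predicts on the rest.
        return frame_index % n == 0
    if mode == "skip_every_nth":
        # Detection on all frames except every n-th one.
        return frame_index % n != n - 1
    raise ValueError(f"unknown mode: {mode}")


def detected_fraction(num_frames, n, mode):
    """Share of frames on which the detector is run."""
    hits = sum(runs_detector(i, n, mode) for i in range(num_frames))
    return hits / num_frames


for mode, n in [("use_every_nth", 2), ("use_every_nth", 4),
                ("skip_every_nth", 3), ("skip_every_nth", 4)]:
    print(mode, n, detected_fraction(12, n, mode))
```

The second scheme keeps the detector running on most frames, which is consistent with the smaller effect on the tracking metrics reported for it.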

Conclusions:-
The speed of a convolutional neural network can be increased by changing parts of the network's architecture, but finding an optimal solution can be a highly complex task. Skipping the object detection part of the algorithm on some frames can be used to speed up a multiple object tracking algorithm without much loss of tracking accuracy.
The speedup of the tracking algorithm depends linearly on the number of skipped frames, but using only every second frame does not make the algorithm two times faster. Around 10-12 percent of the initial computation time is spent on parts other than pedestrian detection (Kalman filter and Hungarian algorithm).
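Why halving the number of detections does not double the speed can be illustrated with an idealized timing model. The model is an assumption introduced here, not the measurement procedure of the experiments, and it is not fitted to the reported results: it supposes that the non-detection parts run on every frame at a fixed cost and that detection cost scales with the fraction of frames on which the detector runs.

```python
def relative_speedup(detection_share, detected_fraction):
    """Idealized speedup when the detector runs on only a fraction of frames.

    detection_share: share of per-frame time spent on detection when every
        frame is detected (0.88-0.90 if 10-12 % goes to the other parts).
    detected_fraction: fraction of frames on which the detector still runs.
    """
    remaining_time = (1.0 - detection_share) + detection_share * detected_fraction
    return 1.0 / remaining_time


for share in (0.88, 0.90):
    speedup = relative_speedup(share, 0.5)
    print(f"detection share {share:.2f}: speedup {speedup:.2f}x")
```

Under these assumptions the time saved grows linearly with the fraction of skipped frames, but the fixed cost of the Kalman filter and the Hungarian algorithm keeps the speedup for every second frame below a factor of two.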