So in this video, I’m going to give you a clear and simple explanation of how DeepSORT works and why it’s so amazing compared to other models like Tracktor++, TrackR-CNN and JDE. But to understand how DeepSORT works, we first have to go back, waaaaaay back, to understand the fundamentals of object tracking and the key innovations that had to happen along the way for DeepSORT to emerge.

Before we get started, if you are interested in developing object tracking apps, then check out my course in the link down below, where I show you how you can fuse the popular YOLOv4 with DeepSORT for robust and real-time applications.

Okay, so back to object tracking. Now let’s imagine that you are working for SpaceX and Mr. Musk has tasked you with ensuring that on launch the ground camera is always pointing at the Falcon 9 as it thrusts into the atmosphere. As much as you are excited to be personally chosen by Elon to work on this task, you ask yourself, “How will I go about this?” Well, given that you have a PTZ, or pan-tilt-zoom, camera aimed at the rocket, you will need to implement a way to track the rocket and keep it at the center of the image. So far so good…? Just note that if you do not track it properly, your PTZ motion will stray off the target and you’ll end up with a really disappointed Musk. And you cannot screw this up, because this is your first job and you really want Elon Musk to be impressed. I mean, who wouldn’t want to, right?

Soo, question… how will you track the rocket? Well, you might say, “Well Ritz, you did a whole tutorial series on object detection. Why don’t we just track by detection? You know, umm, using something like YOLOv4 or Detectron2?” Hahaha.. Okay, okay.. let’s see what happens if we use this method.
So the Falcon 9 launches on a day with clear blue skies, and you are armed with state-of-the-art detection models for centering the camera on the rocket. Everything is going well~ until all of a sudden a stray pigeon swoops in front of the camera, occluding the rocket, and just like that the rocket is out of sight. …The boss is not happy. Deep down inside you feel your heart sink and your soul crushed by the disappointment. But you light up some greens, he chills out, and after a smoke or two he decides to give you another chance.

The high has also given you a chance to reflect on why this did not work. You conclude that while detection works great for single frames, there needs to be a correlation of tracked features between sequential images of the video. Otherwise, with any sort of occlusion, you will lose detection and your target may slip out of the frame.

So you dig a little deeper in an attempt not to disappoint Mr. Musk again, and you go back to traditional methods such as mean shift and optical flow. Starting with mean shift, you find out that it works by taking your object of interest, which you can visualize as a blob of pixels, so not just location but also size. In this case the Falcon 9 rocket that we are detecting is our blob. Then you go to the next frame and search within a larger region of interest, known as the neighborhood, for the same blob. You’ll want to find the blob of pixels or features in the next frame that best represents our rocket by maximizing a similarity function. This strategy makes a lot of sense. If your dog goes missing, you won’t just drive to the countryside; instead you start by searching your immediate neighborhood for your best friend. Unless of course you have a dog like Lassie. In that case, she’ll find you.

The other tool you look into is optical flow, which looks at the apparent motion of features across frames caused by the relative motion between the scene and the camera.
So say, for example, you have your rocket in your image and it moves up in the image; you will be able to estimate the motion vectors in frame 2 relative to frame 1. Now if your object is moving at a certain velocity, you will be able to use these motion vectors to track and even predict the trajectory of the object in the next frame. A popular optical flow method that you could use for this is Lucas-Kanade.

Cool, so now you’ve got another shot at impressing Mr. Musk. He was only a little annoyed… that’s right.. only a little annoyed.. that you lost his rocket. So to save Elon a buck or two, you decide to model this in simulation and test the viability of optical flow and mean shift. You find out some interesting things from this experiment. After running your simulations, you discover that while the traditional methods have good tracking performance, optical flow is computationally complex and prone to noise, and mean shift is unreliable if the object happens to go beyond the neighborhood region of interest. So move too fast, lose the track. And that’s not even considering any type of significant occlusion.

So as much as you want to show this off to Mr. Musk, you have a gut feeling telling you that you can do better.. way better!! You go to your shrine and meditate for a bit, and spend some time crunching the numbers and the reasons why you would be better off working somewhere else. But then you stumble across an amazing technique used almost everywhere, known as the Kalman filter. Now I have a whole video on what the Kalman filter is and how you can use it to catch Pokémon. But essentially its premise is this: say you are tracking a ball rolling in one dimension. You can easily detect it within each frame. That detection is your input signal, which you can rely on as long as there is a clear line of sight to the ball, with very low noise.
Now during detection, you decide to simulate cloudy conditions using that fog machine you used at the last office party. You can still see the ball, but now your vision sensor has noise in it, which decreases your confidence in where the ball is. Now let’s make it a bit more complex and throw in another scenario where the ball travels behind a box, which occludes the ball. How do you track something that you can’t see?

Well, this is where the Kalman filter comes in. Assuming a constant velocity model and Gaussian noise, you can guesstimate where the ball is based on the model of its motion. When the ball can be seen, you rely more on the sensor data and thus put more weight on it. When it is partially occluded, you place weight on both the motion model and the sensor measurement. And if it’s fully occluded, you shift a lot of weight onto the motion model. And the best part of the Kalman filter is that it is recursive, meaning we take the previous estimate to predict the current state, then use the current measurement to update that prediction.

Now of course there is a lot more to the Kalman filter than can be covered in just one video. But by now you’re probably wondering, “Ritz, the title of this video is on DeepSORT.. what are you going on about Kalman filters and traditional tracking algorithms from the good ol’ days?! What’s going on here, man!?” Hold up, hold up, hold up, we are getting there, just bear with me. The Kalman filter is a crucial component in DeepSORT. Let’s explore why.

The next launch is coming up soon, where multiple projectiles may need to be tracked, so you are required to find a way for your camera to track your designated rocket. The Kalman filter looks promising, but the Kalman filter alone may not be enough. Enter SORT: Simple Online and Realtime Tracking. You learn that SORT comprises 4 core components, which are: 1. Detection, 2. Estimation, 3. Association, and 4. Track identity creation and destruction.
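If you want to see the rolling-ball idea in code before we go through those four components, here is a minimal sketch in plain Python. It assumes the ball moves at roughly constant velocity and that only its position is measured. The class name Kalman1D, the noise values q and r, and the starting covariance are made up for illustration; they are not taken from SORT or DeepSORT.

```python
class Kalman1D:
    """Constant-velocity Kalman filter for a ball rolling along one axis.

    Hypothetical illustration only: all numeric defaults are assumptions.
    """

    def __init__(self, q=1e-3, r=1.0):
        self.x, self.v = 0.0, 0.0      # state: position and velocity
        # 2x2 covariance, stored as four scalars. Velocity starts very uncertain.
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 100.0
        self.q, self.r = q, r          # process noise, measurement noise

    def predict(self, dt=1.0):
        # Motion model: x' = x + v*dt, and P' = F P F^T + Q with F = [[1, dt], [0, 1]].
        self.x += self.v * dt
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z):
        # Only position is observed (H = [1, 0]).
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        y = z - self.x                        # innovation: measurement minus prediction
        self.x += k0 * y
        self.v += k1 * y
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


kf = Kalman1D()
for t in range(1, 11):        # ball is visible and moves 1 unit per frame
    kf.predict()
    kf.update(float(t))
visible_var = kf.p00          # position uncertainty while we can see the ball

for _ in range(3):            # ball is behind the box: predict only, no update
    kf.predict()
occluded_var = kf.p00
```

While the ball is visible the filter leans on the measurements; once it goes behind the box we only call predict, so the estimate keeps moving at the learned velocity and the position uncertainty grows until a new measurement arrives.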
Hmmm, this is where it all starts to come together.

You start with detection. As you learned earlier, detection by itself is not enough for tracking. However, the quality of the detections has a significant impact on tracking performance. Bewley et al. used Faster R-CNN (VGG16) back in 2016. Now you can even use YOLOv4; in our implementation we use YOLOv4, which you can check out in the link down below.

Estimation
So we’ve got detections, now what the f*** do we do with them? Now we need to propagate each target from the current frame to the next using a linear constant velocity model. Remember the homework you did earlier on the Kalman filter? Yes, that time was not wasted. When a detection is associated to a target, the detected bounding box is used to update the target state, where the velocity components are solved optimally via the Kalman filter framework. However, if no detection is associated to the target, its state is simply predicted without correction using the linear velocity model.

Target Association
In assigning detections to existing targets, each target’s bounding box geometry is estimated by predicting its new location in the latest frame. The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved optimally using the Hungarian algorithm. This also handles short-term occlusion well, such as when one target briefly passes in front of another. In your face, swooping pigeon!!

Track Identity Life Cycle
When objects enter and leave the image, unique identities need to be created or destroyed accordingly. For creating trackers, we consider any detection with an overlap less than IOUmin to signify the existence of an untracked object.
The tracker is initialized using the geometry of the bounding box, with the velocity set to zero. Since the velocity is unobserved at this point, the covariance of the velocity component is initialized with large values, reflecting this uncertainty. Additionally, the new tracker then undergoes a probationary period where the target needs to be associated with detections to accumulate enough evidence, in order to prevent tracking of false positives. Tracks are terminated if they are not detected for TLost frames, and you can specify the number of frames for TLost. Should an object reappear, tracking will implicitly resume under a new identity.

Wow, you are absolutely on fire now. With all this SORT power consuming you, you power up even more, power level surging over 9000, screaming until you transform from SORT into your ultimate form: DeepSORT. Super Saiyans, be proud. Now you’re almost there.

So now you explore your newfound powers and learn what separates SORT from the upgraded DeepSORT. In SORT we learned that we use a CNN for detection, but what makes DeepSORT so different? Just look at the full title of the paper, which is Simple Online and Realtime Tracking, or SORT, with a Deep Association Metric. “Hmm, okay Ritz.. I really hope you are going to explain what a deep association metric is.” We’ll discuss this in the next video.. hahaha.. just kidding, I can’t leave you hanging like that. Especially when we are so close to completing the project for the Falcon 9 launch.

Okay, so where is the deep learning in all of this? Well, we have an object detector that provides us with detections, the almighty Kalman filter tracking them and filling in missing tracks, and the Hungarian algorithm associating detections to tracked objects.
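To make that association step a bit more concrete, here is a rough sketch of matching predicted track boxes to new detections using IOU as the cost. Everything here is an assumption for illustration: boxes are (x1, y1, x2, y2) tuples, the function names iou and associate are made up, the iou_min value of 0.3 is an arbitrary choice, and the brute-force search over permutations is only a tiny stand-in for the Hungarian algorithm (in practice you would likely reach for something like SciPy’s linear_sum_assignment instead).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, iou_min=0.3):
    """Match predicted track boxes to detections by minimising total IOU distance.

    Brute force over permutations: fine for a handful of boxes, and only a
    stand-in for the Hungarian algorithm, which finds the same optimum faster.
    """
    n_t, n_d = len(predicted), len(detections)
    cost = [[1.0 - iou(t, d) for d in detections] for t in predicted]
    size = max(n_t, n_d)

    def padded(i, j):
        # Pad to a square problem; dummy rows and columns cost 1.0 (no overlap).
        return cost[i][j] if i < n_t and j < n_d else 1.0

    best = min(permutations(range(size)),
               key=lambda p: sum(padded(i, p[i]) for i in range(size)))

    # Keep only real pairs whose overlap reaches the iou_min threshold.
    matches = [(i, best[i]) for i in range(n_t)
               if best[i] < n_d and 1.0 - cost[i][best[i]] >= iou_min]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(n_t) if i not in matched_t]
    unmatched_dets = [j for j in range(n_d) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets


# Two existing tracks (boxes predicted by the motion model) and three detections.
predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
matches, unmatched_tracks, unmatched_dets = associate(predicted, detections)
```

In this toy case each track pairs with the detection sitting almost on top of it, and the far-away third detection is left unmatched, which is exactly the kind of detection that would start a new track on probation.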
You ask: “So, is deep learning really required here?” Well, while SORT achieves overall good performance in terms of tracking precision and accuracy, and despite the effectiveness of the Kalman filter, it returns a relatively high number of identity switches and has a deficiency in tracking through occlusions and different viewpoints. So, to improve this, the authors of DeepSORT introduced another distance metric based on the “appearance” of the object.

The Appearance Feature Vector
So a classifier is built on our dataset and trained meticulously until it achieves reasonably good accuracy. Then we take this network and strip the final classification layer, leaving behind a dense layer that produces a single feature vector, waiting to be classified. This feature vector is known as the appearance descriptor. Now how this works is that after the appearance descriptor is obtained, the authors use nearest neighbor queries in the visual appearance space to establish the measurement-to-track association. “Measurement-to-track association”, or MTA, is the process of determining the relation between a measurement and an existing track. So now we also use the Mahalanobis distance, as opposed to the Euclidean distance, for MTA, which takes the uncertainty of the Kalman filter’s estimate into account.

So while tensions are mounting at the dawn of launch day, you quickly run your simulation and find that the deep extension to the SORT algorithm reduces the number of identity switches by 45% while achieving overall competitive performance at high frame rates.

Just like that, you find yourself standing alongside Elon in the bunker moments before the commencement of the launch. You clench your fists, you feel the sweat on your brow and your heart beating, saying, “This is it.. this is the moment of truth.” Elon raises the same question that you have on your mind: “So.. will it work?” You stammer a little.. but you answer with a confident “I’m sure it will.” Elon looks forward as the countdown begins. 3… 2… 1… We have lift off!!
Your PTZ camera is set on the target as the rocket lifts up from the ground… So far so good, we have track. However, the rocket is passing through some clouds that are partially occluding the target. The camera is still on target, and the DeepSORT model is holding up quite well. Actually very well, as you notice that the swooping pigeon occluded the camera on multiple occasions without hindrance to the tracker. YES!!! Mission accomplished. Elon looks at you, extends his hand to shake yours and says, “Well done, that was quite impressive.” You can now relax and pop some champagne with the team. Job well done!

That was quite an adventure, through which you have learned about object tracking, and particularly about the DeepSORT model. Just out of curiosity, you search the net for DeepSORT alternatives and create a quick comparison. You find 3, which are:
1. Tracktor++, which is pretty accurate, but one big drawback is that it is not viable for real-time tracking. Results show an average execution speed of 3 FPS. If real-time execution is not a concern, this is a great contender.
2. TrackR-CNN, which is nice because it provides segmentation as a bonus. But as with Tracktor++, it is hard to utilize for real-time tracking, having an average execution speed of 1.6 FPS.
3. JDE, which displayed a decent performance of 12 FPS on average. It is important to note that the input size for the model is 1088×608, so accordingly we should expect JDE to reach a lower FPS if the model is trained on Full HD. Nevertheless, it has great accuracy and should be a good selection.

DeepSORT is the fastest of the bunch, thanks to its simplicity. It produced 16 FPS on average while maintaining good accuracy, definitely making it a solid choice for multiple object detection and tracking.
If you guys enjoyed this video, please like, share and subscribe. Comment on whether you would use DeepSORT for your own object tracking applications. And if you want to learn how to implement DeepSORT with the robust YOLOv4 model, then click the link below to enroll in our YOLOv4 PRO course. Thank you for watching, and we’ll see you in the next video.