Abstract: To address the low tracking accuracy in video sequences caused by appearance changes, background clutter, and severe occlusion, a novel two-stage adaptive tracking model is proposed. The model comprises two phases: target detection and bounding box estimation. In the target detection phase, the model coarsely locates the target; in the bounding box estimation phase, it determines the target's exact position. To cope with complex video scenes and the difficulty of tracking small targets, multi-feature fusion is employed to construct a rich target representation. Experimental results show that, compared with models such as Simple Online and Realtime Tracking (SORT), Tracktor++, FairMOT, and Transformer, the proposed model achieves the best overall performance, effectively balancing computational speed against tracking accuracy and showing good potential for practical application.
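The coarse-to-fine idea summarized above can be illustrated with a minimal sketch. It is not the proposed model: the fused features here are simply raw intensity concatenated with gradient magnitude, the matching score is cosine similarity, the box size is held fixed, and all function names (`fused_features`, `coarse_detect`, `refine_box`) and parameters (`stride`, `radius`) are hypothetical choices made for illustration only.

```python
import numpy as np


def fused_features(patch):
    """Concatenate two cues (raw intensity, gradient magnitude) and L2-normalise."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)  # derivatives along rows, then along columns
    feat = np.concatenate([patch.ravel(), np.hypot(gx, gy).ravel()])
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat


def similarity(frame, template_feat, x, y, w, h):
    """Cosine similarity between the template and the frame patch at (x, y)."""
    return float(fused_features(frame[y:y + h, x:x + w]) @ template_feat)


def coarse_detect(frame, template, stride=4):
    """Stage 1 (assumed form): scan a sparse grid, return the best corner (x, y)."""
    h, w = template.shape
    t = fused_features(template)
    candidates = [(x, y)
                  for y in range(0, frame.shape[0] - h + 1, stride)
                  for x in range(0, frame.shape[1] - w + 1, stride)]
    return max(candidates, key=lambda p: similarity(frame, t, p[0], p[1], w, h))


def refine_box(frame, template, x0, y0, radius=4):
    """Stage 2 (assumed form): dense search near the coarse estimate.

    Returns (x, y, w, h); the box size is kept equal to the template size.
    """
    h, w = template.shape
    t = fused_features(template)
    xs = range(max(0, x0 - radius), min(frame.shape[1] - w, x0 + radius) + 1)
    ys = range(max(0, y0 - radius), min(frame.shape[0] - h, y0 + radius) + 1)
    x, y = max(((x, y) for y in ys for x in xs),
               key=lambda p: similarity(frame, t, p[0], p[1], w, h))
    return x, y, w, h


# Synthetic frame: one Gaussian blob whose 16x16 box starts at (x=13, y=22).
yy, xx = np.mgrid[0:64, 0:64]
frame = np.exp(-((xx - 21) ** 2 + (yy - 30) ** 2) / (2 * 3.0 ** 2))
template = frame[22:38, 13:29].copy()

cx, cy = coarse_detect(frame, template)    # rough location on the sparse grid
box = refine_box(frame, template, cx, cy)  # exact box from the local search
print("coarse:", (cx, cy), "refined:", box)
```

The sparse first pass keeps the number of evaluated patches small, and the dense second pass restores pixel-level precision only where it is needed, which is one simple way a two-stage design can trade computational speed against accuracy.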