Abstract: Large-scale network flow training data sets are essential for building accurate and stable network traffic classifiers. However, as the feature dimension of network flows grows and the data sets expand, neither the analysis of the flows nor the training of a classifier based on the Support Vector Machine (SVM) can be completed in an acceptable time. A distributed, parallel method for processing large-scale network flows on the Hadoop cloud computing platform is proposed. Distributed training of an SVM network traffic classifier is implemented with MapReduce on Hadoop, yielding the CloudSVM network traffic classifier. Trace files of large-scale network traffic, captured from the mirrored egress traffic of a campus network, are stored and processed in a distributed manner, and the resulting sample data sets are classified. The experiments verify the efficiency of Hadoop-based distributed storage and parallel processing for large-scale network data sets. They also verify that the CloudSVM classifier converges quickly to the optimum without reducing classification accuracy, and that, as the number of network flow samples increases, the training time of the SVM classifier approaches a constant.
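The abstract does not spell out the training loop. As a rough illustration of how MapReduce-style SVM training can be organized, the single-process Python sketch below has each mapper train on its own data split plus a shared set of candidate support vectors, while the reducer merges the candidates and retrains on them. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`train_linear_svm`, `near_margin`, `map_phase`, `reduce_phase`, `cloud_svm`, `make_flows`), the synthetic two-feature "flows", the sub-gradient solver, the fixed number of rounds, and the use of the k samples nearest the boundary as a stand-in for true support vectors. No Hadoop API is called.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(samples, epochs=300, lr=0.1, lam=0.01):
    """Full-batch sub-gradient descent on the L2-regularised hinge loss."""
    dim, n = len(samples[0][0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in samples:
            if y * (dot(w, x) + b) < 1.0:  # sample violates the margin
                grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def near_margin(samples, w, b, k=5):
    """Assumed stand-in for support vectors: the k samples of each
    class with the smallest functional margin."""
    kept = []
    for label in (1, -1):
        cls = [(x, y) for x, y in samples if y == label]
        cls.sort(key=lambda s: s[1] * (dot(w, s[0]) + b))
        kept.extend(cls[:k])
    return kept

def map_phase(partition, global_svs):
    """Mapper: train on the local split plus the shared candidates."""
    data = partition + [s for s in global_svs if s not in partition]
    w, b = train_linear_svm(data)
    return near_margin(data, w, b)

def reduce_phase(sv_lists):
    """Reducer: merge candidates from all mappers, drop duplicates, retrain."""
    merged = list(dict.fromkeys(s for svs in sv_lists for s in svs))
    return merged, train_linear_svm(merged)

def cloud_svm(flows, n_splits=4, rounds=2):
    """Driver: repeat map and reduce for a fixed number of rounds."""
    partitions = [flows[k::n_splits] for k in range(n_splits)]
    global_svs, model = [], None
    for _ in range(rounds):
        emitted = [map_phase(p, global_svs) for p in partitions]
        global_svs, model = reduce_phase(emitted)
    return global_svs, model

def make_flows(n=200, seed=0):
    """Synthetic two-feature samples; labels alternate in blocks of four
    so that each of the four strided splits contains both classes."""
    rng = random.Random(seed)
    flows = []
    for i in range(n):
        y = 1 if (i // 4) % 2 == 0 else -1
        flows.append(((rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)), y))
    return flows

def accuracy(flows, model):
    w, b = model
    return sum(1 for x, y in flows if y * (dot(w, x) + b) > 0) / len(flows)

flows = make_flows()
svs, model = cloud_svm(flows)
acc = accuracy(flows, model)
print(len(svs), "candidate support vectors, accuracy", acc)
```

In this sketch only the small candidate set travels between rounds, so the reducer's work depends on the number of candidates rather than on the total number of samples. That is one plausible reading of why training time could flatten as the data set grows, though the sketch does not reproduce the paper's measurements.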