This paper reports an empirical study of language-guided feature selection for DDoS and intrusion detection on the CICIDS2017 MachineLearningCSV flow data. The central question is whether an LLM-style semantic reading of CICFlowMeter feature names can reduce the feature set while preserving detection performance and lowering false alarms. The experiment used the eight labeled CICIDS2017 CSV sessions, removed only rows containing non-finite numeric values, and retained 2,827,876 flows with 78 original numeric features. A semantic feature screen selected 32 features describing service context, duration, packet and byte volume, flow rates, inter-arrival timing, TCP flags, window sizes, and active/idle behavior. The evaluation compared the full feature set with the language-selected set under five configurations: full-corpus binary and multiclass logistic regression trained with stochastic gradient descent (SGD), a DDoS-specific Random Forest, a DDoS-specific SGD logistic regression, and a compact multilayer perceptron. The best DDoS result was obtained by Random Forest with the selected features: F1 = 0.999896, false-positive rate = 0.000068, and eight errors on 67,714 test flows. The selected features reduced DDoS Random Forest training time by 23.78% and reduced full-corpus SGD training time by about one half, although the full feature set performed better for the full-corpus binary linear model. Ablation showed that removing the TCP flag/window features or the destination-port feature produced the largest DDoS degradation. The findings support language-guided feature selection as a practical compression step for latency-sensitive DDoS mitigation, while retaining all features remains advisable for broad multiclass intrusion detection when a linear learner is used.
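As a reading aid for the reported DDoS metrics, the following minimal sketch shows how F1 and the false-positive rate follow from confusion-matrix counts. The `Confusion` class and the specific counts are assumptions made for illustration: the abstract gives only the totals (eight errors on 67,714 test flows), so the split into two false positives and six false negatives is a hypothetical breakdown chosen to be consistent with the reported figures, not a result taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts, with 'attack' as the positive class."""
    tp: int  # attack flows flagged as attack
    fp: int  # benign flows flagged as attack (false alarms)
    tn: int  # benign flows passed as benign
    fn: int  # attack flows missed

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.tn + self.fn

    @property
    def errors(self) -> int:
        return self.fp + self.fn

    def f1(self) -> float:
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0

    def false_positive_rate(self) -> float:
        # FPR = FP / (FP + TN), the share of benign flows raised as alarms.
        denom = self.fp + self.tn
        return self.fp / denom if denom else 0.0


# HYPOTHETICAL counts (assumption, not from the paper): one split of
# 67,714 flows with eight errors that is consistent with the reported metrics.
example = Confusion(tp=38457, fp=2, tn=29249, fn=6)

print(f"flows={example.total} errors={example.errors}")
print(f"F1={example.f1():.6f} FPR={example.false_positive_rate():.6f}")
```

Because the false-positive rate depends only on benign flows, a very small number of false alarms can dominate it even when F1 is near 1, which is why the two figures are reported separately.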
Copyrights © 2025