Filtering Non-Actionable Static Analysis Warnings with Sentence Transformers: STAF Architecture and Cross-Project Performance

Ciprian Ciprian · 11 min read

What the paper studies

Between 35% and 91% of warnings produced by static code analysis (SCA) tools are non-actionable — a range documented across multiple prior studies and tools. Non-actionable warnings include both false positives, where the tool flags something that is not an issue, and true positives that developers deliberately deprioritize because the finding is irrelevant to the current development context. At sufficient volume, this noise causes alert fatigue: warnings accumulate faster than developers can triage them, and the SCA tool gets ignored entirely. This paper introduces STAF (Sentence Transformer-based Actionability Filtering), a machine learning classifier that processes the three core artifacts present in any SCA report — the warning message, the specific flagged line of code, and the surrounding source code context — and labels each finding as actionable or non-actionable. The goal is a shorter, higher-precision list of issues that developers will actually address.

Methodology

The evaluation uses a dataset of more than one million SCA reports from Java projects, compiled by Kószó et al. (2025). Java was selected because it has the densest published literature on SCA warning filtering, enabling direct comparison with existing methods. Reports were generated by tools including SonarQube, SpotBugs, and PMD.

Each report provides three inputs to STAF. The warning message is the natural-language description the SCA tool emits for the finding. The warning line is the specific line of source code the tool flagged. The source code context is the set of lines immediately before and after the warning line. These three inputs match what a developer checks when triaging a warning manually.
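A minimal sketch of how these three inputs can be assembled from a report entry and its source file; the field names, helper function, and five-line window below are illustrative assumptions rather than details taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class WarningExample:
    message: str        # natural-language warning text emitted by the SCA tool
    warning_line: str   # the flagged source line
    pre_context: str    # lines immediately before the flagged line
    post_context: str   # lines immediately after the flagged line

def build_example(message: str, source_lines: list[str], line_no: int,
                  window: int = 5) -> WarningExample:
    """Slice a +/- `window`-line context around the flagged line.
    SCA tools report 1-based line numbers, hence the index shift."""
    idx = line_no - 1
    return WarningExample(
        message=message,
        warning_line=source_lines[idx],
        pre_context="\n".join(source_lines[max(0, idx - window):idx]),
        post_context="\n".join(source_lines[idx + 1:idx + 1 + window]),
    )
```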

STAF has two embedding components. The Warning Line Embedder (WLE) processes the flagged line using a RoFormer — a BERT-style autoencoding transformer with rotary position embeddings — trained from scratch on the task. RoFormer’s rotary embeddings improve classification on longer token sequences, which matters for flagged lines containing full method signatures or complex expressions. The WLE uses InCoder’s tokenizer, a code-specific tokenizer that compresses source code tokens more efficiently than general-purpose tokenizers. The chosen WLE configuration uses two transformer layers, 512 hidden dimensions, eight attention heads, and a dropout rate of 0.2 to limit overfitting. The Context Embedder (CE) handles the warning message and surrounding source lines using mGTE, a multilingual sentence transformer with 305 million parameters that produces 768-dimensional embeddings. mGTE was chosen for its results on the MTEB benchmark’s coding and general-language tasks and for its inference efficiency — it runs without specialized GPU hardware on inputs up to 8,192 tokens. A feature fusion component aggregates the WLE and CE outputs before a binary classification head.
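A rough PyTorch sketch of this layout, assuming concatenation as the fusion step and a small MLP as the classification head (neither detail is specified in this summary); the WLE stand-in below uses a plain `nn.TransformerEncoder` rather than a true RoFormer, so it omits rotary position embeddings:

```python
import torch
import torch.nn as nn

class WarningLineEmbedder(nn.Module):
    """Stand-in for the WLE: two layers, 512 hidden dims, eight heads, dropout 0.2."""
    def __init__(self, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           dropout=0.2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token states into one 512-dim embedding per flagged line.
        return self.encoder(self.embed(token_ids)).mean(dim=1)

class ActionabilityClassifier(nn.Module):
    """Fuses the 512-dim WLE output with a 768-dim mGTE context embedding and
    predicts actionable vs. non-actionable (fusion and head are assumptions)."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.wle = WarningLineEmbedder(vocab_size)
        self.head = nn.Sequential(
            nn.Linear(512 + 768, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, 2),
        )

    def forward(self, token_ids: torch.Tensor, context_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.wle(token_ids), context_emb], dim=-1)
        return self.head(fused)   # logits over {non-actionable, actionable}
```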

Pre-processing strips single-line suppression comments — such as // NOSONAR in SonarQube — from warning lines before embedding. Without this step, the model would learn to classify a comment-suppressed line as non-actionable from the comment text alone, a shortcut that does not generalize to unsuppressed lines.
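A sketch of this pre-processing step; `// NOSONAR` comes from the article, while `// NOPMD` is added here as an assumed example of another tool with a similar inline suppression convention:

```python
import re

SUPPRESSION_PATTERN = re.compile(r"//\s*(NOSONAR|NOPMD)\b.*$")

def strip_suppressions(line: str) -> str:
    """Remove single-line suppression comments so the embedder never sees them."""
    return SUPPRESSION_PATTERN.sub("", line).rstrip()

assert strip_suppressions("conn.close(); // NOSONAR reviewed 2023") == "conn.close();"
```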

The evaluation covers two setups. In the within-project setting, training and test data come from the same project, split by individual warning instance. The authors note this setup is susceptible to favorable random splits, so a 5-fold cross-validation supplements the primary evaluation. In the cross-project setting, the model is trained on one set of Java projects and applied to held-out projects it has never seen during training. Three baselines provide comparison: DeepInfer (a CodeBERTa-based classifier fine-tuned on Infer analyzer warnings), PRISM (an ensemble of Code Representation Learning models combining Word2Vec and FastText embeddings with BiLSTM networks via a voting strategy), and ChatGPT (applied via prompt engineering to classify each warning without fine-tuning). An ablation study removes the pre-context and post-context source code embeddings to measure their isolated contribution.
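The difference between the two settings comes down to how the data is split. A sketch under the assumption that standard scikit-learn splitters are an adequate stand-in for the paper's procedure:

```python
from sklearn.model_selection import GroupKFold, StratifiedKFold

def within_project_folds(X, y, n_splits: int = 5):
    # Within-project: warning instances from one project are shuffled into
    # train and test, as in the paper's 5-fold cross-validation.
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0).split(X, y)

def cross_project_folds(X, y, project_ids, n_splits: int = 5):
    # Cross-project: all instances from a given project land entirely in train
    # or entirely in test, so test projects are never seen during training.
    return GroupKFold(n_splits=n_splits).split(X, y, groups=project_ids)
```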

Findings

Within-project evaluation: F1=89%, MCC=0.87, exceeding all baselines by at least 11 pp in F1 and 4 pp in MCC. Five-fold cross-validation confirms these numbers hold across different training-test splits. MCC is the more informative metric here: the dataset has class imbalance between actionable and non-actionable findings, and a model predicting the majority class on every warning would achieve high accuracy but MCC near zero. An MCC of 0.87 means STAF’s predictions are strongly correlated with the true labels across both the actionable and non-actionable classes — the classifier is not just recovering the majority class.
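A quick sketch of why MCC is the more honest metric here, using synthetic labels with an assumed 80/20 class imbalance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.20).astype(int)   # 20% actionable, 80% not
y_majority = np.zeros_like(y_true)                 # always predict "non-actionable"

print(accuracy_score(y_true, y_majority))      # ~0.80, looks respectable
print(matthews_corrcoef(y_true, y_majority))   # 0.0, exposes the trivial model
```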

Cross-project generalization: at least 16 pp F1 lead, at least 9 pp MCC lead. When applied to projects not seen during training, STAF’s absolute performance declines compared to the within-project setting — expected behavior when encountering new codebases and new warning type distributions. STAF’s cross-project F1 and MCC still exceed all three baselines by at least 16 pp and 9 pp respectively. The cross-project margin over baselines is wider than the within-project margin, indicating that the baselines degrade more steeply than STAF when applied to unseen projects. mGTE was pre-trained on multilingual text and code before STAF fine-tuning, giving it embeddings that generalize to Java projects with different codebases and warning distributions. PRISM’s Word2Vec and FastText vectors were trained on the training-set vocabulary; DeepInfer was fine-tuned on Infer-specific warnings. Both fit more tightly to the projects they trained on.

Source code context is required: removing it drops F1 by 10 pp and MCC by 11 pp. The ablation study removes pre-context and post-context source code embeddings from the CE, leaving only the warning message and the flagged line as inputs. F1 drops by 10 percentage points and MCC by 11 points. The paper provides a concrete illustration: two Java warnings with identical messages — both suggesting replacement of explicit type arguments with the diamond operator <> — differ in actionability because one appears inside an anonymous class definition where the diamond operator triggers type inference failures in many Java environments. Without surrounding lines, no classifier using only the warning message and flagged line can separate these two cases. At the scale of the dataset — more than one million reports — context-dependent warnings like this are numerous enough that the 10 pp and 11 pp drops reflect a systematic gap, not isolated edge cases.

Alert fatigue is pervasive across tools and projects. Research cited in the paper documents non-actionable warning rates between 35% and 91% depending on the SCA tool and project; the range reflects genuine variation in tool precision and project-specific warning distributions. Even genuinely detected issues — true positives — become non-actionable when the code context makes the suggested fix inappropriate. This means STAF learns two distinct things from labeled triage data: which warnings are false positives, and which warnings reflect real issues that the team has decided not to fix given the current codebase state. Both categories reduce the warning list; only the first involves the tool being wrong.

mGTE delivers competitive embeddings at 305M parameters without specialized hardware. The choice of mGTE as the CE backbone reflects an explicit engineering constraint: the model fits within standard server memory budgets, produces 768-dimensional embeddings competitive on the MTEB coding benchmark, and processes inputs up to 8,192 tokens. Prior SCA filtering approaches requiring backward program slice computation are computationally infeasible for large codebases — a backward slice traces all statements in the control and data flow graph that could influence the flagged variable, requiring a full traversal of the program’s dependency graph for each warning. Approaches depending on software change history or user review data may be unavailable or conflict with privacy requirements. STAF’s architecture removes both dependencies by embedding nearby source lines rather than computing program structure.
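A sketch of using an off-the-shelf mGTE checkpoint to embed a warning message together with its flagged line; `Alibaba-NLP/gte-multilingual-base` is assumed here to be the mGTE release cited as Zhang et al. (2024), and recent sentence-transformers versions need `trust_remote_code=True` to load it:

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint name for mGTE; verify against the Zhang et al. (2024) release.
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

text = (
    'Replace the type specification in this constructor call with the diamond operator "<>". '
    "List<String> names = new ArrayList<String>();"
)
embedding = model.encode([text])   # shape (1, 768), runs on CPU
print(embedding.shape)
```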

Limitations

The paper’s evaluation covers Java exclusively. Java was selected because it has the largest body of published SCA filtering research, enabling rigorous comparison, but SCA tools for Python, Go, TypeScript, and Rust produce reports with different message structures, different false-positive profiles, and different code context conventions. Whether STAF’s sentence-embedding approach transfers to those languages without retraining on language-specific labeled data is not tested.

The within-project setting requires a labeled dataset from the target project — historical records of which warnings developers acted on versus closed without fixing. Projects without this history cannot immediately benefit from within-project STAF and need to collect labels before fine-tuning. The cross-project model provides a usable starting point, but the paper reports the cross-project margin over baselines rather than the absolute cross-project F1 score. Teams cannot set absolute accuracy expectations from the published numbers alone.

The suppression comment removal step assumes those comments are consistent indicators of deliberate non-actionability decisions. In codebases where comments like // NOSONAR are applied inconsistently — added to silence noise rather than to record a considered triage decision — this pre-processing step may introduce training bias.

The dataset is drawn from open-source Java projects. Closed-source enterprise codebases may have substantially different warning distributions, different code generation conventions, and different developer triage practices not represented in the training data. Enterprise Java projects often use code generators — Lombok annotations, JAXB-generated classes, or framework scaffolding — that produce warning patterns absent from the open-source training set. Organization-wide suppression conventions may also preemptively label entire warning categories as non-actionable in ways not reflected in the labeled data.

Finally, ChatGPT is used as a baseline via prompt engineering rather than fine-tuning. A fine-tuned large language model might perform differently, though the paper’s primary engineering contribution is a lightweight classifier that does not require inference against a large hosted model for every warning.

Real-world application

Start capturing triage labels now. STAF’s 89% F1 in within-project mode requires labeled examples of developer triage decisions. Most teams do not record this data systematically: SCA findings get closed in the issue tracker without any annotation distinguishing a fix from a dismissal. Instrument your CI pipeline or issue tracker to capture this distinction per finding. After 500 to 1,000 labeled examples per major warning category, a model like STAF can be fine-tuned to filter that category’s noise automatically. This labeling infrastructure is the concrete prerequisite for within-project accuracy.
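One way to make that instrumentation concrete is a minimal per-finding record appended to a JSONL log from CI or the issue tracker; all field names and example values below are illustrative:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class TriageRecord:
    finding_id: str     # stable ID for the finding, e.g. project:file:line:rule
    tool: str           # "sonarqube", "spotbugs", "pmd", ...
    rule_id: str        # warning category the filter will be trained per
    actionable: bool    # True if a developer fixed it, False if dismissed / won't-fix
    decided_by: str
    decided_at: str

record = TriageRecord(
    finding_id="billing-service:src/Invoice.java:42:java:S2095",
    tool="sonarqube",
    rule_id="java:S2095",
    actionable=False,
    decided_by="reviewer@example.com",
    decided_at=datetime.now(timezone.utc).isoformat(),
)

with open("triage_labels.jsonl", "a") as fh:
    fh.write(json.dumps(asdict(record)) + "\n")
```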

Measure actionable warning rate, not warning count. The 35–91% non-actionable rate means raw warning count is not a useful metric for evaluating SCA tool value. Track the fraction of warnings that developers fix within a defined time window — the actionable warning rate — per tool, per project, and per warning category. Teams measuring only total warnings will consistently overestimate the value of high-volume, low-precision tools and underestimate the cost of alert fatigue on developer throughput.
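Computed from triage records like the ones sketched above, the metric is a simple per-category ratio; the grouping key is an assumption:

```python
from collections import defaultdict

def actionable_rate(records: list[dict]) -> dict[str, float]:
    """Fraction of findings developers actually fixed, per (tool, rule) pair."""
    fixed, total = defaultdict(int), defaultdict(int)
    for r in records:
        key = f"{r['tool']}:{r['rule_id']}"
        total[key] += 1
        fixed[key] += int(r["actionable"])
    return {key: fixed[key] / total[key] for key in total}
```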

Use cross-project STAF as an immediate floor for Java services. For teams managing multiple Java services without per-project labeled datasets, a pre-trained STAF model deployed in cross-project mode immediately outperforms PRISM and DeepInfer by at least 16 pp in F1. The practical deployment path: run the cross-project model on SCA output today to produce a reduced warning list; collect developer feedback on that filtered list to build project-specific labels; use those labels to fine-tune toward within-project accuracy over time. This approach gets measurable value from the model before the labeling investment pays off.

Feed surrounding source lines to any SCA classifier. The ablation result — a 10 pp F1 drop from removing context embeddings — applies to any filtering approach, not just STAF. If your current SCA classifier or LLM-based code review tool classifies warnings using only the warning message or the flagged line, add pre-context and post-context lines and retrain or re-prompt. The STAF numbers provide a published benchmark to measure improvement against.

For security teams: verify recall on high-severity categories before deployment. STAF filters both false positives and deprioritized true positives. A filter trained to reduce alert fatigue will suppress some genuine security findings if those findings are consistently deprioritized in context. Before deploying any ML-based SCA filter in a compliance-sensitive pipeline, measure the filter’s recall on high-severity warning categories — null dereferences, injection patterns, deserialization issues — on a held-out labeled set, and establish a minimum acceptable recall threshold before the filter reaches production.
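A sketch of that pre-deployment check; the 0.95 threshold and the category names passed in are placeholders to be set by your own compliance policy:

```python
from sklearn.metrics import recall_score

MIN_RECALL = 0.95  # placeholder policy threshold

def high_severity_recall_failures(y_true, y_pred, categories, high_severity):
    """Recall of the actionable class per high-severity category on a held-out set.
    Returns the categories whose recall falls below MIN_RECALL."""
    failures = {}
    for cat in set(high_severity):
        yt = [t for t, c in zip(y_true, categories) if c == cat]
        yp = [p for p, c in zip(y_pred, categories) if c == cat]
        if yt:
            r = recall_score(yt, yp, pos_label=1, zero_division=0)
            if r < MIN_RECALL:
                failures[cat] = r
    return failures
```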

References

Aladics, T., Vándor, N., Ferenc, R., & Hegedűs, P. (2026). Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts. arXiv:2604.18525v1. https://arxiv.org/abs/2604.18525v1

Dataset: Kószó et al. (2025). Java SCA report dataset — more than 1 million labeled SCA warnings from open-source Java projects.

Tools and models cited: SonarQube, SpotBugs, PMD (SCA tools); mGTE — Zhang et al. (2024); InCoder tokenizer — Fried et al. (2022); RoFormer — Su et al. (2024); HuggingFace Transformers library; DeepInfer / CodeBERTa — Kharkar et al. (2022); PRISM — Yang et al. (2024); MTEB benchmark — Muennighoff et al. (2022).