The Two Things Every WordPress Scanner Gets Wrong on Real Sites

Most malware scanners are tested on a fresh WordPress install with twelve plugins and a couple thousand files. That site takes forty seconds to scan. The screenshots in the product tour look great.

Real sites aren’t that site. A three-year-old agency build running WooCommerce and a page builder has 60,000 to 80,000 files before you count uploads, and shared hosting gives that scanner about ninety seconds of PHP execution time before the server kills it. You can watch the progress bar climb to 48,000 files and then stop, forever, because somewhere inside the scanner an AJAX request timed out and the JavaScript on the other end has no plan for what to do when that happens.

The second thing those scanners get wrong is more subtle. They flag a file as suspicious, hand you a pattern match and a confidence score, and that’s all you see. Nothing about whether anyone has actually tried to attack this file, whether it was touched during a login attempt from Belarus an hour ago, or whether the firewall already blocked seven requests aimed at it last week. You’re the one who has to clean it up, working from a fraction of the evidence that already exists on the same server.

These are the two problems I’ve spent the last few weeks fixing. Here is what they look like up close and what I ended up building.

The Scanner That Never Finishes

The textbook architecture for a web-based malware scanner is a loop. JavaScript in your admin panel asks the server “scan the next batch of files,” the server scans them and returns results, the JavaScript asks again. Simple, and it works beautifully on your laptop with 2,000 files.

On an 81,279-file site it breaks in seven different ways.

Timeouts are the obvious one. Shared hosting gives you ninety seconds, sometimes less. A batch of 100 files of mixed sizes can easily take longer than that, especially when one of them is a 4MB theme file with 600 regex rules being run against it. The server kills the request mid-scan and returns a 504 to the browser.

Memory is the next one. PHP’s memory limit on entry-level hosting is typically 128MB, and some of that is already spent on WordPress itself. Load a large file, run pattern matching against it, hold the results in an array, add another file, and at some point you hit the limit and the process dies.

Then there’s the browser tab. The entire scan lives inside a JavaScript state machine in a single tab. The user closes the tab to check email and the scan dies with it. They come back an hour later, see the progress bar still sitting at 42%, and assume the scan is running, when it actually died fifty minutes ago, the moment the tab closed.

The next four involve network flakiness, opcache misses, database lock contention, and one particularly fun case where a customer’s security plugin was rate-limiting the scanner’s own AJAX endpoint. I’ll spare you those.

What I built instead is a scanner that expects every one of these to happen.

Retry with backoff on transient errors

When a batch request fails with a 502, 503, 504, or a generic “request failed,” the client does not give up. It waits, retries, waits longer, retries again, up to five times. The backoff is exponential, so the retries do not pile on top of a struggling server.

// Client-side retry on transient scanner errors
const TRANSIENT = [0, 502, 503, 504]; // 0 = network failure, no response at all
const MAX_RETRIES = 5;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scanBatch(files, attempt = 0) {
    try {
        return await postChunk(files);
    } catch (err) {
        if (attempt >= MAX_RETRIES) throw err;
        if (!TRANSIENT.includes(err.status)) throw err;

        // Exponential backoff, capped at 30 seconds
        const delay = Math.min(30000, 1000 * Math.pow(2, attempt));
        await sleep(delay);
        return scanBatch(files, attempt + 1);
    }
}

Most transient errors recover on the second attempt. The fifth is there for the truly miserable shared host that drops every third request during backup hours.

Adaptive chunk sizing

A “batch” used to be a fixed number of files, say 100. That’s fine until you hit a directory full of minified JavaScript bundles and suddenly the batch takes four times longer than the time budget allows. So I stopped shipping fixed batch sizes.

The scanner now tracks how long recent batches took and adjusts the next batch size in real time. A batch that ran in twelve seconds tells the server to try a larger one next. A batch that took sixty seconds tells the server to cut it in half. The floor is 1 file, because on broken hosts even that is sometimes all you can get through in a single request.
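In sketch form, the feedback loop looks something like this. The target, floor, and ceiling numbers are illustrative, not the shipped values:

```javascript
// Sketch: adaptive batch sizing driven by how long the last batch took.
const TARGET_MS = 20000;   // aim for ~20s per batch, well under a 90s kill limit
const MIN_FILES = 1;       // broken hosts sometimes manage only one file per request
const MAX_FILES = 400;

function nextBatchSize(current, lastDurationMs) {
    if (lastDurationMs > TARGET_MS * 2) {
        // Way over budget: cut the batch in half
        return Math.max(MIN_FILES, Math.floor(current / 2));
    }
    if (lastDurationMs < TARGET_MS / 2) {
        // Finished quickly: grow, but gently
        return Math.min(MAX_FILES, Math.ceil(current * 1.5));
    }
    return current; // close enough to target, keep it
}
```

The asymmetry is deliberate: halving on a slow batch backs off fast when the host is struggling, while growing by only 1.5x avoids overshooting right back into a timeout.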

Resume across tab close

This is the one I’m proudest of. Every batch’s results are persisted to the browser’s localStorage immediately on receipt (sessionStorage would not work here, since it is scoped to the tab and destroyed when the tab closes). If the tab closes, if the user navigates away, if the scan crashes, the next time they open the scanner page a banner appears:

Scan paused at 47,213 of 81,279 files.
[Resume Scan]  [Start Over]

They click resume, the scanner reads the last known batch offset from storage and picks up exactly where it left off. No rescanning the 47,213 files it already covered. No losing an hour of work because someone closed a tab.
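A minimal sketch of that persistence layer, assuming localStorage (which, unlike per-tab sessionStorage, survives the tab being closed) and an illustrative key and record shape:

```javascript
// Sketch: persist scan progress after every batch so a closed tab can resume.
// The key name and record shape are illustrative, not the shipped format.
const PROGRESS_KEY = 'scan-progress';

function saveProgress(offset, total, findings) {
    // Called immediately when a batch result arrives
    localStorage.setItem(PROGRESS_KEY, JSON.stringify({ offset, total, findings }));
}

function loadProgress() {
    // Called on page load; null means no interrupted scan to resume
    const raw = localStorage.getItem(PROGRESS_KEY);
    return raw ? JSON.parse(raw) : null;
}

function clearProgress() {
    // Called on scan completion or "Start Over"
    localStorage.removeItem(PROGRESS_KEY);
}
```

On page load, a non-null `loadProgress()` result is what drives the “Scan paused at 47,213 of 81,279 files” banner; Resume feeds the stored offset back into the batch loop, Start Over calls `clearProgress()`.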

Sub-scanner isolation

The old scanner did file scanning, database scanning, integrity checks, and vulnerability correlation all inside a single long-running process. If the integrity check tripped over a weird file permission, the whole pipeline collapsed and you got no results at all.

Now each sub-scanner runs as its own isolated request. They chain, but one failing does not take down the others. You might get file scan results plus database results plus a “vulnerability check failed, retry” notice, instead of nothing.
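A sketch of the isolation pattern, with `runSubScan` standing in for the real per-scanner request (the scanner names here are taken from the list above, the function shape is an assumption):

```javascript
// Sketch: chain sub-scanners so one failure cannot take down the others.
const SUB_SCANNERS = ['files', 'database', 'integrity', 'vulnerabilities'];

async function runAllScans(runSubScan) {
    const results = {};
    for (const name of SUB_SCANNERS) {
        try {
            results[name] = { ok: true, data: await runSubScan(name) };
        } catch (err) {
            // Isolate the failure: record it, keep going with the rest
            results[name] = { ok: false, error: String(err.message || err) };
        }
    }
    return results;
}
```

The UI then renders whatever came back: file results, database results, and a per-scanner “failed, retry” notice for anything that didn’t, instead of a single all-or-nothing outcome.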

Those four changes, plus one more I’m not going to talk about yet, took scan completion rate on 80k-file sites from roughly 40% to roughly 99%. The remaining 1% are hosts so broken that nothing short of moving to a real server will save them.

The Finding That Knows Who Touched It

The other problem is context. A scanner finds a file with the pattern @eval($_POST["x"]) in it. The file is in wp-content/uploads/2024/11/.config.log.php. It is 39 bytes long, it has no other code in it, and it was last modified three hours ago. Is it malicious?

Yes, obviously. But walk through what you would actually want to see if you were not 100% sure: whether anyone has tried to reach that exact file from the outside, whether there were POST requests to it in the last hour, whether those requests came from an IP already blocked by the firewall for trying twelve other exploit patterns earlier today, whether a legitimate plugin writes to that path during normal operation. Every scanner I know shows you the first part (the pattern match) and none of the rest, and I spent a long time thinking about why.

The fix requires the scanner and the firewall to talk to each other, and most WordPress security plugins draw a hard line between those two things. The firewall runs on every request and has no persistent memory beyond a ring buffer of recent hits. The scanner runs on a schedule and has no idea what the firewall saw. They’re deliberately decoupled because coupling them is expensive.

The scanner and the firewall are sitting three tables apart in the same database and don’t talk to each other.

I decided expensive was fine, because the upside is that a “suspicious file” finding can now look like this:

wp-content/uploads/2024/11/.config.log.php

Pattern match: webshell (eval + POST), confidence 0.94
Last modified: 3 hours ago

Firewall activity on this path:
  12 POST requests in the last hour
  4 from IP 185.220.XXX.XXX (Tor exit node)
  3 from IP 45.146.XXX.XXX (already blocked for brute force)
  5 from IP 193.32.XXX.XXX (matches 7 known exploit signatures)

Correlation score: critical
Recommended action: quarantine immediately

The pattern match alone would have been enough to flag it. The pattern match plus twelve POST requests from three hostile IPs in the last hour is a different category of evidence. Instead of staring at a theoretical vulnerability, you are staring at an active fight, and the scanner can now give you the score of that fight.

How the fusion works without crushing shared hosting

The naive approach to this is to have the scanner query the firewall log every time it flags a file. That works on a six-file test site. On a real site with 45,000 firewall events per day and 80,000 files, you would be running 80,000 queries against a 45,000-row table, and the scan would never finish because we are back to Part One.

The approach that actually works is to flip it around. When the firewall sees a hostile request, it canonicalizes the target path once, hashes it, and writes a small compact record. At scan time, the scanner loads the full set of recent hostile paths into memory in a single query, hashes the path of each file it’s scanning, and checks the hash table. Zero database queries per file. The fusion happens in microseconds per file instead of milliseconds.

There are a few other tricks in there involving how paths get normalized across symlinks, how mod-rewritten URLs get mapped back to filesystem paths, and how events get aged out so the table doesn’t grow forever. Those details I’m going to keep for the engine.

Why I Did Both at Once

Scan resilience without correlation is a scanner that finishes, tells you about 47 suspicious files, and leaves you to sort them yourself. Correlation without resilience is a scanner that produces beautiful evidence on the 12% of sites where it actually completes.

Both at once is a scanner that gets through an 80,000-file site on a weak host, surfaces the 3 findings that actually matter out of the 47 that pattern-match, and tells you which ones are being actively probed by attackers right now. That is the baseline a real security tool needs to hit before anyone should trust its output.

There is a Part Three coming, and it is the reason I wrote the first two parts now instead of waiting. The next release caches a hash and a verdict for every file we’ve ever scanned, so the second scan skips 99% of the files the first scan already looked at. On the 81,279-file site, that takes the scan from roughly two hours to under ten minutes. The math works because 99% of files never change between scans, and we can prove that without re-reading them. More on how in a few weeks.
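The skip decision, as described, can be sketched roughly like this. Using mtime plus size as the “provably unchanged without re-reading” check is my assumption, not a confirmed detail of the engine:

```javascript
// Sketch: per-file verdict cache. If a file's mtime and size match the
// cached entry, reuse the old verdict without re-reading the file.
// (Assumed mechanism; the real engine may check more than this.)
function shouldRescan(cache, path, stat) {
    const hit = cache.get(path);
    if (hit && hit.mtimeMs === stat.mtimeMs && hit.size === stat.size) {
        return { rescan: false, verdict: hit.verdict }; // unchanged: skip
    }
    return { rescan: true }; // new or modified: scan it and update the cache
}
```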

For now, if you’re running a large site on shared hosting and have ever watched a scanner lock up at 48,000 files and never come back, try the current release. It finishes the scan on real sites, surfaces what is being attacked, and remembers where it stopped when you close the tab.

Nova Scan Engineering

~ SephX, Nova Heaven. Still building the scanner for the sites that need it most.

© Nova Heaven. All rights reserved.