Hsu, W. H. (1992). Generic Virus Detection. MacTech: The Journal of Macintosh Technology (formerly MacTutor: The Macintosh Programming Journal), 8(2).
Recently, computer viruses have attracted a high volume of public, media, and scientific attention. This is no surprise given the explosion in the rate at which new virus code is developed, combined with the limited success of current detection methods. A new approach to detecting such code is needed. We're going to look at two variations on algorithms originally developed for string matching. Both modifications allow increased tolerance for variant “strains” of known viruses, especially the so-called evolutionary class, which mutate themselves at predetermined intervals.
First, a randomization scheme can be applied to an established fast substring matching procedure (such as the Boyer-Moore algorithm); this randomization allows mutation-resistant searching. Second, an approximate pattern matching algorithm for a maximum number of differences can be used, modified by weighting the edit-distance metric to make it robust to character padding and removal. The two procedures are combined to create a generalized detector capable of finding viral clones, whether produced by human authors or by automatic variation.
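The second technique can be illustrated with a Sellers-style approximate substring scan whose edit operations carry different weights. This is a minimal sketch, not the article's actual implementation: the function name and the weights (substitution 1.0, padding/removal 0.5) are illustrative choices to show how cheap insertions and deletions tolerate byte padding in a mutated signature.

```python
def approx_scan(signature, data, max_cost, w_sub=1.0, w_indel=0.5):
    """Report end positions in `data` where `signature` occurs with
    total weighted edit cost <= max_cost.

    The first DP row is held at zero, so a match may begin at any
    offset in `data` (Sellers-style approximate matching).  Indels
    are weighted below substitutions, per the article's idea of
    robustness to character padding and removal.
    """
    m = len(signature)
    # col[i] = best cost of matching signature[:i] ending at the
    # current data position; initialized to pure-deletion costs.
    col = [i * w_indel for i in range(m + 1)]
    hits = []
    for j, c in enumerate(data):
        prev_diag = col[0]
        col[0] = 0.0  # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = min(
                prev_diag + (0.0 if signature[i - 1] == c else w_sub),
                col[i - 1] + w_indel,   # drop a signature byte (removal)
                col[i] + w_indel,       # skip a text byte (padding)
            )
            prev_diag, col[i] = col[i], cur
        if col[m] <= max_cost:
            hits.append(j)  # a near-match of signature ends at data[j]
    return hits
```

With `max_cost = 0` this degenerates to exact matching; raising the threshold admits padded or clipped variants of the signature at a cost proportional to how far they have drifted.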
Edit, 09:30 CST - That's not a refereed paper, which is why it doesn't appear on my publications list, but I could probably have gotten it into a security conference if I'd included some of the experiments with the real viruses of the day (nVIR strains). I was 16 when I did the project, 17 when I wrote it, and 18 when it was published, so I wasn't yet clear on what the standards of academic peer review were.
Edit, 10:35 CST - The subject is still timely, and I believe the methodology may be as well, albeit in limited cases. I've learned some things about string matching, randomized algorithms, and cryptography since then. The question in this case is whether string-matching approaches to virus detection still have any relevance compared to cryptographic checksums and security-based techniques such as managing information flow and formally verifying and certifying properties.
I'm a strong believer in defense-in-depth, so insofar as formal "integrity shells" and secure kernels are not in place on every desktop, I think string-matching scans can provide a first line of defense against viruses whose signatures have not yet been broadcast (via Symantec AntiVirus LiveUpdate's "Avenge Microdefs" and McAfee's Anti-Virus). There's a critical window before a CERT team dissects the virus and people "get the fix".
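For reference, the kind of scan I mean builds on the Boyer-Moore family the article started from. Here is a plain Boyer-Moore-Horspool search over bytes, without the article's randomization layer (the signature and data in the usage note are made up for illustration):

```python
def horspool_find(signature: bytes, data: bytes) -> int:
    """Boyer-Moore-Horspool search: return the index of the first
    occurrence of `signature` in `data`, or -1 if absent.

    The bad-character table records, for each byte in the signature
    (except its last), how far the window may shift when that byte
    is the last one under the window.
    """
    m, n = len(signature), len(data)
    if m == 0:
        return 0
    if m > n:
        return -1
    shift = {signature[i]: m - 1 - i for i in range(m - 1)}
    j = 0
    while j <= n - m:
        if data[j:j + m] == signature:
            return j
        # Shift by the table entry for the last byte in the window,
        # or by the full signature length if that byte never occurs.
        j += shift.get(data[j + m - 1], m)
    return -1
```

For example, `horspool_find(b"abc", b"xxabcxx")` returns 2. The skip table is what makes such scanners fast enough to sweep a whole disk, and it is also what a randomized probe order would perturb to resist viruses engineered around a fixed scan pattern.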
Suppose every version of (say) Windows Vista had a comprehensive checksum-based kernel component that ran as a background process or thread and could be enabled from the Control Panel, the way privacy options are set in Internet Explorer's Tools -> Internet Options. In that case, string-matching would be largely unnecessary and disk scanning could be based on signatures, leaving only the evolutionary viruses to contend with.
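The checksum idea itself is simple enough to sketch at user level. This is a toy illustration, not a kernel component: the helper names are hypothetical, and SHA-256 stands in for whatever digest a real integrity shell would use.

```python
import hashlib

def digest(path, chunk=65536):
    """SHA-256 of a file, read in chunks to handle large binaries."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def baseline(paths):
    """Record trusted digests for a set of files (the 'known-good' state)."""
    return {p: digest(p) for p in paths}

def verify(paths, baseline_db):
    """Return the files whose current digest no longer matches the
    baseline -- i.e., candidates for viral modification."""
    return [p for p in paths if digest(p) != baseline_db.get(p)]
```

A background thread re-running `verify` over executables would flag any modification, known virus or not, which is why checksumming catches what signature scanning misses; its blind spot is the window between infection and the baseline being taken.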
The problem is that, like IE's privacy options for cookies and the like, people generally either leave such a setting at the default (medium-high?) or drop it to the most convenient level once they run into hassles. So even if Vista had a good trust model, people probably wouldn't use it much.