Epidemiology and computer viruses.
It has been suggested in the press that computer viruses spread at an
exponential rate; figures implying a doubling every two or three
months have been quoted. These figures tend to be arrived at by
fitting a simple exponential curve to two points: one is a rather
arbitrary point a few years ago, when it is supposed that only one
copy of one virus existed, and the other is an estimate of the
current position.
Statisticians are well aware of the danger of curve-fitting and
extrapolation from two (rather shaky) numbers; furthermore the
experience with biological viruses does not suggest a simple
exponential curve. There is a well-researched model for
epidemiological studies, and it has a strong justification.
First, let us consider the factors affecting the probability that any
given computer will be infected by a given virus. There are three
main influences on this probability.
1. The percentage of currently-infected individuals.
2. The readiness with which the virus under consideration can
replicate (called infectivity).
3. The degree to which the machine in question has contact with the
population of computers.
The percentage of currently-infected individuals depends on two
factors:
4. The rate at which computers are becoming infected.
5. The length of time that they stay infected.
These factors vary. Let us define them precisely. The first analysis
will assume that there is only one virus - later this will be
generalised.
The first variable, p, is the fraction of PCs that are infected with
the virus. Let us also define I as the probability that a PC will
become infected by "exposure" to another infected PC. Finally, D is
the probability that the virus is detected. We shall assume that once
the virus is detected, it is eradicated from that system.
The rate of new infections is proportional to the number of infected
PCs, to the number of uninfected PCs and to the probability of
infection. The rate of infections being eradicated is proportional to
the number of infected PCs, and to the probability of detection.
dp/dt = p.(1-p).I - p.D (1)
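Equation (1) can also be explored numerically. The following is a
minimal sketch, not the method used for the figures in this paper: it
steps the equation forward with simple Euler steps. The starting
fraction p0, the step size dt and the number of steps are illustrative
assumptions, as is the function name.

```python
def simulate(I, D, p0=0.001, dt=0.1, steps=20000):
    """Step dp/dt = p.(1-p).I - p.D forward with Euler steps.

    p0, dt and steps are assumed values chosen for illustration;
    they do not come from the paper.
    """
    p = p0
    history = [p]
    for _ in range(steps):
        dp_dt = p * (1.0 - p) * I - p * D  # right-hand side of (1)
        p += dt * dp_dt
        history.append(p)
    return history


# One run with detection less likely than infection, one with the reverse.
spreading = simulate(I=0.04, D=0.01)
dying_out = simulate(I=0.01, D=0.04)
print(spreading[-1], dying_out[-1])
```

With a small enough step the last value of each run settles at the
long-run level of infection for those values of I and D.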
Some interesting consequences of this model are as follows. First,
consider the situation of equilibrium, so that dp/dt = 0. If we plug
this into equation 1, we get p = 0 (no infections, so no spreading) or
p = 1 - D/I. If D is greater than I, the second solution is negative
and so cannot be reached, and p tends to zero: a virus will die out if
it is more likely to be detected than to cause a new infection. We
will call the non-zero equilibrium pmax = 1 - D/I.
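As a small worked example of the equilibrium condition, the sketch
below returns 1 - D/I when the virus can sustain itself and zero
otherwise. The function name is hypothetical, and the D values in the
loop are chosen only for illustration.

```python
def equilibrium_fraction(I, D):
    """Long-run infected fraction: 1 - D/I if D < I, otherwise 0."""
    if I <= 0:
        raise ValueError("I must be positive")
    # When D >= I the expression 1 - D/I is zero or negative,
    # which corresponds to the virus dying out.
    return max(0.0, 1.0 - D / I)


for D in (0.01, 0.02, 0.03, 0.05):
    print(f"I=0.04 D={D}: equilibrium = {equilibrium_fraction(0.04, D):.3f}")
```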
This is, we think, what has happened to the Brain virus. The
probability of detection is very large, as it announces itself on
every infected diskette by the volume label (c) Brain. Before people
knew that this meant a virus, it could spread unhindered, but now that
so many PC users are aware of the meaning of such a volume label, an
outbreak is rapidly eradicated. We now get very few reports of Brain
from most countries (India is an exception, but this is perhaps
because virus awareness is a relatively recent thing there).
If the probability of detection is 0.1 times the probability of
infection, then p tends to 1-D/I = 0.9. So a virus that is well
hidden will be most successful.
To find how p changes with time, we expand equation 1 and separate
the variables:
dp/dt = p.I - p.p.I - p.D (2)
dt = dp/(p.(I-D-I.p)) (3)
We integrate by partial fractions:
1 /(p.(I-D-I.p)) = A/p + B/(I-D-I.p) (4)
= (A.(I-D-I.p) + B.p) / (p.(I-D-I.p)) (5)
The numerator of (5) must equal 1 for every p, which gives
A.(I-D) = 1 and B - A.I = 0, so
A = 1/(I-D)
B = A.I = I/(I-D)
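The partial-fraction split can be checked with exact arithmetic. This
is only a sanity check under assumed sample values (I = 0.04, D = 0.01
and a handful of p values); the helper names are hypothetical.

```python
from fractions import Fraction


def left_side(p, I, D):
    """1 / (p.(I-D-I.p)), the left-hand side of (4)."""
    return 1 / (p * (I - D - I * p))


def right_side(p, I, D):
    """A/p + B/(I-D-I.p) with A = 1/(I-D) and B = I/(I-D)."""
    A = 1 / (I - D)
    B = I / (I - D)
    return A / p + B / (I - D - I * p)


I = Fraction(4, 100)
D = Fraction(1, 100)
samples = [Fraction(k, 10) for k in range(1, 10)]  # p = 0.1 .. 0.9
mismatches = [p for p in samples
              if left_side(p, I, D) != right_side(p, I, D)]
print("mismatches:", mismatches)
```

Using Fraction rather than floating point means the two sides are
compared exactly, so an empty list confirms the values of A and B for
these samples.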
Plugging these back into the differential equation, we get:
dt = (1/(I-D)).dp/p + (1/(I-D)).dp/(1-D/I-p) (6)
Using our definition of pmax = 1 - D/I, so that I-D = I.pmax,
I.dt = (1/pmax).(dp/p) + (1/pmax).(dp/(pmax-p)) (7)
Now we integrate, writing the constant of integration as t0:
pmax.I.(t - t0) = ln p - ln (pmax - p) (8)
= ln(p/(pmax-p))
p/(pmax-p) = exp(pmax.I.(t-t0)) (9)
This tells us what t0 means: at t = t0, p/(pmax-p) = 1, so
p = 0.5 pmax. In other words, t0 is the time at which the infection
reaches half of its equilibrium level.
From (9), we can get
p = pmax.exp(pmax.I.(t-t0)) - p.exp(pmax.I.(t-t0)) (10)
p . (1 + exp(pmax.I.(t-t0)) ) = pmax . exp(pmax.I.(t-t0)) (11)
p = pmax . exp(pmax.I.(t-t0)) / (1 + exp(pmax.I.(t-t0)) ) (12)
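Equation (12) is straightforward to evaluate directly. The sketch
below is one way to do so, assuming 0 <= D < I so that pmax is
positive; the function name and the default t0 = 0 are illustrative
choices rather than anything fixed by the derivation.

```python
import math


def infected_fraction(t, I, D, t0=0.0):
    """Evaluate equation (12) for the single-virus model."""
    if not 0 <= D < I:
        raise ValueError("this closed form assumes 0 <= D < I")
    pmax = 1.0 - D / I
    x = pmax * I * (t - t0)
    if x >= 0:
        # Equation (12) with top and bottom divided by exp(x);
        # this avoids overflow when t is large.
        return pmax / (1.0 + math.exp(-x))
    growth = math.exp(x)
    return pmax * growth / (1.0 + growth)


for t in (-100, 0, 100, 1000):
    print(t, round(infected_fraction(t, I=0.04, D=0.01), 4))
```

The curve starts near zero, passes through half of pmax at t = t0 and
levels off at pmax.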
Below, figures 1 to 3 show the proportion of computers infected for I
= 0.04, with D between 0.01 and 0.03. Figures 4 to 9 show the curves
for I = 0.1, with D values between 0.01 and 0.09.
In the more complex case of multiple viruses, it is necessary to use
matrix algebra to track the infections of each virus, but the main
interaction between the viruses is that when any one virus is
discovered on a system, you must assume that all viruses on that
system will be removed. This means that there is a weak interaction
between the spread of the different viruses, but since multiple
infections are still not common, this is an effect that can be ignored
for the present.
A more important effect of multiple viruses is on the probability of
detection. It is our experience that most people take no precautions,
or else ineffective ones, against viruses until they experience one
directly. At that point, they begin to take the problem more
seriously, and install one or more anti-virus systems. This has a
significant impact on the probability of detection, especially if the
anti-virus system is effective. Even if the anti-virus system is only
partially effective, there must be some viruses that it can detect, so
the value of D increases.
Thus, if one virus has spread and been detected fairly widely, the
chances of another virus spreading so widely are severely diminished.
We can add this to the model above, as follows. Assume that the
probability of detecting a virus was D1 before, but as each computer
is infected, some virus detection software is installed. Assume that
the software is not perfect, and that the probability of detecting the
virus is changed from D1 to D2. Then, the average probability of
detecting a virus, averaged over the whole population of computers, is
increased as more of the computers have had contact with a virus.
But, in our experience, although some precautions are taken shortly
after a virus outbreak, it is often the case that these precautions
fail to identify another virus for some time (outbreaks being
relatively rare) and so the precautions fall into disuse. So, we can
model D as being related to the number of recent outbreaks; the
probability of detection is partway between the low probability (a
computer not running a virus detector), and the higher probability (a
computer that is running a detector), and the average probability is
the average of these two, weighted by the current number of infected
and uninfected computers.
D = p.D2 + (1-p).D1
So the differential equation describing the virus spread becomes:
dp/dt = p.(1-p).I - p.p.D2 - p.(1-p).D1 (13)
= p.(I-D1) - p.p.(I+D2-D1)
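The step from the weighted detection probability to the first line of
(13) can be checked numerically. The sketch below substitutes
D = p.D2 + (1-p).D1 into equation (1) and compares the result with the
first line of (13) over a grid of p values. The function names and the
sample values of I, D1 and D2 are assumptions made for illustration.

```python
def average_detection(p, D1, D2):
    """Population-average detection probability: p.D2 + (1-p).D1."""
    return p * D2 + (1.0 - p) * D1


def rate_single_virus(p, I, D):
    """Right-hand side of equation (1)."""
    return p * (1.0 - p) * I - p * D


def rate_with_detectors(p, I, D1, D2):
    """First line of equation (13)."""
    return p * (1.0 - p) * I - p * p * D2 - p * (1.0 - p) * D1


I, D1, D2 = 0.04, 0.01, 0.5  # sample values, assumed for this check
largest_gap = max(
    abs(rate_with_detectors(k / 20, I, D1, D2)
        - rate_single_virus(k / 20, I, average_detection(k / 20, D1, D2)))
    for k in range(21)
)
print("largest difference:", largest_gap)
```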
Again it is useful to look at the equilibrium condition. When dp/dt =
0, either p = 0 or p = (I-D1)/(I+D2-D1). Call this value pmax, as
before. If D1 = D2, this reduces to the same situation as before.
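The equilibrium of the extended model can be written as a small
function. This is a sketch assuming D2 is not negative; the function
name is hypothetical. The last lines illustrate the remark that
setting D1 = D2 gives back the earlier value of 1 - D/I.

```python
def equilibrium_with_detectors(I, D1, D2):
    """Non-zero equilibrium (I-D1)/(I+D2-D1), or 0 if the virus dies out."""
    if I <= D1:
        # Detection already outpaces infection before any detector
        # is installed, so the infected fraction falls to zero.
        return 0.0
    return (I - D1) / (I + D2 - D1)


same_detector = equilibrium_with_detectors(0.04, 0.01, 0.01)
earlier_model = 1.0 - 0.01 / 0.04
print(same_detector, earlier_model)
```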
dt = dp/(p.(I-D1-p.(I+D2-D1))) (14)
Using partial fractions again, we get
dt = (1/(I-D1)).(dp/p) + (1/(I-D1)).(dp/(pmax-p)) (15)
t - t0 = (1/(I-D1)).ln(p) - (1/(I-D1)).ln(pmax-p) (16)
(I-D1).(t-t0) = ln(p/(pmax-p)) (17)
p/(pmax-p) = exp((I-D1).(t-t0)) (18)
p = pmax.exp((I-D1).(t-t0)) - p.exp((I-D1).(t-t0)) (19)
p = pmax.exp((I-D1).(t-t0)) / (1 + exp((I-D1).(t-t0))) (20)
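Equation (20) can be evaluated in the same way as the single-virus
solution. The sketch below assumes 0 <= D1 < I and D2 >= 0; the
function names are hypothetical. It also computes the time taken to
climb from 10% to 90% of pmax, which follows from (18): p/(pmax-p)
goes from 1/9 to 9, so the exponent goes from -ln 9 to +ln 9.

```python
import math


def infected_fraction_2(t, I, D1, D2, t0=0.0):
    """Evaluate equation (20) for the model with detectors."""
    if not (0 <= D1 < I and D2 >= 0):
        raise ValueError("this closed form assumes 0 <= D1 < I and D2 >= 0")
    pmax = (I - D1) / (I + D2 - D1)
    x = (I - D1) * (t - t0)
    if x >= 0:
        # Same expression divided through by exp(x), to avoid overflow.
        return pmax / (1.0 + math.exp(-x))
    growth = math.exp(x)
    return pmax * growth / (1.0 + growth)


def rise_time(I, D1):
    """Time for p to go from 10% to 90% of pmax: 2.ln(9)/(I-D1)."""
    return 2.0 * math.log(9.0) / (I - D1)


print(infected_fraction_2(0.0, 0.04, 0.01, 0.5), rise_time(0.04, 0.01))
```

Note that D2 appears only in pmax and not in the exponent, so in this
model a better detector lowers the ceiling of the outbreak without
changing how quickly that ceiling is approached.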
We have repeated the runs of the previous model, using I = 0.04 and D1
= 0.01, with values for D2 set to 0.01 (the previous case), 0.1 and
0.5. The runs show how improved early detection reduces the number
of infections. This can be seen more dramatically with the other set
of runs, where I = 0.1, D1 = 0.01 and D2 takes values from 0.01 to
0.9.
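For readers without the figures, the equilibrium level for each run
can be tabulated from pmax = (I-D1)/(I+D2-D1). This sketch is not a
reproduction of the spreadsheet runs: it gives only the level each
curve settles at. For the second set only the two end values of D2
stated above are used, since the intermediate values are not listed.

```python
def equilibrium_with_detectors(I, D1, D2):
    """Non-zero equilibrium (I-D1)/(I+D2-D1), or 0 if I <= D1."""
    if I <= D1:
        return 0.0
    return (I - D1) / (I + D2 - D1)


# (I, D1, D2) for each run mentioned in the text.
runs = [
    (0.04, 0.01, 0.01),
    (0.04, 0.01, 0.1),
    (0.04, 0.01, 0.5),
    (0.1, 0.01, 0.01),
    (0.1, 0.01, 0.9),
]
results = {run: equilibrium_with_detectors(*run) for run in runs}
for (I, D1, D2), pmax in results.items():
    print(f"I={I:<5} D1={D1:<5} D2={D2:<5} pmax={pmax:.3f}")
```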
Conclusions
Early detection is a very effective way to reduce the incidence of
viruses in a population of computers. Reducing the probability of
infection would also be useful, but this requires controls over the
flow of diskettes and files between the computers, and one of the
major advantages of computers is their ability to communicate
information. Obviously, the value of I can be reduced, but it is
clear from this model that the probability of detection plays a much
more crucial role in the spread of a virus. It is also noteworthy
that in a situation where early detection software is installed and
run for only a short time (as in these models) the number of computers
infected is dramatically reduced. If the virus detector is run for a
longer period of time after the virus infection, the effect would be
even greater.
One surprising outcome of this model is that the virus detection
software does not have to be particularly effective. For example, we
found that if you raise the probability of detection from 0.01 to 0.5,
you get most of the benefits of running a detector. In practice, this
means that it takes you a bit longer to detect the virus than it would
with an efficient detector, but nowhere near as long as it would if
you were not running any kind of detector.
The Lotus spreadsheets that were used to do the calculations for the
various runs of the model are available if you want to try plugging
some other assumptions into the model, or to make the model more
elaborate.
Copyright (c) Alan Solomon, 1990. This paper may not be reproduced
in any form without written permission.
Dr Alan Solomon Day voice: +44 442 877877
Secure Computing Lab Eve voice: +44 494 724201
S & S International Fax: +44 442 877882
Mill Street, BBS: +44 494 724946
Berkhampsted, Fido node: 254/29
Herts, HP4 2HB Internet: drsolly@ibmpcug.co.uk
England Gold: 83:JNL246
CIX, CONNECT drsolly