Exploring the Exponential Moving Average
The Exponentially Weighted Moving Average (EMA) is ubiquitous in signal processing, especially in the algorithmic trading scene. It serves to filter out noise and to compute an average/expected value of a signal, with recent samples carrying more weight in the calculation. Although it is a fundamental tool, the underlying theory can be confusing: its inner workings have many layers of understanding, and they are worth revisiting from time to time. This post attempts to make things clear regarding the theory and practical usage of the EMA. First, I introduce the most widespread (unadjusted) formula and reveal its weaknesses. After that, the adjusted approach is presented, which should be favored over the former. Finally, we examine the effective window size of the EMA via the so-called span formula.
1. Hunter’s EMA
The traditional and most commonly used EMA formula was conceived by Hunter (Hunter, 1986) for quality control purposes. In this regime, production is monitored by some well-crafted metric describing the effectiveness of the process or the quality of the final product. If the metric is close to a desired pre-defined threshold, the process is said to be under control. Conversely, if the metric deviates from the threshold, the system signals for manual intervention. Hunter recognized that an EMA would suit this problem due to its beneficial traits: it effectively filters observations, making the system resilient against outliers while still being flexible and responsive to new data. He framed the problem as predicting the next EMA estimate from the current observation in the following way:

$$
\hat{y}_{t+1} = \hat{y}_t + \lambda e_t = \lambda y_t + (1 - \lambda)\,\hat{y}_t,
$$

where $\hat{y}_{t+1}$ is the predicted new EMA, $\hat{y}_t$ is the old EMA, $y_t$ is the observed value at time $t$, $e_t = y_t - \hat{y}_t$ is the observed error at time $t$, and $\lambda$ is a constant determining the depth of memory of the EMA. The crucial question is how we initialize the recursive formula at time $t = 0$. The author provides an example that describes a typical initialization for the given use case:
Let $\lambda$ be given, and suppose at $t = 0$ a process thought to be under control has a target value $\hat{y}_0$. To initiate the EWMA, set the initial predicted value equal to the target value and …
This makes sense because the EMA is employed here to track the deviation from the target value; a minimal sketch of this scheme follows below.
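To make the idea concrete, here is a minimal Python sketch of this quality-control usage. The target value, noise level, and $\lambda$ are made up for illustration and are not taken from the original paper:

```python
import numpy as np

def control_ema(observations, target, lam=0.25):
    """EMA for process monitoring, initialized at the target value."""
    y_hat = target  # initialization: the predicted value starts at the target
    estimates = []
    for y in observations:
        y_hat = lam * y + (1 - lam) * y_hat  # Hunter's update rule
        estimates.append(y_hat)
    return np.array(estimates)

# A noisy metric hovering around a hypothetical target of 50
rng = np.random.default_rng(0)
metric = 50 + rng.normal(0, 1, size=100)
tracked = control_ema(metric, target=50)
```

Next, we will attempt to apply this method generally to a series of observations, beyond this specific use case.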
Let $x_0, x_1, \dots, x_n$ be an ordered series of observations. We modify the notation to the following:

$$
y_t = \alpha x_t + (1 - \alpha)\,y_{t-1}, \tag{1}
$$

where $y_t$ is the exponentially weighted moving average at time $t$, and $\alpha$ is the smoothing factor. The term $(1 - \alpha)$ decides the impact of the old EMA on the moving average. If $\alpha$ is small, then $(1 - \alpha)$ becomes large; consequently, $y_{t-1}$ has a large impact, while the new observation $x_t$ has a smaller influence on the moving average (large memory). If $\alpha$ were one, the EMA would be equal to the observations. If $\alpha$ were zero, the EMA would not incorporate new observations and would solely output the initial value. This is why we limit this parameter to the range $0 < \alpha \leq 1$.
CAUTION
It is important to note a confusing aspect of this topic: some use the EMA formula with an inverse parametrization interchangeably, frequently under the same name, and there is no consensus about it:

$$
y_t = (1 - \alpha)\,x_t + \alpha\,y_{t-1}. \tag{2}
$$

Here, we will employ the first formula, Eq. (1), to stick to the original source.
Let’s solve the difference equation by guessing the general formula, manually calculating and “unrolling” it for a few values:

$$
\begin{aligned}
y_1 &= \alpha x_1 + (1 - \alpha)\,y_0, \\
y_2 &= \alpha x_2 + (1 - \alpha)\,y_1 = \alpha x_2 + \alpha(1 - \alpha)\,x_1 + (1 - \alpha)^2\,y_0, \\
y_3 &= \alpha x_3 + (1 - \alpha)\,y_2 = \alpha x_3 + \alpha(1 - \alpha)\,x_2 + \alpha(1 - \alpha)^2\,x_1 + (1 - \alpha)^3\,y_0.
\end{aligned}
$$

From this we can deduce the general formula:

$$
y_t = \alpha \sum_{i=0}^{t-1} (1 - \alpha)^i\,x_{t-i} + (1 - \alpha)^t\,y_0 = \alpha \sum_{k=1}^{t} (1 - \alpha)^{t-k}\,x_k + (1 - \alpha)^t\,y_0, \tag{3}
$$

where $y_0$ is the initial condition, and both indexing alternatives are shown. Remember that in the quality control application, this was set to the target value (Hunter, 1986). However, in general, it is unknown. The only thing we know at $t = 0$ is the first observation $x_0$. Thus, we initialize with $y_0 = x_0$, which, however, introduces an initialization bias. Keep in mind that for large $t$, the initial term loses significance as $(1 - \alpha)^t \to 0$. Yet, for small $t$, the EMA might become inaccurate and biased due to the unknown initial condition. Consider this: $x_0$ might be an outlier, deviating significantly from the actual mean and skewing the estimate. We will examine this issue in detail later.
Regardless, if $t$ is large, we can approximate Eq. (3) the following way:

$$
y_t \approx \alpha \sum_{i=0}^{t-1} (1 - \alpha)^i\,x_{t-i} = \sum_{i=0}^{t-1} w_i\,x_{t-i},
$$

where $w_i = \alpha (1 - \alpha)^i$ represents the exponential weight of each sample. We can think of this formula as a scalar (dot) product of the weight vector and the time series vector:

$$
y_t \approx \mathbf{w}^\top \mathbf{x}, \quad
\mathbf{w} = \begin{bmatrix} \alpha \\ \alpha(1 - \alpha) \\ \alpha(1 - \alpha)^2 \\ \vdots \end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix} x_t \\ x_{t-1} \\ x_{t-2} \\ \vdots \end{bmatrix},
$$

which is the vectorized form for large $t$. We can also write the exact equation, Eq. (3), in this form, utilizing the initialization $y_0 = x_0$:

$$
y_t = \mathbf{w}^\top \mathbf{x}, \quad
\mathbf{w} = \begin{bmatrix} \alpha \\ \alpha(1 - \alpha) \\ \vdots \\ \alpha(1 - \alpha)^{t-1} \\ (1 - \alpha)^t \end{bmatrix}, \quad
\mathbf{x} = \begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_1 \\ x_0 \end{bmatrix},
$$

where again $w_i = \alpha(1 - \alpha)^i$, except for the last entry. Note that the initial term is weighted differently; we'll see that this is the weak point of the method. Before we address this issue, let’s plot the weight values as a function of the sample index for a fixed $\alpha$ and number of samples.
In the figure, we can see exponentially decreasing weights assigned to earlier samples – as desired. A necessary condition of a weighted average is that the weights must sum to one. Let's examine whether this holds for the EMA's weights, utilizing the summation formula for the finite geometric series (for more details, see Appendix A1):

$$
(1 - \alpha)^t + \alpha \sum_{i=0}^{t-1} (1 - \alpha)^i
= (1 - \alpha)^t + \alpha\,\frac{1 - (1 - \alpha)^t}{1 - (1 - \alpha)}
= (1 - \alpha)^t + 1 - (1 - \alpha)^t = 1.
$$

Do the weights sum to one asymptotically too, i.e., when $t \to \infty$? In this case the initial term approaches zero, thus we only have to inspect the sum:

$$
\alpha \sum_{i=0}^{\infty} (1 - \alpha)^i = \frac{\alpha}{1 - (1 - \alpha)} = \frac{\alpha}{\alpha} = 1,
$$

where we employed the geometric sum formula for infinitely many terms, $\sum_{i=0}^{\infty} r^i = \frac{1}{1 - r}$ for $|r| < 1$ (see the derivation in Appendix A1). We can conclude that the assigned weights are correct in the sense that they sum to one, both in the finite and the asymptotic case.
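A quick numeric check of the finite-case normalization (a minimal sketch; any $\alpha$ and $t$ should work):

```python
import numpy as np

alpha, t = 0.1, 50
weights = alpha * (1 - alpha) ** np.arange(t)  # regular weights w_i = alpha * (1 - alpha)^i
total = weights.sum() + (1 - alpha) ** t       # add the initial term's weight
print(np.isclose(total, 1.0))  # True
```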
So far everything seems correct, but if we visualize the weights for smaller $\alpha$ values – corresponding to “longer memory” EMAs – we get the following figure:

The first weight, assigned to the initial term, is highlighted in orange, along with the percentage of the initial weight relative to the total sum of all weights. We can see that the weight of the initial term starts “spiking” as we decrease $\alpha$ – violating the exponentially decaying weighting scheme. To put it differently, the initial term receives greater emphasis, which is an error. This behaviour emerges from the fact that the initial term's weight is not multiplied by $\alpha$; see Eq. (3). Although $(1 - \alpha)^t \to 0$ asymptotically, for a finite number of samples $(1 - \alpha)^t$ might be much bigger than the regular weights $\alpha(1 - \alpha)^i$ for a sufficiently small $\alpha$. This is the so-called initialization bias.
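We can quantify this effect directly by computing the initial term's share of the total weight for a few smoothing factors (a minimal sketch):

```python
import numpy as np

n = 100  # number of observations
for alpha in [0.5, 0.1, 0.02]:
    w_init = (1 - alpha) ** n                     # weight of the initial term
    w_rest = alpha * (1 - alpha) ** np.arange(n)  # regular exponential weights
    share = w_init / (w_init + w_rest.sum())      # equals w_init, since the total is 1
    print(f"alpha={alpha:5.2f} -> initial weight share: {share:.2%}")
```

For small $\alpha$, the initial term alone holds a double-digit percentage of the total weight.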
Before proceeding with our analysis, it is useful to introduce the concept and formula of the span, which we will examine in greater detail later. The span tells us the effective window size, or memory, associated with a given smoothing factor $\alpha$. It establishes how much historical data is taken into account. The relationship between the span and $\alpha$ is given by:

$$
\alpha = \frac{2}{s + 1}, \tag{4}
$$

where $s$ denotes the span, representing the number of recent observations that effectively contribute to the EMA. For example, in the first row of Fig. 2, the span is much smaller than the number of observations, and we can observe that the weights rapidly diminish beyond the span. The last row of the figure shows the case where the span equals the number of observations – and where the initialization term holds significant influence, accounting for a large portion of the total weight. When the span is half the number of observations, the influence of the initialization term is reduced but still notable. We can see that the initialization bias becomes more pronounced as the span size approaches the number of observations.
IMPORTANT
The initialization bias is prominent when the number of observations is close to or below the span size.
It would be useful to determine when the influence of the initial term becomes negligible. This defines a so-called warmup period for the EMA, during which estimates should be discarded due to initialization bias.
We can say that after $T$ observations, the initial term becomes negligible when

$$
(1 - \alpha)^T < \epsilon,
$$

where $\epsilon$ is a small threshold expressing the acceptable impact of the initial term (e.g., $\epsilon = 0.01$ for $1\%$ influence). By solving for $T$, we get:

$$
T > \frac{\ln \epsilon}{\ln (1 - \alpha)}.
$$

Next, we substitute the span formula, Eq. (4), into it and rearrange:

$$
T > \frac{\ln \epsilon}{\ln\!\left(1 - \frac{2}{s + 1}\right)} = \frac{\ln \epsilon}{\ln\!\left(\frac{s - 1}{s + 1}\right)}.
$$

This formula defines the warmup period for a given span $s$ and threshold $\epsilon$. If we plot this with $\epsilon = 0.01$ and normalize the warmup period by the given span size, we get the following figure:
which shows an increasing tendency for smaller spans and saturates around a value of roughly $2.3$ as the span grows. We can conclude that a warmup period of approximately $2.3$ times the span size is required to reduce the initialization term's weight to less than $1\%$ of the total influence. In practice, though, an even longer warmup period is often advised, with a general rule of thumb being 3-5 times the span.
IMPORTANT
Our findings suggest that a warmup period of approximately $2.3$ times the span is needed to reduce the impact of the initialization bias to approx. $1\%$.
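This warmup length is straightforward to compute in code (a minimal sketch, assuming the $\epsilon = 0.01$ threshold used above):

```python
import numpy as np

def warmup_period(span, eps=0.01):
    """Observations needed before the initial term's weight drops below eps
    (unadjusted EMA, span > 1)."""
    return int(np.ceil(np.log(eps) / np.log((span - 1) / (span + 1))))

for span in [5, 10, 50, 200]:
    print(span, warmup_period(span))  # roughly 2.3x the span in each case
```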
Consequently, for large span sizes, a significant amount of data would need to be discarded, as the EMA remains heavily corrupted by the initial term for up to three times the span length, leading to suboptimal results. Therefore, I suggest avoiding this method altogether in favor of using the adjusted formula. There is little reason to continue using this formula, as we will soon see, yet it remains ingrained in public consciousness. We will explore the adjusted formula in the next section.
2. The Adjusted Formula
Start with the general formula of a weighted moving average:

$$
y_t = \frac{\sum_{i=0}^{t} w_i\,x_{t-i}}{\sum_{i=0}^{t} w_i},
$$

which has a weighted sum in the numerator and the cumulative weights in the denominator; this normalization ensures that the effective weights sum to one, by definition. Note that in this formula we sum the observations starting at the latest and going back in time: $x_t, x_{t-1}, \dots, x_0$. If we substitute $w_i = (1 - \alpha)^i$ into the above equation, we get the adjusted EMA formula, which behaves better for small sample sizes:

$$
y_t = \frac{\sum_{i=0}^{t} (1 - \alpha)^i\,x_{t-i}}{\sum_{i=0}^{t} (1 - \alpha)^i}. \tag{5}
$$

The difference between Eq. (5) and Eq. (1) is that the former correctly deals with a series that has a finite history: the adjusted EMA provides the proper exponentially decaying weights even within the span period. If we consider an infinite series, then Eq. (5) returns the original difference equation, as the denominator contains a geometric series (see Appendix A1):

$$
y_t = \frac{\sum_{i=0}^{\infty} (1 - \alpha)^i\,x_{t-i}}{\sum_{i=0}^{\infty} (1 - \alpha)^i}
= \alpha \sum_{i=0}^{\infty} (1 - \alpha)^i\,x_{t-i},
$$

so in the asymptotic case we arrive back at the initial definition, Eq. (1) – meaning that the two methods converge and become equivalent for large sample sizes. We should mention that although the adjusted method improves early-stage accuracy by correctly weighting the observations, it still suffers from inaccuracies within the span due to the finite data history at the start. Writing the adjusted formula in vectorized form, we get:

$$
y_t = \frac{\mathbf{w}^\top \mathbf{x}}{\mathbf{w}^\top \mathbf{1}},
$$
where $\mathbf{w} = \left[1,\, (1 - \alpha),\, (1 - \alpha)^2,\, \dots,\, (1 - \alpha)^t\right]^\top$, $\mathbf{x} = \left[x_t,\, x_{t-1},\, \dots,\, x_0\right]^\top$, and $\mathbf{w}^\top \mathbf{1}$ is the cumulative sum of the weights up to time $t$. If we produce the plot equivalent to Fig. 2 for the new method, we get the following:
We can see that now the weights perfectly follow the exponentially decreasing scheme even for cases when the span size approaches the number of observations.
As the final blow to the old method, we will show that the adjusted method can also be written recursively. This means that it can be implemented efficiently and is suitable for online processing, just like the traditional method. If we inspect Eq. (5), we may realize that both the numerator and the denominator can be written in recursive form:

$$
y_t = \frac{x_t + (1 - \alpha) \left[ \sum_{i=0}^{t-1} (1 - \alpha)^i\,x_{t-1-i} \right]}{1 + (1 - \alpha) \left[ \sum_{i=0}^{t-1} (1 - \alpha)^i \right]},
$$

where we can see the recursion within the square brackets. Consequently, we can write the following:

$$
y_t = \frac{u_t}{v_t}, \quad u_t = x_t + (1 - \alpha)\,u_{t-1}, \quad v_t = 1 + (1 - \alpha)\,v_{t-1}, \quad u_0 = x_0, \; v_0 = 1,
$$

where $u_t$ denotes the numerator and $v_t$ the denominator. A didactic Python implementation of both the traditional and the adjusted method would look as follows:
```python
import numpy as np

def ema(x, alpha):
    n = len(x)
    y = np.zeros(n)  # the output EMA container
    y_t = x[0]       # initialization: y_0 = x_0
    y[0] = y_t
    for t in range(1, n):
        y_t = alpha * x[t] + (1 - alpha) * y_t
        y[t] = y_t
    return y

def ema_adjusted(x, alpha):
    n = len(x)
    y = np.zeros(n)  # the output EMA container
    u_t = x[0]       # numerator:   u_0 = x_0
    v_t = 1.0        # denominator: v_0 = 1
    y[0] = u_t / v_t
    for t in range(1, n):
        u_t = x[t] + (1 - alpha) * u_t
        v_t = 1 + (1 - alpha) * v_t
        y[t] = u_t / v_t
    return y
```
Quite straightforward, isn’t it? You can’t really argue that the unadjusted EMA is significantly simpler, more elegant, or faster. It’s time to move on from it. The adjusted EMA calculates the weighted sum `u_t` and the cumulative weights `v_t`, then simply divides the two. Note that the EMA implementation of Pandas employs the adjusted formula by default (`adjust=True`).
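As a sanity check, we can compare both implementations above against Pandas (a sketch; the test series, seed, and $\alpha$ are arbitrary):

```python
import numpy as np
import pandas as pd

x = np.random.default_rng(42).normal(size=500)
alpha = 0.05

# adjust=True matches the adjusted formula, adjust=False the traditional one
assert np.allclose(ema_adjusted(x, alpha), pd.Series(x).ewm(alpha=alpha, adjust=True).mean())
assert np.allclose(ema(x, alpha), pd.Series(x).ewm(alpha=alpha, adjust=False).mean())
```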
3. The Span
It would be useful if we could determine the effective number of recent samples contributing to the EMA – similar to the simple moving average (SMA), where the moving window parameter explicitly determines this.
The span is defined as the equivalent window length of a simple moving average (SMA) that would produce approximately the same smoothing as the EMA with a given $\alpha$. The rule of thumb is that the span $s$ can be related to $\alpha$ by the formula we already mentioned, Eq. (4):

$$
\alpha = \frac{2}{s + 1}.
$$

If we plot the SMA of a series with window size $10$ and the corresponding EMA with span size $10$, we get a similar level of smoothing:
```python
import numpy as np
import pandas as pd

# e.g., a random-walk test series (any noisy series works)
random_series = pd.Series(np.random.default_rng(0).normal(size=1000).cumsum())

y_sma = random_series.rolling(window=10).mean()
y_ema = random_series.ewm(span=10).mean()  # adjust=False would compute the recursive formula instead
```
A more precise definition of the span is that, for a linearly increasing time series, the EMA will produce the same output as the SMA (after a warmup):
```python
import numpy as np
import pandas as pd

# For a linear series, the EMA and SMA results are the same after a warmup
linear_series = pd.Series(np.arange(1000))
y_sma = linear_series.rolling(window=10).mean()
y_ema = linear_series.ewm(span=10).mean()

warmup = 10 * 6
print(np.allclose(y_sma[warmup:], y_ema[warmup:]))
# > True
```

This code prints `True`.
These examples show that the formula works, but a more quantitative derivation would foster a deeper understanding. Strangely, I’ve not found much information on this online, so it will be helpful to put it out there.
To approach the problem, we introduce a concept that will help us derive the span: the center of mass (CoM). In physics, the center of mass is the point where the entire mass of an object (or system of objects) can be considered to be concentrated when analyzing equilibrium and motion. Similarly, in an EMA, we can think of the weights assigned to past data points as having “mass.” In this context, the center of mass represents the “balance point” of these weights over time, helping us understand how much influence older data points have on the current average. In case of an equally spaced time series, the CoM can be defined as follows:

$$
\mathrm{CoM} = \frac{\sum_{i=0}^{\infty} i\,w_i}{\sum_{i=0}^{\infty} w_i} = \sum_{i=0}^{\infty} i\,w_i, \tag{6}
$$

where we exploited the fact that the weights of a weighted average sum to one. We saw that the adjusted formula, Eq. (5), and the traditional formula, Eq. (1), are asymptotically equivalent, so here we can employ the asymptotic weight formula as we calculate the infinite sum of the weights:

$$
w_i = \alpha (1 - \alpha)^i.
$$

Thus, substituting into Eq. (6), we get:

$$
\mathrm{CoM}_{\mathrm{EMA}} = \alpha \sum_{i=0}^{\infty} i\,(1 - \alpha)^i, \tag{7}
$$

where we can utilize yet another summation formula (see the derivation in Appendix A2):

$$
\sum_{i=0}^{\infty} i\,r^i = \frac{r}{(1 - r)^2}, \quad |r| < 1.
$$

Consequently, Eq. (7) renders to:

$$
\mathrm{CoM}_{\mathrm{EMA}} = \alpha\,\frac{1 - \alpha}{\left(1 - (1 - \alpha)\right)^2} = \alpha\,\frac{1 - \alpha}{\alpha^2} = \frac{1 - \alpha}{\alpha}.
$$
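We can verify this numerically by computing the weighted mean of the indices under truncated exponential weights (a minimal sketch):

```python
import numpy as np

alpha = 0.2
i = np.arange(10_000)           # truncation is fine: the weights decay fast
w = alpha * (1 - alpha) ** i
print(np.sum(i * w))            # ~4.0
print((1 - alpha) / alpha)      # 4.0, the closed-form CoM
```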
Now we know the EMA's center of mass. Remember that we defined the span as the window length of an SMA with a similar level of smoothing. By calculating the SMA's center of mass in the same way, we can make the connection between the $\alpha$ parameter of the EMA and the window size parameter of the SMA.
For an SMA of window size $w$, the weights are uniform:

$$
w_i = \frac{1}{w}, \quad i = 0, 1, \dots, w - 1.
$$

So the CoM of the SMA becomes:

$$
\mathrm{CoM}_{\mathrm{SMA}} = \sum_{i=0}^{w-1} i\,\frac{1}{w} = \frac{1}{w}\cdot\frac{(w - 1)\,w}{2} = \frac{w - 1}{2},
$$

where we utilized the well-known sum of an arithmetic progression (see the derivation in Appendix A3). We can also guess this intuitively: for a uniform weight distribution, the CoM has to be in the middle. For example, the CoM for a window of size $5$, indexed $0, \dots, 4$, has to be in the middle at $2$ – which is exactly what the formula says.
Finally, we can form the relationship between them by requiring their CoMs to be equal:

$$
\frac{1 - \alpha}{\alpha} = \frac{w - 1}{2}.
$$

Let’s express both the window size and the $\alpha$ parameter from the above equation, also renaming the window size $w$ to $s$, switching to span terminology from now on:

$$
s = \frac{2}{\alpha} - 1 \quad \Longleftrightarrow \quad \alpha = \frac{2}{s + 1}.
$$

Now we can understand the span formula, Eq. (4): it connects the SMA and the EMA by mapping the SMA’s window size parameter to an equivalent EMA $\alpha$ parameter.
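These mappings are trivial to encode; the helper names below are hypothetical, but the formulas are exactly the ones derived above:

```python
def alpha_from_span(span: float) -> float:
    """Smoothing factor equivalent to an SMA window of `span`."""
    return 2.0 / (span + 1.0)

def span_from_alpha(alpha: float) -> float:
    """Effective SMA window equivalent to a given smoothing factor."""
    return 2.0 / alpha - 1.0

print(alpha_from_span(10))      # ~0.1818
print(span_from_alpha(0.1818))  # ~10.0
```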
4. Summary
We began with a historical overview of the traditional formula, followed by a detailed examination of its primary weakness – the initialization bias. To address this, we introduced the adjusted method, which effectively eliminates the bias. Finally, we derived the span formula, providing a more intuitive way to parametrize the EMA using an effective window size.
Appendix
A1. Geometric Series
A1.1 Finite Case
The finite geometric series is of the form:

$$
S_n = \sum_{i=0}^{n-1} r^i = 1 + r + r^2 + \dots + r^{n-1}.
$$

Multiply the sum by the common ratio $r$:

$$
r\,S_n = r + r^2 + \dots + r^n.
$$

Now subtract $r\,S_n$ from $S_n$:

$$
S_n - r\,S_n = \left(1 + r + \dots + r^{n-1}\right) - \left(r + r^2 + \dots + r^n\right).
$$

On the right-hand side, all terms cancel except for the first and last:

$$
S_n\,(1 - r) = 1 - r^n.
$$

Now, solve for $S_n$:

$$
S_n = \frac{1 - r^n}{1 - r}, \quad r \neq 1.
$$
A1.2 Infinite Case
Now, let’s consider the infinite geometric series, where the number of terms goes to infinity:

$$
S_\infty = \sum_{i=0}^{\infty} r^i.
$$

We could solve this with the same trick applied in the finite case (see A1.1), but instead, we will examine the finite geometric series formula asymptotically (for brevity). For an infinite geometric series, we assume that $|r| < 1$, so that $r^n \to 0$ as $n \to \infty$. Thus:

$$
S_\infty = \lim_{n \to \infty} \frac{1 - r^n}{1 - r} = \frac{1}{1 - r}.
$$

This is the sum of an infinite geometric series when $|r| < 1$.
A2. Another Summation Formula
Derivation of the formula:

$$
\sum_{i=0}^{\infty} i\,r^i = \frac{r}{(1 - r)^2}, \quad |r| < 1.
$$

We know that the sum of an infinite geometric series is (see A1.2):

$$
\sum_{i=0}^{\infty} r^i = \frac{1}{1 - r}.
$$

To get the sum involving $i$, we differentiate the geometric series sum with respect to $r$. Differentiating both sides:

$$
\sum_{i=1}^{\infty} i\,r^{i-1} = \frac{1}{(1 - r)^2}.
$$

To match the original series, multiply both sides of the equation by $r$:

$$
\sum_{i=1}^{\infty} i\,r^i = \frac{r}{(1 - r)^2}.
$$

Since the $i = 0$ term is zero, we can extend the sum to start from $i = 0$ without changing the result:

$$
\sum_{i=0}^{\infty} i\,r^i = \frac{r}{(1 - r)^2}.
$$
A3. Sum of Arithmetic Progression
Consider the sum of the first $n$ non-negative integers:

$$
S = 0 + 1 + 2 + \dots + (n - 1).
$$

We can also write the sum in reverse order:

$$
S = (n - 1) + (n - 2) + \dots + 1 + 0.
$$

Now, add both versions of the sum term by term:

$$
2S = (n - 1) + (n - 1) + \dots + (n - 1).
$$

Each pair sums to $n - 1$, and there are $n$ terms (including $0$), so:

$$
2S = n\,(n - 1),
$$

consequently:

$$
S = \frac{n\,(n - 1)}{2}.
$$
References
- Hunter, J. S. (1986). The Exponentially Weighted Moving Average. Journal of Quality Technology, 18(4), 203–210.
- https://math.stackexchange.com/questions/2664601/deriving-weight-formula-for-exponential-moving-average
- https://stats.stackexchange.com/questions/619558/switching-between-alpha-half-life-and-span-in-exponential-moving-average
- https://tedboy.github.io/pandas/computation/computation5.html
- https://stats.stackexchange.com/questions/534210/what-does-span-mean-in-exponential-moving-average
- https://gregorygundersen.com/blog/2022/06/04/moving-averages/