Moving windows

Calculating a statistic or processing data in a moving window has many applications. What is a “moving” or “sliding” window in the context of data analysis ?

The main idea of a moving window is to compute statistics in the vicinity of a given data sample by defining a window which includes the sample itself as well as some specified number of samples before and after the sample in question.

For a sample \(x_i\), we define a window \(W_i^{H,J}\) as \(W_i^{H,J} = \left\{ x_{i-H}, \dots, x_i, \dots, x_{i+J} \right\}\)

The parameters \(H\) and \(J\) are non-negative integers specifying the number of samples to include before and after the sample \(x_i\).

Statistics such as the mean and standard deviation of the window \(W_i^{H,J}\) may be computed, and then the window is shifted forward by one sample to focus on \(x_{i+1}\).

The total number of samples in the window is \(K = H + J + 1\).

To define a symmetric window centered on \(x_i\), one would set \(H = J = \left\lfloor K / 2 \right\rfloor\).

Exercise 1

Give some examples of how moving windows might be relevant in statistics.

Exercise 2

Generate a list of random numbers of length n and calculate a “moving” mean of length k over the data. You might find the Appendix entry on “list slicing” useful.

Exercise 3

Same as 2 but calculate the maximum (or minimum). You can use functions provided by python if you wish.

Exercise 4

Same as 3, but calculate the median.

Random variables with Numpy

The numpy library provides some functions for generating random variables from distributions that are not available in the random module. Here is an example.

from numpy.random import poisson
poisson(lam=1.0, size=None)
0

Exercise 5

Have a look at the section “Other 2D plot styles” in tha matplotlib chapter and plot a histogram of data drawn from a poisson distribution using numpy

Exercise 6

Use numpy to generate a list of random numbers drawn from the exponential distribution.

Exercise 7

Create a random sequence of uniformly distributed values in \([0,1]\) augmented with added noise that produces outliers with probability 0.05 that have a magnitude that is exponentialy distributed. Plot some examples of such seqences with different parameters usng matplotlib.

Exercise 8

Use your solutions to exercises 2, 3 and 4 to plot the moving mean, median, minimum and maximum using your data from 7.

More on moving windows

How do you account for the data at the ends of a moving window ? There are several ways of doing this, but two commonly used (and simple) methods are:

Padding

Pad Zero constructs a full window of length \(K\) by inserting zeros into the window near the signal end points. Effectively, the input signal is modified to

\(\tilde{x} = \{ \underbrace{0, \dots, 0}_{H \textrm{ zeros}}, x_1, x_2, \dots, x_{n-1}, x_n, \underbrace{0, \dots, 0}_{J \textrm{ zeros} } \}\)

to ensure a well-defined window for all \(x_i\).

Pad Value constructs a full window of length \(K\)
by padding the window with the first and last sample in the input signal. Effectively, the input signal is modified to

\(\tilde{x} = \{ \underbrace{x_1, \dots, x_1}_{H}, x_1, x_2, \dots, x_{n-1}, x_n, \underbrace{x_n, \dots, x_n}_{J} \}\)

Assessment

a) Write methods for calculating the following statistics on a moving window of length \(K\)

i) max [3]

ii) min [3]

iii) mean [3]

iv) median [3]

v) standard deviation [3]

vi) median absolute deviation [6]

You functions should take a list as an argument and return a list and can use any standard python functions (such as min) if you wish. However, you can only use the types of data structures described in the Appendix i.e. lists, strings, tuples, dictionaries, and sets. You can also use the array data structure from the numpy library if you wish.

b) Augment your methods from part a) to implement padding using

i) pad zero [3]

ii) pad value [3]

c) Generate some random sequences (of your choice) that include “outliers” and use them to visualise your methods developed in parts a) and 2b) with matplotlib. Use at least two distributions from the numpy package. Marks will be awarded for creativity both with respect to the structure of the data and the features in the plots. [10]

d) In terms of \(K\) (i.e not considering the length of the data \(N\)) determine the worst case complexity for each of your methods from part a). You should quote the complexity using “Big O” notation. Justify your answers using timing data produced using your methods and summarise this data graphically. Make sure you account for variation in the run time due to the structure of the data and random fluctuations in the avaialibity of CPU allocation on your computer system. [15]