14. Assessment#

14.1. Overview#

This is the assessment for the M550 Python Module. It counts as 30% of your overall mark for M550.

The assessment is to be undertaken in small groups and your work should be submitted in the form of a single Jupyter notebook. You should already have been assigned a group to work in - please contact the course coordinator if you have any issues regarding this.

The submission date for this assessment is 15/11/2023. Your Jupyter notebook should be emailed directly to Daniel Grose by 09:00 on this date (one copy per group).

Before your results and feedback are returned you might be asked to have a short (approximately 5 minutes) individual online “interview” to discuss some aspects of your work. The outcome of this interview might affect your overall individual score.

The marks allocated to each part of the assessment are

Part 1 - 5 marks
Part 2 - 10 marks
Part 3 - 10 marks
Part 4 - 10 marks
Part 5 - 15 marks
Part 6 - 15 marks
Part 7 - 20 marks
Part 8 - 15 marks

Total - 100 marks

To score full marks in the questions you must show all of your code and calculations, make good use of plots for visualising and summarising your data, discuss any limitations associated with your approach, and comment your code. The questions are “exploratory” in nature, so you are expected to reflect on what you are being asked to do and research the methods being developed. For example, in Part 7, you might choose to investigate the effect of the quadrat size on the power of the method for determining non-randomness, or you might want to say something about how the likelihood of obtaining a particular result might be quantified, and so on.

You can use the internet to search for examples of python code that may be useful, but do make reference to your sources.

If you have any questions regarding the course and/or the assessment, please do not hesitate to contact me by e-mail.

14.2. Part 1#

The following code uses the numpy and pandas libraries to generate a data frame of 100 points uniformly distributed in the unit square.

import numpy as np
import pandas as pd

# 100 points with X and Y coordinates drawn uniformly from [0, 1]
uniform_data = pd.DataFrame({"X": np.random.uniform(0, 1, 100), "Y": np.random.uniform(0, 1, 100)})

Using a suitable plot, visualise the data.
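One possible starting point, sketched with matplotlib (many other plot choices would also be suitable):

import matplotlib.pyplot as plt

# simple scatter plot of the random points
plt.scatter(uniform_data["X"], uniform_data["Y"])
plt.xlabel("X")
plt.ylabel("Y")
plt.gca().set_aspect("equal")  # so the unit square is not distorted
plt.show()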

14.3. Part 2#

The distance between two points \(p_{i} = (x_{i},y_{i})\) and \(p_{j} = (x_{j},y_{j})\) can be calculated using the Euclidean distance as

\(d_{ij} = \sqrt{(x_{i}-x_{j})^{2} + (y_{i}-y_{j})^{2}}\)

For a given point \(p_{i} = (x_{i},y_{i})\), its nearest neighbour is the point \(p_{j} = (x_{j},y_{j})\), \(i \neq j\), such that the distance \(d_{ij}\) is the smallest amongst all distances from \(p_{i}\) to all other points.
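As an illustration of this definition, here is a minimal brute-force numpy sketch, assuming the uniform_data frame from Part 1; for larger data sets a spatial index such as scipy.spatial.cKDTree would be more efficient.

import numpy as np

# coordinates as an (n, 2) array
points = uniform_data[["X", "Y"]].to_numpy()

# all pairwise Euclidean distances via broadcasting
diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# exclude each point's zero distance to itself, then take the row minima
np.fill_diagonal(dists, np.inf)
nearest = dists.min(axis=1)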

Use Python to create a pandas data frame containing coordinates of random points along with a column containing the distance from each point to its nearest neighbour. Your data might be organised something like the (truncated) example shown below

           x         y         d
0   0.623908  0.632107  0.020188
1   0.031473  0.972909  0.055497
2   0.307875  0.505808  0.039454
3   0.773175  0.807915  0.064741
4   0.208603  0.902263  0.017473
..       ...       ...       ...
95  0.878250  0.173126  0.062597
96  0.866103  0.276347  0.044193

14.4. Part 3#

If \(n\) points are placed at random in a region with area \(A\) then the expected value of the distance from each point to its nearest neighbour is given by Clark and Evans (1954) as

\(E[d] = \frac{\sqrt{\sigma}}{2}\)

where \(d\) is the distance and \(\sigma = A/n\) is the area per point, the reciprocal of the population density \(n/A\).
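For example, with \(n = 100\) points in the unit square (\(A = 1\)), \(\sigma = 1/100\) and \(E[d] = \sqrt{0.01}/2 = 0.05\).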

Using your result from Part 2, determine if your randomly generated data is consistent with this claim.
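A minimal sketch of such a check, assuming your Part 2 frame is uniform_data with the nearest neighbour distances in a column d:

# observed mean nearest neighbour distance from the Part 2 data frame
observed = uniform_data["d"].mean()

# expected value under the Clark and Evans claim for the unit square (A = 1)
expected = np.sqrt(1 / len(uniform_data)) / 2

print(f"observed = {observed:.4f}, expected = {expected:.4f}")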

14.5. Part 4#

By repeatedly generating random data and calculating the mean nearest neighbour distances, determine approximate values \([a,b]\) for a 95% confidence interval of \(E[d]\) in the unit square.

Is your confidence interval centred on the value of \(E[d]\) proposed by Clark and Evans? If not, why might this be the case? Does the confidence interval depend on the size of the random samples? If so, how does it vary?
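One hedged sketch of the simulation, where mean_nn_distance is a hypothetical helper along the lines of the Part 2 calculation:

import numpy as np

def mean_nn_distance(n):
    # mean nearest neighbour distance for n uniform points in the unit square
    points = np.random.uniform(0, 1, (n, 2))
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1).mean()

# many replicates of the mean, then the 2.5% and 97.5% percentiles
means = np.array([mean_nn_distance(100) for _ in range(1000)])
a, b = np.percentile(means, [2.5, 97.5])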

14.6. Part 5#

Using numpy or otherwise, generate some examples of non-uniformly distributed points in the plane and see if the mean nearest neighbour distances are contained inside an appropriately determined 95% confidence interval.
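For example, a clustered pattern might be sketched as follows, assuming the imports from Part 1 (the number of clusters, points per cluster, and spread are arbitrary choices):

# a clustered pattern: Gaussian blobs around five random centres
centres = np.random.uniform(0, 1, (5, 2))
cluster = np.repeat(centres, 20, axis=0) + np.random.normal(0, 0.03, (100, 2))
cluster = np.clip(cluster, 0, 1)  # keep every point inside the unit square
clustered_data = pd.DataFrame({"X": cluster[:, 0], "Y": cluster[:, 1]})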

How good do you think the statistic proposed by Clark and Evans is for determining if spatial points are randomly distributed?

It is possible to create a non-random (pathological) data set that has a mean nearest neighbour distance equal to the Clark and Evans statistic. Can you create/describe such a set of data?

14.7. Part 6#

The statistic proposed by Clark and Evans can be useful for determining non-randomness in spatial data, but it does not quantify how the data deviates from the assumption of randomness. The use of quadrat-based methods can help describe how a spatial pattern is distributed.

Typically, a quadrat-based method uses a contiguous regular grid to “bin” the spatially distributed data into counts. If the bins are of equal size, and the data is randomly distributed in the region of the grid, then the counts \(x\) in each bin would be expected to have a Poisson distribution

\(P(x) = \frac{{\rm e}^{-\lambda}\lambda^{x}}{x!}\)

For \(N\) samples and a grid with area \(S\) and quadrat area \(A\), the value of \(\lambda\) is

\(\lambda = N\frac{A}{S}\)
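For example, \(N = 100\) points in the unit square (\(S = 1\)) binned by \(0.1 \times 0.1\) quadrats (\(A = 0.01\)) gives \(\lambda = 100 \times 0.01 / 1 = 1\).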

Use Python to automate the process of generating the coordinates of the vertices of an \(n\) by \(m\) grid made up of rectangles of dimensions \(\delta x\) by \(\delta y\). Your data should be in a pandas data frame and each row should have two x coordinates and two y coordinates which define each quadrat in the grid. Your data frame might look something like the one shown below

     x0   x1   y0   y1
0   0.0  0.1  0.0  0.1
1   0.0  0.1  0.1  0.2
2   0.0  0.1  0.2  0.3
3   0.0  0.1  0.3  0.4
4   0.0  0.1  0.4  0.5
..  ...  ...  ...  ...
95  0.9  1.0  0.5  0.6
96  0.9  1.0  0.6  0.7
97  0.9  1.0  0.7  0.8
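One way this grid might be generated is sketched below, where make_grid is a hypothetical helper and the 10 by 10 unit-square grid with \(\delta x = \delta y = 0.1\) matches the example above:

import pandas as pd

def make_grid(n, m, dx, dy):
    # quadrat corner coordinates for an n by m grid of dx by dy rectangles
    quadrats = [(i * dx, (i + 1) * dx, j * dy, (j + 1) * dy)
                for i in range(n) for j in range(m)]
    return pd.DataFrame(quadrats, columns=["x0", "x1", "y0", "y1"])

grid = make_grid(10, 10, 0.1, 0.1)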

14.8. Part 7#

Using your result from Part 6 and some appropriately generated data, determine the counts for each quadrat in the grid and summarise your results in a data frame. Your data frame might look something like

     x0   x1   y0   y1  count
0   0.0  0.1  0.0  0.1      0
1   0.0  0.1  0.1  0.2      2
2   0.0  0.1  0.2  0.3      0
3   0.0  0.1  0.3  0.4      0
4   0.0  0.1  0.4  0.5      1
..  ...  ...  ...  ...    ...
95  0.9  1.0  0.5  0.6      0
96  0.9  1.0  0.6  0.7      3
97  0.9  1.0  0.7  0.8      2
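One possible sketch of the counting step, where quadrat_counts is a hypothetical helper, grid is the frame from Part 6, and the point data is assumed to have columns X and Y:

def quadrat_counts(grid, data):
    # count the points falling inside each quadrat; the half-open test on
    # the upper edges avoids double counting points on shared boundaries
    counts = [((data["X"] >= q.x0) & (data["X"] < q.x1) &
               (data["Y"] >= q.y0) & (data["Y"] < q.y1)).sum()
              for q in grid.itertuples()]
    result = grid.copy()
    result["count"] = counts
    return result

counted = quadrat_counts(grid, uniform_data)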

Compare your “count” data with the appropriate Poisson distribution.
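For the comparison, the pmf from scipy.stats.poisson could be used; a minimal sketch, assuming the counted frame from above and \(\lambda = 1\):

from scipy.stats import poisson

# lambda = N * A / S for 100 points and 0.1 x 0.1 quadrats on the unit square
lam = 100 * 0.01 / 1.0

# observed frequency of each count value against the Poisson expectation
observed_freq = counted["count"].value_counts().sort_index()
expected_freq = [len(counted) * poisson.pmf(k, lam) for k in observed_freq.index]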

Explain how you would use your results to detect non-random data. Would this quadrat-based method detect non-randomness in your “pathological” data from Part 5?

Using appropriate visualisations and statistics, compare and contrast the quadrat method with the approach proposed by Clark and Evans.

14.9. Part 8#

The data set ./data/trees.txt contains the (normalised) x y coordinates of the locations of a certain species of tree in a square region in a small wood. The data set is tab separated.
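A minimal sketch for reading the file, assuming the path given above (whether the file contains a header row should be checked against the actual file):

import pandas as pd

# read the tab separated tree coordinates; adjust header / column
# names to match the actual layout of the file
trees = pd.read_csv("./data/trees.txt", sep="\t")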

Using the methods you have developed in the previous parts of this assessment, investigate the claim that these trees are not randomly distributed. Write a short report describing your investigation and use your results to form a conclusion regarding the claim of non-randomness. You can assume that the readership of your report has statistical training, but it should also include an “executive summary” for non-technically trained readers.