8. Session 6: Data Structures - Computational Complexity#

8.1. Session Overview#

The objectives of this session are:

  • Understand data structures in a declarative sense, as mappings between sets

  • Understand computational complexity in terms of asymptotics

  • Understand the difference between several common asymptotic measures of computational complexity

  • Understand the difference between best, worst, and average cases

  • Know the basic data structures provided by Python, the mappings they correspond to, and how to use them

  • Know the computational complexity of popular Python implementations of common data structures

  • Know how to choose data structures according to use case

8.2. Data structures as mappings#

At a useful level of abstraction, a data structure is a way of representing and organizing data so that certain operations can be performed efficiently.

Most data structures can be viewed as implementations of one or more mathematical mappings between sets.

At minimum, a data structure represents a (possibly partial) function

\[f : K \to V\]

where \(K\) is a set of keys (indices, positions, labels, states) and \(V\) is a set of values.

8.2.1. Examples#

  • Arrays / vectors: a mapping from a finite index set \(\{0,\dots,n-1\}\) to values.

  • Hash tables / dictionaries: a mapping from an arbitrary key set to values, typically approximating constant-time access.

  • Trees and graphs: mappings from nodes to collections of nodes (adjacency), or from keys to values with additional structural constraints.

  • Matrices and tensors: mappings from multi-dimensional index tuples to values.

What distinguishes data structures is not the mapping itself, but the constraints imposed on how the mapping is stored and traversed (ordering, hierarchy, sparsity, locality).
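This mapping view is directly visible in Python (the names below are illustrative):

```python
# A list realizes a total function f : {0, ..., n-1} -> V.
values = [2.0, 3.5, 5.0]
assert values[1] == 3.5           # evaluate f(1)

# A dict realizes a (possibly partial) function f : K -> V over an arbitrary key set.
grades = {"alice": "A", "bob": "B"}
assert grades["alice"] == "A"     # evaluate f("alice")
assert "carol" not in grades      # f is partial: "carol" is outside the domain

# An adjacency dict realizes a mapping from nodes to collections of nodes.
graph = {"u": ["v", "w"], "v": ["w"], "w": []}
assert graph["u"] == ["v", "w"]   # neighbours of node "u"
```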

8.2.2. Operations and abstract interfaces#

Each data structure supports a set of operations that correspond to queries or transformations of the mapping:

  • lookup: evaluate \(f(k)\)

  • insert / delete: modify the domain or codomain

  • iterate: traverse the domain in a prescribed order

  • aggregate: compute functions over subsets of the mapping

Formally, a data structure can be seen as an algorithmic representation of an abstract data type (ADT): a set plus a collection of operations with specified semantics.

8.2.3. Computational complexity as the defining trade-off#

The central purpose of a data structure is to control the computational complexity of these operations.

  • Complexity is typically measured asymptotically in terms of:

    • time (e.g. \(O(1), O(\log n), O(n)\))

    • space (memory usage as a function of input size)

  • Different data structures represent the same abstract mapping but optimize different operations.

8.2.4. Examples of trade-offs#

  • An array gives \(O(1)\) access by index but \(O(n)\) insertion.

  • A balanced binary search tree gives \(O(\log n)\) lookup, insertion, and deletion while maintaining order.

  • A hash table sacrifices ordering to achieve expected \(O(1)\) lookup.

Choosing a data structure is therefore equivalent to choosing a complexity profile over a set of operations.

8.2.5. Summary#

  • Data structures mediate between mathematical objects (functions, relations, graphs, tensors) and computational reality (finite memory, cache behavior, parallelism).

  • They determine whether an algorithm is feasible at scale, not just correct.

  • Many performance bottlenecks arise not from the algorithmic idea, but from a mismatch between the structure of the data and the structure imposed by the data structure.

In short: a data structure is a concrete, complexity-aware realization of a mapping, designed to make specific operations fast while accepting trade-offs elsewhere.

8.3. Computational Complexity#

Computational complexity studies how the resources required by an algorithm scale with the size of the input. The two dominant resources are:

  • Time complexity: number of elementary operations executed

  • Space complexity: amount of memory used

Formally, if an algorithm takes an input of size \( n \), its time complexity is a function

\[T(n) = \text{number of steps executed on inputs of size } n.\]

The key idea is scaling behavior, not exact runtime. Complexity theory abstracts away:

  • hardware details,

  • constant factors,

  • implementation quirks,

and focuses on growth rates as \( n \to \infty \).

This is important because:

  • datasets grow,

  • asymptotic behavior dominates practical feasibility,

  • algorithm choice often beats micro-optimisation by orders of magnitude.
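A small timing sketch illustrates the last point: replacing a list membership test (O(n)) with a set membership test (expected O(1)) dwarfs any micro-optimisation of the scan itself. Exact timings vary by machine; the sizes here are illustrative.

```python
import timeit

n = 100_000
data_list = list(range(n))
data_set = set(data_list)
missing = -1                      # worst case for the list: scan all n elements

# Time 100 membership tests against each structure.
t_list = timeit.timeit(lambda: missing in data_list, number=100)
t_set = timeit.timeit(lambda: missing in data_set, number=100)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
assert t_set < t_list             # hashing beats scanning by orders of magnitude
```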

8.3.1. Asymptotic analysis#

8.3.1.1. Big-O notation (upper bounds)#

We say \(T(n) \in O(f(n))\) if there exist constants \( c > 0 \) and \( n_0 \) such that \(T(n) \le c f(n) \quad \forall n \ge n_0\).

Interpretation:

  • \( f(n) \) is an asymptotic upper bound on the algorithm’s growth.

Example:

  • \(T(n) = 3n^2 + 5n + 20 \in O(n^2)\).
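The claim can be checked mechanically by exhibiting witness constants; \( c = 4 \) and \( n_0 = 8 \) below are one valid choice among many.

```python
def T(n):
    # The running time from the example above.
    return 3 * n**2 + 5 * n + 20

c, n0 = 4, 8
# 3n^2 + 5n + 20 <= 4n^2  iff  n^2 - 5n - 20 >= 0, which holds for all n >= 8.
assert all(T(n) <= c * n**2 for n in range(n0, 10_000))
# The threshold n0 is genuinely needed: the bound fails just below it.
assert T(7) > c * 7**2
```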

8.3.1.2. Big-Ω notation (lower bounds)#

\( T(n) \in \Omega(f(n)) \quad \Longleftrightarrow \quad T(n) \ge c f(n) \text{ eventually}. \)

This gives a guaranteed minimum growth rate.

8.3.1.3. Big-Θ notation (tight bounds)#

\( T(n) \in \Theta(f(n)) \quad \Longleftrightarrow \quad T(n) \in O(f(n)) \cap \Omega(f(n)). \)

This means \( f(n) \) characterises the true asymptotic order.

8.3.1.4. Overview#

The above definitions are taken from Introduction to Algorithms.

In practice, Big-O is most commonly used, even when tighter bounds exist.

8.3.2. Best case, worst case, and average case#

An algorithm’s runtime may depend not just on input size \( n \), but on the structure of the input. We therefore distinguish cases.

8.3.2.1. Worst-case complexity#

\( T_{\text{worst}}(n) = \max_{\text{inputs}} T(\text{input}) \) where all of the inputs are of size \(n\).

  • Provides a guarantee

  • Common in algorithm analysis and systems design

  • Often pessimistic but robust

Example:
Linear search in an unsorted array

  • Worst case: element not present → \( O(n) \)

8.3.2.2. Best-case complexity#

\( T_{\text{best}}(n) = \min_{\text{inputs}} T(\text{input}) \) where all of the inputs are of size \(n\).

  • Rarely used alone

  • Can be misleading

  • Useful when best-case happens frequently or is structurally meaningful

Example:

  • Linear search: target is first element → \( O(1) \)
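Both cases become visible by instrumenting linear search to count comparisons (the helper below is illustrative):

```python
def linear_search(a, target):
    """Return (index, comparisons); index is -1 if target is absent."""
    comparisons = 0
    for i, x in enumerate(a):
        comparisons += 1
        if x == target:
            return i, comparisons
    return -1, comparisons

a = list(range(1000))
# Best case: target is the first element -> 1 comparison, O(1).
assert linear_search(a, 0) == (0, 1)
# Worst case: target absent -> n comparisons, O(n).
assert linear_search(a, -5) == (-1, 1000)
```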

8.3.2.3. Average-case complexity#

\( T_{\text{avg}}(n) = \mathbb{E}[T(X_n)] \) where \( X_n \) is a random input of size \( n \) drawn from some probability distribution.

  • Average case is only meaningful relative to an assumed input distribution.

Example:

  • Quicksort:

    • Worst case: \( O(n^2) \)

    • Average case (under random permutations): \( \Theta(n \log n) \)

Importantly, this connects naturally to:

  • probabilistic modelling of data,

  • assumptions about randomness,

  • distribution-sensitive algorithms.
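A minimal quicksort sketch makes the worst-versus-average gap concrete. The first-element pivot rule and the counting scheme (one comparison counted per element partitioned) are illustrative choices; on already-sorted input every pivot is the minimum, so partitioning degenerates.

```python
import random

def quicksort(a, stats):
    # First-element pivot: O(n^2) on sorted input, Theta(n log n) on average.
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    stats["comparisons"] += len(rest)       # one count per element partitioned
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    return quicksort(left, stats) + [pivot] + quicksort(right, stats)

n = 400
worst = {"comparisons": 0}
assert quicksort(list(range(n)), worst) == list(range(n))   # sorted input
avg = {"comparisons": 0}
assert quicksort(random.sample(range(n), n), avg) == list(range(n))

assert worst["comparisons"] == n * (n - 1) // 2   # quadratic: 79800 for n = 400
assert avg["comparisons"] < worst["comparisons"] // 5
```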

8.3.2.4. Amortised analysis: averaging over sequences, not inputs#

Amortised complexity analyzes the average cost per operation over a sequence of operations, even if individual operations are expensive.

Crucially:

  • No probability is involved

  • Guarantees hold for every sequence

Let a sequence of \( m \) operations take total time \( T(m) \).
The amortised cost per operation is: \( \frac{T(m)}{m}. \)
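A sketch of the textbook doubling dynamic array shows the idea. Individual appends occasionally trigger an O(n) copy, but the total copy work over \( m \) appends stays below \( 2m \), so the amortised cost per append is O(1). (This class is illustrative; it is not CPython's actual list implementation, which over-allocates less aggressively.)

```python
class DynArray:
    """Doubling dynamic array that counts how many element copies it performs."""

    def __init__(self):
        self.capacity, self.size, self.copies = 1, 0, 0
        self.data = [None]

    def append(self, x):
        if self.size == self.capacity:       # full: double capacity, copy all
            self.data = self.data + [None] * self.capacity
            self.capacity *= 2
            self.copies += self.size
        self.data[self.size] = x
        self.size += 1

m = 10_000
arr = DynArray()
for i in range(m):
    arr.append(i)

# Copies happen at sizes 1, 2, 4, ..., 8192, so total copy work is < 2m.
assert arr.copies < 2 * m
assert arr.data[:m] == list(range(m))
```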

8.4. Computational Complexity of Common Python Data Structures#

8.4.1. list#

| Operation | Time |
|---|---|
| Index access a[i] | O(1) |
| Append append(x) | O(1) amortized |
| Pop end pop() | O(1) |
| Insert / delete middle | O(n) |
| Membership x in a | O(n) |
| Sort | O(n log n) |

8.4.2. dict#

| Operation | Time |
|---|---|
| Lookup d[k] | O(1) avg |
| Insert / update | O(1) avg |
| Delete | O(1) avg |
| Membership k in d | O(1) avg |
| Iterate | O(n) |

  • Worst-case O(n) (hash collisions, rare)

8.4.3. set#

| Operation | Time |
|---|---|
| Add | O(1) avg |
| Remove | O(1) avg |
| Membership | O(1) avg |
| Union / intersect | O(n) |

8.4.4. deque#

| Operation | Time |
|---|---|
| Append right / left | O(1) |
| Pop right / left | O(1) |
| Index access d[i] | O(n) |
| Insert / delete middle | O(n) |

  • Best for queues, BFS, sliding windows
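The sliding-window use case can be sketched in a few lines with maxlen (stream values are illustrative):

```python
from collections import deque

window = deque(maxlen=3)          # fixed-size window: old items drop out in O(1)
means = []
for x in [10, 20, 30, 40, 50]:
    window.append(x)              # appending past maxlen evicts the oldest item
    means.append(sum(window) / len(window))

assert list(window) == [30, 40, 50]
assert means == [10.0, 15.0, 20.0, 30.0, 40.0]
```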

8.4.5. namedtuple#

| Operation | Time |
|---|---|
| Field access p.x | O(1) |
| Index access | O(1) |
| Creation | O(n) |
| Mutation | ❌ (immutable) |

  • Same performance as tuple, just nicer syntax

8.4.6. SortedList#

| Operation | Time |
|---|---|
| Index access | O(log n) |
| Insert | O(log n) |
| Delete | O(log n) |
| Membership | O(log n) |
| Bisect | O(log n) |

8.4.7. SortedDict#

| Operation | Time |
|---|---|
| Lookup | O(log n) |
| Insert | O(log n) |
| Delete | O(log n) |
| Min / max key | O(1) |

8.4.8. SortedSet#

| Operation | Time |
|---|---|
| Add | O(log n) |
| Remove | O(log n) |
| Membership | O(log n) |
| Iterate (sorted) | O(n) |

8.5. Exercises#

8.5.1. list#

8.5.1.1. Key ideas#

  • Ordered, mutable, allows duplicates

  • Fast append, slow insert/delete in the middle

8.5.1.2. Exercise 1 - Indexing and slicing#

```python
data = [3, 5, 7, 11, 13, 17]
```

  1. Extract the last three elements using slicing

  2. Reverse the list using slicing (no loops)

  3. What happens if you try to access data[10]?

8.5.1.3. Exercise 2: Mutation#

```python
data = [10, 20, 30]
```

  1. Append 40

  2. Insert 15 at index 1

  3. Remove 20 using two different methods

  4. Observe the return values of append, insert, and remove

8.5.1.4. Exercise 3: Performance intuition#

  1. Time how long it takes to:

    • append() 100,000 items

    • insert(0, x) 100,000 items

  2. Explain the difference.

8.5.1.5. Use cases#

  • Storing raw observations

  • Feature vectors

  • Sequential data (time series before indexing)

  • Mini-batches during model training

8.5.2. tuple#

8.5.2.1. Key ideas#

  • Ordered, immutable

  • Hashable (if contents are hashable)

8.5.2.2. Exercise 1: Immutability#

```python
point = (3, 4)
```

  1. Try changing point[0]

  2. Create a new tuple with x doubled

8.5.2.3. Exercise 2: Tuple unpacking#

```python
record = ("Alice", 22, "Biology")
```

  1. Unpack into three variables

  2. Swap two variables without using a temp variable

8.5.2.4. Exercise 3: Tuples as keys#

```python
locations = {
    (40.7, -74.0): "New York",
    (34.0, -118.2): "Los Angeles"
}
```

  1. Look up a city

  2. Explain why lists can’t be used as keys

8.5.2.5. Use cases#

  • Fixed records (rows)

  • Coordinates (lat, long)

  • Dictionary keys for multi-dimensional indexing

  • Lightweight, memory-efficient records

8.5.3. set#

8.5.3.1. Key ideas#

  • Unordered, mutable

  • No duplicates

  • Fast membership tests

8.5.3.2. Exercise 1: Deduplication#

```python
ids = [101, 102, 101, 103, 102]
```

  1. Convert to a set

  2. Convert back to a list

  3. What information was lost?

8.5.3.3. Exercise 2: Set operations#

```python
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
```

  1. Compute union, intersection, difference

  2. Use both operators and methods

8.5.3.4. Exercise 3: Membership timing#

  1. Time x in list vs x in set for large collections

  2. Explain why the set is faster

8.5.3.5. Use cases#

  • Unique identifiers

  • Feature vocabularies

  • Removing duplicates

  • Fast filtering and membership checks

8.5.5. dict#

8.5.5.1. Key ideas#

  • Key–value mapping

  • Insertion-ordered (Python ≥3.7)

  • Fast lookups

8.5.5.2. Exercise 1: Basic operations#

```python
counts = {}
```

  1. Count word frequencies from a list of strings

  2. Rewrite using dict.get

  3. Rewrite using collections.defaultdict

8.5.5.3. Exercise 2: Iteration#

```python
data = {"A": 10, "B": 20, "C": 30}
```

  1. Iterate over keys

  2. Iterate over values

  3. Iterate over key–value pairs

8.5.5.4. Exercise 3: Dictionary views#

  1. Store keys(), values(), items() in variables

  2. Modify the dictionary

  3. Observe how the views change

8.5.5.5. Use cases#

  • Feature mappings

  • Label encoding

  • Aggregations

  • Sparse representations

8.5.6. str#

8.5.6.1. Key ideas#

  • Immutable sequence of characters

  • Rich API for text processing

8.5.6.2. Exercise 1: Indexing and slicing#

```python
text = "Data Science"
```

  1. Extract "Science"

  2. Reverse the string

  3. Get every second character

8.5.6.3. Exercise 2: String methods#

  1. Convert to lowercase

  2. Replace spaces with _

  3. Split into words

8.5.6.4. Exercise 3: Immutability#

  1. Try modifying a character

  2. Create a modified copy instead

8.5.6.5. Use cases#

  • Text preprocessing

  • Tokenization

  • Feature extraction

  • Labels and metadata

8.5.7. deque#

8.5.7.1. Key ideas#

  • Double-ended queue

  • Fast appends and pops from both ends

8.5.7.2. Exercise 1: Basic operations#

```python
from collections import deque
dq = deque([1, 2, 3])
```

  1. Append to the right

  2. Append to the left

  3. Pop from both ends

8.5.7.3. Exercise 2: Sliding window#

  1. Use a deque of max length 5

  2. Simulate streaming data

  3. Observe how old values are discarded

8.5.7.4. Exercise 3: Compare with list#

  1. Time pop(0) on a list

  2. Time popleft() on a deque

8.5.7.5. Use cases#

  • Streaming data

  • Rolling averages

  • Buffers and queues

  • Time-windowed analytics

8.5.8. namedtuple#

8.5.8.1. Key ideas#

  • Tuple with named fields

  • Immutable

  • Lightweight alternative to classes

8.5.8.2. Exercise 1: Creation#

```python
from collections import namedtuple
Student = namedtuple("Student", ["name", "age", "major"])
```

  1. Create a student

  2. Access fields by name and index

8.5.8.3. Exercise 2: Immutability#

  1. Try modifying a field

  2. Use _replace to create a new version

8.5.8.4. Exercise 3: Comparison#

  1. Compare a namedtuple to:

    • tuple

    • dictionary

  2. Discuss readability vs flexibility

8.5.8.5. Use cases#

  • Structured records

  • Configuration objects

  • Model outputs

  • Lightweight row representations

8.5.9. SortedList#

8.5.9.1. Key ideas#

  • Maintains sorted order automatically

  • Fast lookup and slicing

8.5.9.2. Exercise 1: Automatic sorting#

```python
from sortedcontainers import SortedList
sl = SortedList()
```

  1. Add numbers in random order

  2. Observe ordering after each insert

8.5.9.3. Exercise 2: Indexing and slicing#

  1. Access smallest and largest values

  2. Slice the middle range

8.5.9.4. Exercise 3: Comparison#

  1. Compare with:

    • sorting a list repeatedly

    • heapq

  2. Discuss trade-offs

8.5.9.5. Use cases#

  • Online statistics

  • Maintaining quantiles

  • Ranked scores

  • Real-time leaderboards

8.5.10. SortedSet#

8.5.10.1. Key ideas#

  • Sorted + unique

  • Combines set semantics with order

8.5.10.2. Exercise 1: Deduplication + order#

```python
from sortedcontainers import SortedSet
ss = SortedSet([5, 1, 3, 3, 2])
```

  1. Observe ordering

  2. Try adding duplicates

8.5.10.3. Exercise 2: Set operations#

  1. Intersection and union with another SortedSet

  2. Compare with regular set

8.5.10.4. Data science use cases#

  • Unique sorted categories

  • Feature vocabularies with order

  • Ranked identifiers

8.5.11. SortedDict#

8.5.11.1. Key ideas#

  • Dictionary sorted by keys

  • Efficient range queries

8.5.11.2. Exercise 1: Ordered keys#

```python
from sortedcontainers import SortedDict
sd = SortedDict()
```

  1. Insert keys out of order

  2. Iterate over keys and values

8.5.11.3. Exercise 2: Range queries#

  1. Retrieve items with keys between two values

  2. Compare with normal dict

8.5.11.4. Use cases#

  • Time-indexed data

  • Ordered categorical mappings

  • Efficient range filtering

  • Event logs

8.5.12. Summary Exercise#

  1. Which structures preserve order? Which don’t?

  2. Which are mutable vs immutable?

  3. When would sortedcontainers outperform built-in types?

  4. Which data structures would you choose for:

    • Streaming data

    • Text analysis

    • Real-time rankings

    • Sparse features

8.6. Data Structure Choice#

8.6.1. Example 1#

You need to store a growing sequence of observations in arrival order

Choice: list

Why:

  • Preserves insertion order

  • Fast append(): O(1) amortized

  • Simple indexing and slicing

Why not others:

  • deque is unnecessary unless removing from the front

  • SortedList adds overhead if order of arrival matters

8.6.2. Example 2#

You need a fixed record representing a row of data (name, age, major)

Choice: namedtuple

Why:

  • Immutable → safer for records

  • Clear, readable field access (student.age)

  • More memory-efficient than a dictionary

Why not others:

  • dict is flexible but less structured

  • tuple lacks readability

8.6.3. Example 3#

You need fast membership tests for millions of IDs

Choice: set

Why:

  • Average-case O(1) membership tests

  • Automatically removes duplicates

Why not others:

  • list membership is O(n)

  • SortedSet adds overhead if order is irrelevant

8.6.4. Example 4#

You need to count frequencies of categories or words

Choice: dict (or defaultdict)

Why:

  • Fast updates and lookups → O(1) average

  • Natural key → count mapping

  • Scales well for sparse data

Why not others:

  • list would require searching

  • set cannot store counts
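The counting idioms referenced above can be sketched side by side (the word list is illustrative):

```python
from collections import Counter, defaultdict

words = ["cat", "dog", "cat", "bird", "cat"]

# Plain dict with get(): O(1) average per update.
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# defaultdict(int) supplies the missing-key default automatically.
dcounts = defaultdict(int)
for w in words:
    dcounts[w] += 1

assert counts == {"cat": 3, "dog": 1, "bird": 1}
assert dict(dcounts) == counts
assert Counter(words) == counts    # Counter is the idiomatic one-liner
```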

8.6.5. Example 5#

You need to remove duplicates but also keep values sorted

Choice: SortedSet

Why:

  • Enforces uniqueness like a set

  • Maintains sorted order automatically

  • Efficient iteration in order

Why not others:

  • set is unordered

  • list requires manual sorting and duplicate removal

8.6.6. Example 6#

You need a rolling window over streaming data

Choice: deque

Why:

  • Fast insertion/removal at both ends → O(1)

  • Supports fixed-size windows

  • Designed for queue-like behavior

Why not others:

  • list.pop(0) is O(n)

  • SortedList is unnecessary if only arrival order matters

8.6.7. Example 7#

You need to compute rolling medians in real time

Choice: SortedList

Why:

  • Maintains sorted order on insertion → O(log n)

  • Median retrieval is O(1)

  • Ideal for online statistics

Why not others:

  • Sorting a list repeatedly is expensive

  • heapq is more complex for median logic
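The rolling-median logic can be sketched with the standard-library bisect module; with SortedList the insert and remove steps would be O(log n) rather than the O(n) shifts of insort on a plain list, but the algorithm is the same. The function names below are illustrative.

```python
from bisect import bisect_left, insort

def median(a):
    # Median of an already-sorted list.
    n = len(a)
    return a[n // 2] if n % 2 else (a[n // 2 - 1] + a[n // 2]) / 2

def rolling_medians(stream, k):
    """Median of every full window of size k over the stream."""
    window = sorted(stream[:k])
    out = [median(window)]
    for i in range(k, len(stream)):
        window.pop(bisect_left(window, stream[i - k]))  # remove outgoing value
        insort(window, stream[i])                       # insert incoming value
        out.append(median(window))
    return out

assert rolling_medians([5, 2, 8, 3, 9, 1], 3) == [5, 3, 8, 3]
```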

8.6.8. Example 8#

You need to store time-indexed data and query ranges

Choice: SortedDict

Why:

  • Keys always sorted

  • Efficient range queries

  • Natural fit for timestamps

Why not others:

  • Regular dict requires manual sorting

  • list loses key–value semantics
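The range-query idea can be sketched with the standard library: keep the keys sorted and bisect for the slice. SortedDict maintains the sorted-keys invariant automatically and exposes irange() for exactly this query; the event data below is illustrative.

```python
from bisect import bisect_left, bisect_right

events = {3: "boot", 7: "login", 12: "query", 20: "logout"}  # timestamp -> event
keys = sorted(events)             # SortedDict keeps keys in this order for us

def range_query(lo, hi):
    # All events with lo <= timestamp <= hi, found via two binary searches.
    i, j = bisect_left(keys, lo), bisect_right(keys, hi)
    return {k: events[k] for k in keys[i:j]}

assert range_query(5, 15) == {7: "login", 12: "query"}
```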

8.6.9. Example 9#

You need to store coordinates or multi-dimensional keys

Choice: tuple (as dictionary keys)

Why:

  • Immutable and hashable

  • Lightweight

  • Safe as dictionary keys

Why not others:

  • list is mutable and unhashable

  • a nested dict adds unnecessary indirection

8.6.10. Example 10#

You need to preprocess and clean text data

Choice: str (with list/dict support)

Why:

  • Rich built-in API

  • Immutable → safe transformations

  • Works well with pipelines

Why not others:

  • list of characters is harder to manage

  • Mutability adds bugs without benefit

8.7. Summary#

8.7.1. Cheat Sheet#

Choose the simplest structure that:

  • supports the required operations efficiently

  • matches the data’s mutability and ordering needs

  • minimizes unnecessary overhead

| Need | Best Choice |
|---|---|
| Fast append | list |
| Fast front removal | deque |
| Unique items | set |
| Key → value | dict |
| Fixed record | namedtuple |
| Sorted data | SortedList |
| Sorted + unique | SortedSet |
| Sorted keys | SortedDict |