1. Module Outline
1.1. Overview
An essential component of Statistics and Operational Research today is the computer implementation of models and algorithms. Although modern high-level programming languages such as Python and R have greatly facilitated coding for scientists and analysts, the nature of research often means coding projects are complicated, requiring many source code files, potentially thousands of lines of code, and a precise set of dependencies. For scientific research, where accuracy, transparency and reproducibility are key tenets, code must be particularly reliable, and academic journals are increasingly requiring the submission of source code alongside published articles.
Unfortunately, doctoral students in Statistics and Operational Research (and scientists more generally) usually have little or no formal training in computer science or experience of software engineering, and so are often ill-equipped for the task of developing and maintaining research code. As a result, research code is often seen as unreliable.
This module aims to address this potential deficiency by training you to produce scientific software that is replicable, reproducible, reusable, re-runnable, and repeatable.
The course is divided into three parts. In the first part, foundational computer science concepts will be introduced. In particular, we will cover the analysis of algorithms and data structures, different models of programming, and design patterns. The second part of the course then shows how these ideas are applied in practice, with particular reference to the software engineering practices associated with collaborative programming, software maintenance and support, testing, and distribution of software. The final part of the course consists of a larger group programming project for which you will be required to draw on the knowledge and skills developed in the rest of the course.
During the lectures and workshops, emphasis is placed on problems and techniques associated with the fields of statistics and operational research. Although programming concepts will be taught in a language-agnostic way, we will primarily use Python for teaching, as this is a popular language for scientific computing and it supports the required programming paradigms.
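As an informal illustration of what "programming paradigms" means here, the sketch below computes the mean of a small sample in imperative, object-oriented and functional styles. It is not taken from the course materials; the names `mean_imperative`, `RunningMean` and `mean_functional` are hypothetical and chosen only for this example.

```python
from functools import reduce


def mean_imperative(xs):
    """Imperative style: update a running total step by step."""
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)


class RunningMean:
    """Object-oriented style: state and behaviour live together in an object."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, x):
        self.total += x
        self.count += 1

    def value(self):
        return self.total / self.count


def mean_functional(xs):
    """Functional style: no mutation; the sum is a fold over the data."""
    return reduce(lambda acc, x: acc + x, xs, 0.0) / len(xs)


data = [2.0, 4.0, 6.0, 8.0]

rm = RunningMean()
for x in data:
    rm.add(x)

# All three approaches give the same answer for this sample.
print(mean_imperative(data), rm.value(), mean_functional(data))
```

Which style is most appropriate depends on the problem: the object-oriented version, for example, suits data that arrives one observation at a time.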
1.2. Prerequisites
Basic knowledge of Python, such as that provided by the short course “Introductory Python”.
Basic Unix skills, such as those acquired from the short course “Introduction to Unix”.
Access to both a Windows and a Unix-based system with Python and R installed.
1.3. Learning Outcomes
Students who pass this module should be able to:
analyse an algorithm in terms of its computational complexity and determine appropriate data structures for its implementation in software
use the three main models of programming (imperative, object-oriented and functional), and identify applicable generic design patterns
use software engineering tools such as profilers, debuggers, testing frameworks and environment management systems
understand basic computer architecture and operating systems and be able to parallelize code where appropriate
use tools and mechanisms for collaborative programming, and distribution and support of code within the wider research community
produce scientific software that is replicable, reproducible, reusable, re-runnable, and repeatable
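To give a flavour of the first outcome above, here is a minimal sketch of how the choice of data structure changes the computational complexity of a simple task: checking a sequence for duplicates. The function names are hypothetical and the example is illustrative only; it assumes the elements are hashable.

```python
def has_duplicates_quadratic(xs):
    """Compare every pair of elements: O(n^2) comparisons in the worst case."""
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] == xs[j]:
                return True
    return False


def has_duplicates_linear(xs):
    """Track seen elements in a set: O(n) expected time, O(n) extra memory."""
    seen = set()
    for x in xs:
        if x in seen:  # set membership is O(1) on average
            return True
        seen.add(x)
    return False


sample = [3, 1, 4, 1, 5]
print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))
```

Both functions return the same result; the second trades extra memory for a lower expected running time, which is the kind of design decision this outcome refers to.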
1.4. Teaching
This module will be taught by a mixture of lectures (10 hours total) and computer workshops (24 hours total).
1.5. Assessment
There are three main assessments for this course. Two smaller assessments will take place while the module is being taught, both to reinforce fundamental concepts and materials introduced in the earlier parts of the course and to provide students with formative feedback. The final assessment, which has the highest weighting, will take the form of a large coding project of the type students may have to undertake in their future research.
Details of each assessment are as follows:
Programming Assessment (Weighting: 25%)
The first assessment requires students to demonstrate an application of fundamental concepts regarding standard patterns and data structures to make appropriate software design decisions for a given algorithm. Students will then implement their design using an appropriate programming language. The implementation should include an element of parallelisation.
Software Engineering Assessment (Weighting: 25%)
The second assignment requires students to study and implement a given algorithm. The implementation should be engineered so as to be suitable for hosting on a public repository, such as PyPI or CRAN, and be under publicly accessible version control using, for example, GitHub. The software should have supporting material including, but not restricted to, documentation, use cases, and information regarding how it can be re-used and extended by other researchers.
Reproduction Group Project (Weighting: 50%)
The final coursework will consist of a programming reproduction exercise in which groups implement a methodology from the statistics or operational research literature as a package. This will require students to draw on all concepts and practices covered during the course. In addition to a code repository and report, this assessment includes a group presentation and a peer assessment component. The peer assessment is included to ensure that each team member is fairly rewarded for their contribution to the project. The final deadline for this assessment will be in week 25.
1.6. Timeline
These dates and timings are provisional and may be updated over the course of the module.
Part 1: Foundations (January-February)
Lecture 1: Introduction (1 hour) (presentation)
Tutorial 1 (3 hours)
Lecture 2: The 5 R’s (1 hour) (presentation)
Tutorial 2 (3 hours)
Lecture 3a: Functional Design Patterns (1 hour)
Tutorial 3a (3 hours)
Lecture 3b: Data Structures and Computational Complexity (2 hours) (presentation)
Tutorial 3b (3 hours)
Assessment 1:
Release date: Friday 31st January (after Lecture 3b)
Deadline: 10am on Friday 14th February
Part 2: Software Engineering (March)
Lecture 4a: Software Engineering Tools 1 (2 hours)
Tutorial 4a (2 hours)
Lecture 4b: Software Engineering Tools 2 (2 hours)
Tutorial 4b (2 hours)
Assessment 2:
Release date: Monday 10th March (after Lecture 4b)
Deadline: 10am on Friday 28th March
Part 3: Reproduction Project (April-May)
Lecture 5: Bringing It All Together
Assessment 3:
Release date: Monday 28th April (in Lecture 5)
Report and Code Deadline: 10am Tuesday 20th May
Peer Assessment Form Deadline: 10am on Thursday 22nd May
Group Presentation: Friday 23rd May
Tutorial 6: Assessment 3 support session (3 hours)
Tutorial 7: Assessment 3 support session (3 hours)
Tutorial 8: Assessment 3 support session (3 hours)
1.7. Teaching Staff and Contact
The module will be primarily taught by Dan Grose. Jamie Fairbrother will assist with teaching and assessment, and will be responsible for the administrative aspects of the module.
Jamie Fairbrother (module convenor)
Office: D32, Charles Carter
Dan Grose
Announcements for the module will be made in the Teams group. Additional resources, such as code, may also be shared there, and you may post your questions there too.
1.8. Course Materials and Further Reading
All lecture notes and tutorial worksheets will be provided through the course website. Although the lectures and workshops are self-contained, the following references may prove useful:
Benureau, F. C. Y. and Rougier, N. P. (2018). Re-run, Repeat, Reproduce, Reuse, Replicate: Transforming Code into Scientific Contributions. Frontiers in Neuroinformatics. 11. https://doi.org/10.3389/fninf.2017.00069
Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2009). Introduction to Algorithms (3rd Edition). MIT Press.
Van Roy, P. and Haridi, S. (2004). Concepts, Techniques, and Models of Computer Programming. MIT Press.
Gamma, E., Helm, R., Johnson, R. and Vlissides, J. M. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional.