2. Session 1 : Introduction to the Course - The 5Rs - Programming Challenges#
2.1. Programming and Science#
This program of study sets out to build an understanding of how to design and create high quality scientific software. Typically a practitioner acquires this understanding through years of practice and experience, and the underlying ideas and processes that it consists of are only rarely, if indeed ever, made explicit. This is not unusual in itself. In fact (at least in the United Kingdom), a complete program of scientific training can be undertaken, starting in school and finishing in graduate study, with no explicit reference ever being made to any specific scientific methodologies or their underlying principles and philosophical origins.
Maybe this is not a bad thing ? Indeed, by its very nature, a scientific methodology is unlikely to capture all the aspects associated with “doing” science, such as creativity, imagination, and guessing. Furthermore, in practice, and for many people, “doing” something is often much more enjoyable than “thinking” about it. Of course, when considering academics, the converse is sometimes true, especially when it comes to philosophers !!
Maybe somewhere between the two extremes is a good place to position a postgraduate level course on scientific programming. It is this approach that has been adopted here. In doing so it is hoped that the student will acquire an understanding that is more fundamental and therefore (it is hoped) broader in application. For example, software design is studied in a language “agnostic” way, focussing on the core abstractions and ideas that make for good software design. However, the student is also required to realise these designs using one or more programming languages and systems and, in doing so, acquire some of the day-to-day practical skills necessary for creating quality scientific software.
A further hoped for benefit of teaching programming in a language agnostic manner is that it should furnish the student with the ability to assess the suitability and limitations of any particular language within the context of a given problem domain. This is not just wishful thinking. It is based on the belief that the language we use has an effect on how and what it is we think about. Moreover, common experience suggests that those who learn programming through a particular programming language continue to use the main idioms of this language when programming in other languages. This usually leads to poor structure which, in turn, compromises code quality.
2.2. Software Quality#
Quality software is not just about the quality of the programming. It also concerns other ideas, such as accessibility, portability, scalability, extensibility, and so on. How then do we assess the quality of software ? As with most attempts to answer questions regarding quality, it only makes sense to do so when the setting in which quality is to be assessed is understood. The basic assumption for this course is that the setting is scientific software as opposed to, for example, games software, or software for micro-controllers in commercial machines, or multimedia applications.
Does this really make any difference ? Should scientific software be considered differently ? What qualities should scientific software have ? How does the “development life cycle” of scientific software differ (if at all) from other types of software ? These questions, along with other related ones, will be addressed as part of this course.
2.3. Course Objectives#
The main objectives of this course are
Provide a framework for assessing the quality of scientific software
Provide an understanding of generic programming patterns for representing and transforming abstractions appropriate for scientific software design and development
Provide basic software engineering skills appropriate for creating high quality scientific software
2.4. Key Ideas#
To help make progress toward these objectives, it is useful to set out the key ideas that are used to arrive at the concepts that support them.
Note that the ideas are presented as premises. This is done to emphasise that they are (as far as this course is concerned) only propositions and, as such, are open to the process of discussion, refinement, and change. Students are very much encouraged to actively take part in this process !
2.4.1. Premise 1#
Computation is Transformation of Representations
2.4.2. Premise 2#
Appropriate Representations and Transformations are Domain Specific
2.4.3. Premise 3#
The Representations most commonly applied in the domain of scientific computing are Mathematical in nature.
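As an informal illustration of Premises 1 to 3, the sketch below represents a polynomial as a tuple of coefficients and treats differentiation and evaluation as transformations of that representation. The coefficient-tuple representation and the names `Polynomial`, `differentiate` and `evaluate` are assumptions made for this example only; they are not taken from the course material.

```python
from typing import Tuple

# Representation (assumed for illustration): the polynomial
# c0 + c1*x + c2*x**2 + ... is stored as the tuple (c0, c1, c2, ...).
Polynomial = Tuple[float, ...]


def differentiate(p: Polynomial) -> Polynomial:
    """Transform the representation of a polynomial into that of its derivative."""
    # The term c_k * x**k becomes k * c_k * x**(k - 1); the constant term drops out.
    derivative = tuple(k * c for k, c in enumerate(p))[1:]
    return derivative if derivative else (0.0,)


def evaluate(p: Polynomial, x: float) -> float:
    """Transform a polynomial representation and a point into a number (Horner's rule)."""
    result = 0.0
    for c in reversed(p):
        result = result * x + c
    return result


p = (1.0, 0.0, 3.0)            # represents 1 + 3x^2
dp = differentiate(p)          # represents 6x
print(dp, evaluate(dp, 2.0))   # prints (0.0, 6.0) 12.0
```

The same mathematical object could equally be represented by its roots or by sampled values; which representation is appropriate depends on the transformations the problem domain requires.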
2.4.4. Premise 4#
When it is part of a scientific endeavour, the quality of computing should be quantified with respect to the same metrics as the overall endeavour. These metrics naturally include Replicability, Reproducibility, and Repeatability.
2.4.5. Premise 5#
In addition to Replicability, Reproducibility, and Repeatability, the quality of scientific computing should be quantified with respect to the intrinsic software related metrics of Re-usability and Re-runability.
2.4.6. Premise 6#
Quality software is composed from many specialised instances of a few generic patterns.
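One possible reading of Premise 6 is sketched below: a single generic pattern (here a fold, built on the standard library's `functools.reduce`) is specialised several times by supplying a different combining operation and starting value. The helper names `fold`, `total`, `product` and `largest` are hypothetical and chosen only for this illustration.

```python
import operator
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def fold(combine: Callable[[T, T], T], initial: T, values: Iterable[T]) -> T:
    """Generic pattern: collapse a sequence into one value with a binary operation."""
    return reduce(combine, values, initial)


# Specialised instances of the same pattern.
def total(values: Iterable[float]) -> float:
    return fold(operator.add, 0.0, values)


def product(values: Iterable[float]) -> float:
    return fold(operator.mul, 1.0, values)


def largest(values: Iterable[float]) -> float:
    return fold(max, float("-inf"), values)


data = [2.0, 3.0, 4.0]
print(total(data), product(data), largest(data))  # prints 9.0 24.0 4.0
```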
2.5. Directed Reading for Session 1#
The remainder of this session will be given over to “brushing up” on your python skills to help you access the remainder of the course.
However, between now and Session 2 (see course schedule in the module outline), you are required to study the following
the remaining material for Session 1 provided below (starting with the 5Rs section)
chapter 12 (or at least pages 228 to 242) in this book
Note that it will be difficult for you to access the ideas and material in Session 2 without reading through this material.
2.6. Problem Solving with Python#
2.6.1. Install pyenv and python#
2.6.1.1. Exercise#
For this exercise you may want to refer to the Introduction to Unix notes.
Use wget to download the shell script http://www.mathsbox.com/stor-609/scripts/setup-pyenv.sh
Execute the downloaded shell script (you will be asked for your administrator password)
Make sure pyenv is active using
source ~/start-pyenv
2.6.1.2. Exercise#
Install jupyter notebook and spyder on your computer. One way of doing this is
python -m pip install notebook
python -m pip install spyder
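Once the installation has finished, a quick check from within Python can confirm which interpreter is running and whether the two packages can be found. The following is a minimal sketch; the function name `check_environment` is hypothetical, and the check only looks for the top-level packages `notebook` and `spyder` without starting either application.

```python
import importlib.util
import sys


def check_environment(packages=("notebook", "spyder")):
    """Return the running Python version and whether each package can be located."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If either package is reported as `False`, it is worth checking that pyenv is active in the current shell before repeating the install.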
2.6.1.3. Exercise#
For the remainder of this session choose some problems from The PythonChallenge and/or Project Euler and try solving them using Python.
The objectives for this exercise are
check you can start python on your unix system - see the Introduction to Unix for help with unix
check you can use jupyter notebook and/or spyder
remind yourself of some of the python you learnt on the Introductory Python
have some fun !!
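To give a flavour of the kind of problem involved, the sketch below tackles the first Project Euler problem (summing the natural numbers below a limit that are multiples of 3 or 5) and checks it against the small case quoted in the problem statement. The function name `sum_of_multiples` and its parameters are assumptions made for this example; it is one straightforward approach, not the only or the most efficient one.

```python
def sum_of_multiples(limit, divisors=(3, 5)):
    """Sum the natural numbers below `limit` divisible by at least one divisor."""
    return sum(
        n for n in range(1, limit)
        if any(n % d == 0 for d in divisors)
    )


# The problem statement gives the case below 10: 3 + 5 + 6 + 9 = 23.
print(sum_of_multiples(10))  # prints 23
```

Calling the function with a larger limit answers the problem as posed; trying that, and then looking for a version that avoids the loop altogether, is a reasonable warm-up.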
2.7. The 5Rs#
“Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort.”
Wilson et al. 2014. “Best Practices for Scientific Computing”.
Most practitioners would broadly agree with the preceding claim. Consequently, there is a large body of high quality and readily available work providing guidance to programmers with a scientific background. Most of these focus almost entirely on the details of how to do this e.g. commenting and documenting code, choosing sensible variable names, organising code across multiple files - even rules for how to lay your code out; the list goes on. However, they tend not to discuss why this should be done.
In contrast, this course first sets out to examine what it is that determines the quality of scientific software and then uses this as a basis for exploring how to achieve this quality. The rationale for this approach is simple - the desired qualities for the software do not change over time but, as we are all aware, the technology we use to create and maintain software does change and often very rapidly. Better then to fully understand the basic requirements so that we can translate them into ways of working with the current technology and also be able to adapt our working practices as new technology emerges.
The key qualities associated with scientific software are
Replicability
Reproducibility
Repeatability
Re-usability
Re-runability
These will be referred to as The 5Rs.
Of course, to make any progress in employing these concepts in practice, it makes sense to have a good working definition for each one. It turns out that this is a little more difficult than might be first thought. However, a good starting point is to consider these concepts in terms of the more traditional scientific setting of a laboratory. The following basic description of what it means to do each of the 5Rs is taken from Goble 2016.
2.7.1. Re-run#
Variations on experiment and setup
“Let’s re-run the experiment, but this time we will test people in France instead of Germany, and sample twice as many individuals.”
2.7.2. Repeat#
Same experiment, same setup
“Let’s repeat our experiment to check we followed the procedure correctly”
2.7.3. Replicate#
Same experiment, same setup, independent lab
“Let’s repeat the experiment of Fleischmann and Pons, but using our own equipment and researchers.”
2.7.4. Reproduce#
Variations of experiment, independent lab
“Let’s adapt the experiment of Fleischmann and Pons to see if we can get it to work using our own equipment and researchers.”
2.7.5. Reuse#
Variations of experiment, independent lab
“Let’s use John F. Shepard’s maze techniques but using cats instead of rats and using our own maze design”
2.8. The 5Rs - Roadmap and Signposts#
There are many techniques, tools and practices that can be adopted to guide and assist in realising good quality scientific software when assessed with respect to the 5Rs. The table below might serve as a reasonable starting point for providing a road map for achieving this.
| Quality | Description | Methods | Documentation |
|---|---|---|---|
| Reusable | code can be easily adapted to variations in problem specification | design patterns, packages / libraries | in-code documentation / API documentation |
| Re-runnable | someone else can run the code | packages / libraries / repositories | installation instructions and scripts |
| Repeatable | same results over time | unit tests / maintenance / version control | use cases |
| Reproducible | same results for a given problem | unit tests | case studies / articles / reports |
| Replicable | algorithm / solution can be recoded by someone else | pseudo code | articles / books / reports |
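As a small illustration of the "Repeatable" row, the sketch below pairs a randomised calculation with two unit-test style checks: one that the same inputs give the same result on a second run, and one that the result is plausible. The function `estimate_pi`, the seed value and the tolerance are all assumptions chosen for this example; the point is only that fixing the random seed and recording a check makes "same results over time" something that can be tested rather than hoped for.

```python
import math
import random


def estimate_pi(n_samples, seed):
    """Monte Carlo estimate of pi using a privately seeded random generator."""
    rng = random.Random(seed)  # own generator, so global random state is untouched
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples


def test_estimate_is_repeatable():
    # Same seed and sample size must give exactly the same answer.
    assert estimate_pi(10_000, seed=609) == estimate_pi(10_000, seed=609)


def test_estimate_is_plausible():
    # A loose tolerance: this checks for gross errors, not accuracy.
    assert abs(estimate_pi(10_000, seed=609) - math.pi) < 0.1


if __name__ == "__main__":
    test_estimate_is_repeatable()
    test_estimate_is_plausible()
    print("all checks passed")
```

Functions named `test_...` like these can also be collected by a test runner such as pytest, and keeping them under version control alongside the code lets the same checks be repeated after every change.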
2.9. Best Practice#
The following is reproduced from Best Practices for Scientific Computing.
2.10. Further Reading and References#
The following works are referenced in the presentation and make for some interesting and thought provoking reading.
Best Practices for Scientific Computing
Developing Scientific Software
Reproducible Research for Scientific Computing