Software Engineering Tools, Principles and Practices for Scientists
Table of Contents
Software engineering is a discipline of principles and practices designed to improve people’s ability to create and manage complex software projects and collaborate with others on developing them. People have built software tools to assist with these principles and practices. This article focuses on exploring the essential practices and principles, and related tools, for scientists who wish to write code that other people can understand, use, and collaborate on developing. For those who wish to take a deeper dive into software engineering, the Software Engineering book1 covers the discipline in detail.
Tools and Technologies
Operating system sits at the core of all software projects. Software projects run on operating systems and are developed on operating systems using tools that run on operating systems. If you can choose, we recommend using a Linux based operating system due to its programmer-friendly design. Installing programming-related tools is easier on Linux, and we can run operations using text commands in the terminal. For getting started with Linux, we recommend Ubuntu, a versatile and easy-to-install desktop environment. However, Windows or Mac also works.
An essential feature of an operating system is the filesystem. Filesystem enables us to store data in files and create directories for grouping files into separate collections. Directories are hierarchical, that is, we can create directories inside other directories. We also need to distinguish between two types of files; text files and binary files. Text files are human-readable files containing only text, and we can open them using a text editor. Binary files machine-readable files stored in binary format, and therefore require specific software for opening them. An executable file is a binary file that executes a program when opened. Filename extensions, such as
Understanding the filesystem is essential because fundamentally, a software project is a directory named after the project that contains files and directories specific to that project. Most of the contained files are special text files referred to as source code. To edit source code, we can use a code editor. Code editors offer tools that assist in writing source code, such as syntax highlighting, code analysis, and other integrated tools. We may also refer to them as integrated development environments or IDEs for short. We recommend Visual Studio Code or JetBrains' editors.
With version control, we can track changes that we make to the contents in the project directory as we start developing the project. We refer the project directory initialized with version control as a repository. The defacto software for version control is Git. Git allows creating repositories, recoding changes, tagging specific points in a repository’s history, branching, and collaboration. Branching refers to diverging from the main line of development and continue to work without messing up the work in the mainline. The Pro Git book covers details on everything Git related. 2
Services can host repositories on the internet, which makes collaboration possible. We refer to these repositories as remote repositories. The most popular service for hosting Git repositories is GitHub. Github also has features for issue tracking, code review, and hosting documentation.
Also, there exist graphical user interfaces for Git. They help to interact with Git, visualizing repository changes, and change history. We recommend using GitKraken. Also, many code editors and IDEs bundle a user interface for Git.
Principles and Practices
We begin a software project by choosing a suitable programming language for writing the source code. When choosing the programming language, we need to consider its syntax, type system, program execution, and supported programming paradigms. Some programming languages are restricted to a single paradigm, while others support multiple paradigms. The programming language dictates how we express problems in the code, which tools and libraries are available, and what kind of performance we can expect from our program.
Here are use cases of some popular programming paradigms:
Object-oriented programming is based on objects which have attributes and methods which can modify the attributes. In many object-oriented languages, objects can inherit attributes and methods from other objects, thus making them suitable for solving problems with an inherent hierarchical structure. Many everyday programming tasks are intuitive and easy to write using object-oriented languages, but they have limitations in expressing mathematical structures and logic.
Functional programming is suitable for solving problems that are inherently recursive since all programs are constructed by applying and composing functions. Purely functional languages enforces all functions strictly to be deterministic mathematical functions, that is, pure functions, which means that for given argument they always return the same result. Purity restricts side effects and makes programs more robust. Functional languages are regarded as more challenging to write because they require thinking problems more deeply.
Logic programming and constraint programming are used to express facts and rules about some problem domain. For example, languages for expressing satisfiability problems or different types of optimization problems belong to this paradigm.
In theory, we can use all universal programming languages to solve the same problems. However, in practice, it can be more natural to express a particular problem in one language than another. Here are some examples of popular programming languages.
Python is dynamically typed, interpreted, object-oriented programming language. It is easy to write and understand, has a large, active community and an extensive collection of third-party packages. Python is an excellent option for general programming, scripting, or getting started with programming.
Julia is a dynamically and optionally typed, just-in-time compiled, high-performance programming language which uses multiple-dispatch as a paradigm. Additionally, it allows the use of Unicode symbols. Julia is an excellent option for scientific computing and applied mathematics.
Haskell is an advanced, purely functional programming language. It is an excellent option for getting started with functional programming.
C++ is a statically compiled, multi-purpose programming language which allows low-level and high-level programming. It is a common choice for programming tasks that require access to the hardware level.
As the project starts growing, it becomes necessary to verify that the program functions correctly. Unit testing is one method of verifying correctness. Unit testing refers to the testing of individual components of the program by writing test cases for each of its constituents. We can also measure the percentage of the source code being tested by the unit tests, which we refer to as code coverage. Typically, programming languages include libraries and conventions for performing and writing unit tests and measuring the code coverage.
Another essential part of software engineering is to communicate to others how our project works. One way to communicate this knowledge is to create a documentation of the project. Documentation includes two components:
General documentation. Describes the functionality of the software project in general terms using text, figures, math, and other elements.
API documentation. In most programming languages, we can document constituents in the source code using document strings. A document string, for example, may describe a function and its inputs and outputs. A documentation software can then fetch document strings from the source code to create the API documentation.
Each programming language usually has a documentation library and documenting conventions. Typically, the documentation is written in textfiles using a markup language, as specified by the documentation library.
When releasing software to the public, we need software versioning. Versioning refers to assigning unique version numbers in increasing order to mark new developments in the software. It is necessary for ensuring compatibility with other software. We call these version numbers release versions. They follow semantic versioning where we write version numbers using convention
major.minor.batch. 3 Versioning on different programming languages is done using the standard library or with the aid of a third-party library.
Modern software needs to adapt to a changing environment continually. It has lead modern software development to adopt agile development principles. Agile development interleaves specification, design, and implementation of software features. In agile, we develop software in small increments and short release cycles.
Each release cycle consists of multiple feature cycles in which each develops a feature into the software. Multiple features can be developed in parallel when multiple developers are collaborating on a project. We describe two feature cycles, one for working alone and one for collaborating with others. Furthermore, each feature cycle consists of a sequence of coding cycles.
Each coding cycle consists of the following steps:
- Write code, tests, and documentation.
- Run tests and build the documentation locally.
- Once the tests pass, and documentation builds without errors, commit the changes into version control.
We implement a part of the feature, such as a function or a structure, during each coding cycle. We can either create documentation and test for the parts immediately or wait until the whole feature is ready.
master commit ↓ commit # ┐ ⇣ # │ feature commit # ┘
A simple feature cycle without branching proceeds as follows:
- Repeat the coding cycle until the feature is ready.
- Push the commit into the remote repository.
When working alone, we can make all the changes in the master branch since no one else can push a conflicting commit during development. However, it is not feasible to collaborate effectively without branching.
master | feature commit | # branch ↓ | ↘ | commit ⇣ | ⇣ | commit ↓ | ↙ commit | # merge
A simple feature cycle with branching proceeds as follows:
- Create a new branch and name it after the feature.
- Repeat the coding cycle until the feature is ready.
- Merge the feature branch to the master branch.
- Push the commit into the remote repository.
When collaborating with others, it is useful to create a feature branch that allows development without conflicting commits on the branch. We have to resolve the conflicts only once we merge the feature back to the master branch. It is good practice to keep the features as small and contained as possible to avoid having to resolve lots of merge conflicts. It also makes merging easier for other developers.
Each release cycle releases a new version of the software. It proceeds as follows:
- We repeat the feature cycle until we have completed all desired features.
- Update version strings in the source code and configuration files.
- Commit the changes.
- Tag the commit with the version string.
- Push the tagged commit into the remote repository.
Now, the release tag is visible in the repository’s commit history.
Finally, we set up a continuous integration service to monitor changes in our remote repository. Pushing a commit to the remote repository invokes the continuous integration service to run automated processes such as building the software, running tests, and building documentation. If the commit is tagged with a version number, indicating a new version release, it may invoke additional processes such as distributing the software and documentation to users.
I wrote this article based on my experiences of applying software engineering principles, practices, and tools in developing software packages for scientific projects.
If you enjoyed or found benefit from this article, it would help me share it with other people who might be interested. If you have feedback, questions, or ideas related to the article, you can write to my GitHub Discussions forum.