Blog
2024
Scientific software plays a crucial role in research. When your software is used to analyze data, simulate models, or derive scientific conclusions, ensuring its correctness becomes critical. A small bug can have a significant impact on your results, potentially invalidating months of work or, worse, causing the retraction of published research. Fortunately, software testing can help minimize such risks, giving you more confidence in your code and a greater chance of catching issues early. In this guide, we’ll walk through key types of software tests, practical advice for using popular testing tools like pytest and doctest, and how you can incorporate these into your scientific development workflow.
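As a small taste of the doctest approach mentioned above, here is a minimal sketch using only the Python standard library. The function `celsius_to_kelvin` and the test `test_celsius_to_kelvin_round_trip` are hypothetical examples invented for illustration, not code from the post itself. The examples in the docstring double as tests, and the separate `test_` function follows the naming convention that pytest collects by default.

```python
import doctest


def celsius_to_kelvin(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin.

    The examples below are executed by doctest and compared
    against the output shown.

    >>> celsius_to_kelvin(0.0)
    273.15
    >>> celsius_to_kelvin(-273.15)
    0.0
    """
    return celsius + 273.15


def test_celsius_to_kelvin_round_trip():
    # A plain assert-based test; pytest would discover this function
    # by its "test_" prefix, but it also runs as ordinary Python.
    assert celsius_to_kelvin(100.0) - 273.15 == 100.0


if __name__ == "__main__":
    # Run every docstring example in this module and report a summary.
    print(doctest.testmod())
    test_celsius_to_kelvin_round_trip()
```

Keeping the examples in the docstring means the documentation and the checks cannot silently drift apart, which is one reason doctest can suit small scientific helper functions.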
Test coverage is a crucial aspect of software development that helps ensure your code is reliable and bug-free. By measuring how much of your code is covered by tests, you can identify untested areas and improve overall quality. In this post, we’ll dive into what test coverage is, why it matters, and explore some tools for measuring it. Let’s get started!
Graph databases can offer a more natural and intuitive way to model and explore relationships within data. In this post, we’ll dive into the world of graph databases, focusing on Kùzu, an embedded graph database and query engine for a number of languages, and Cypher, a powerful query language designed for graph data. We’ll explore how these tools can transform your data management and analysis workflows, provide insights into their capabilities, and discuss when it might be more appropriate to use server-based solutions. Whether you’re a research software developer looking to integrate advanced graph processing into your applications or simply curious about the benefits of graph databases, this guide will equip you with the knowledge to harness the full potential of graph data.
Apache Parquet is a columnar, strongly typed tabular data storage format built for scalable processing and compatible with many data models, programming languages, and software systems. Parquet files (usually denoted with a .parquet filename extension) are typically compressed within the format itself and are often used in embedded or cloud-based high-performance scenarios. The format has grown in popularity since it was introduced in 2013 and is used as a core data storage technology in many organizations. This article will introduce the Parquet format from a research data engineering perspective.
Writing software often entails using code from other people to solve common challenges and take advantage of existing work. External software used by a specific project can be called a “dependency” (the software “depends” on that external work to accomplish tasks). Collections of software are oftentimes made available as “packages” through various platforms. Package management for dependencies, the task of managing collections of dependencies for a specific project, is a specialized area of software development that can involve the use of unique tools and files. This article will cover package dependency management through special files generally referred to as “lockfiles”.
Have you ever run Python code only to find it taking forever to complete or sometimes ending abruptly with an error like: 123456 Killed or killed (program exited with code: 137)? You may have experienced memory resource or management challenges associated with these scenarios. This post will cover some computer memory definitions, how Python makes use of computer memory, and share some tools which may help with these types of challenges.
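One standard-library tool that may help with this kind of investigation is `tracemalloc`, which tracks memory allocated by Python code. The following is a minimal sketch, and the function `build_squares` is a hypothetical workload invented for illustration. It measures the current and peak traced memory around a single allocation-heavy call.

```python
import tracemalloc


def build_squares(n: int) -> list[int]:
    # Hypothetical workload: builds a list of n integers in memory.
    return [i * i for i in range(n)]


# Begin tracking allocations made by Python code.
tracemalloc.start()

squares = build_squares(100_000)

# get_traced_memory() returns (current, peak) sizes in bytes
# for allocations made since tracemalloc.start().
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes / 1e6:.2f} MB, peak: {peak_bytes / 1e6:.2f} MB")
```

The exact numbers depend on the Python version and platform, so treat them as relative indicators. A peak that grows with the input size often points to a place where data could be processed in chunks instead.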
2023
Thanksgiving is a holiday practiced in many countries which focuses on gratitude for good harvests of the preceding year. In the United States, we celebrate Thanksgiving on the fourth Thursday of November each year often by eating meals we create together with others. This post channels the spirit of Thanksgiving by giving our thanks through code as a “Codesgiving”, acknowledging and creating better software together.
Data-oriented software development can benefit from a specialized focus on varying aspects of data quality validation. We can use software testing techniques to validate certain qualities of the data in order to meet a declarative standard (where one doesn’t need to guess or rediscover known issues). These come in a number of forms and generally follow existing software testing concepts which we’ll expand upon below. This article will cover a few tools which leverage these techniques for addressing data quality validation testing.
Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following conventions for packaging helps your software work become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post will cover preparation aspects of packaging, readying software work for wider distribution.
This post is intended to help demonstrate the use of Python on Alpine, a High Performance Compute (HPC) cluster hosted by the University of Colorado Boulder’s Research Computing. We use Python here by way of Anaconda environment management to run code on Alpine. This readme will cover a background on the technologies and how to use the contents of an example project repository as though it were a project you were working on and wanting to run on Alpine.
There are many routine tasks which can be automated to help save time and increase reproducibility in software development. GitHub Actions provides one way to accomplish these tasks using code-based workflows and related workflow implementations. This type of automation is commonly used to perform tests, builds (preparing for the delivery of the code), or delivery itself (sending the code or related artifacts where they will be used).
Git provides a feature called branching which facilitates parallel and segmented programming work through commits with version control. Using branching enables both work concurrency (multiple people working on the same repository at the same time) and a chance to isolate and review specific programming tasks. This article covers some conceptual best practices with branching, reviewing, and merging code using GitHub.
This article covers using the software technique of linting on R code in order to improve code quality, development velocity, and collaboration.
Programming often involves long periods of problem solving which can sometimes lead to unproductive or exhausting outcomes. This article covers one way to avoid less productive time expense or protect yourself from overexhaustion through a technique called “timeboxing” (also sometimes referenced as “timeblocking”).
Software documentation is sometimes treated as a less important or secondary aspect of software development. Treating documentation as code allows developers to version control the shared understanding and knowledge surrounding a project. Leveraging this paradigm also enables the use of tools and patterns which have been used to strengthen code maintenance. This article covers one such pattern: linting, or static analysis, for documentation treated like code.
2022
The act of creating software often involves many iterations of writing, personal collaborations, and testing. During this process it’s common to lose awareness of code which is no longer used, and thus may not be tested or otherwise linted. Unused code may contribute to “software decay”, the gradual diminishment of code quality or functionality. This post will cover software decay and strategies for addressing unused code to help keep your code quality high.
Apache Arrow is a language-independent and high-performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high-performance way to work with Arrow data where otherwise one may be forced to use language-specific data structures or transforms.
Diagrams can be a useful way to illuminate and communicate ideas. Free-form drawing or drag and drop tools are one common way to create diagrams. With this tip of the week we introduce another option: diagrams as code (DaC), or creating diagrams by using code.
Have you ever found yourself spending hours formatting your code so it looks just right? Have you ever caught a duplicative import statement in your code? We recommend using open source linting tools to help avoid common issues like these and save time.