Software Engineering Team CU Dept. of Biomedical Informatics

Tip of the Week: Python Packaging as Publishing

Each week we seek to provide a software tip of the week geared towards helping you achieve your software goals. Views expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out to #software-engineering on Slack or email DBMISoftwareEngineering at olucdenver.onmicrosoft.com

Python packaging is the craft of preparing for and achieving distribution of your Python work to wider audiences. Following packaging conventions helps your software become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post covers the preparation side of packaging: readying software for wider distribution.

TLDR (too long; didn’t read)

Use Pythonic packaging tools and techniques to help avoid code decay and unwanted code smells and increase your development velocity. Increase understanding with unsurprising directory structures like those exhibited in pypa/sampleproject or scientific-python/cookie. Enhance trust by being authentic on source control systems like GitHub (by customizing your profile), staying up to date with the latest supported versions of Python, and using security linting tools like PyCQA/bandit through visible + automated GitHub Actions ✅ checks. Connect your projects to others using CITATION.cff files, CONTRIBUTING.md files, and using environment + packaging tools like poetry to help others reproduce the same results from your code.

Why practice packaging?

How are a page with some text and a book different?

Python packaging is similar to publishing a book. Consider how a loose page of text differs from a book. How and why are these things different?

Code undergoing packaging to achieve understanding, trust, and connection for an audience.

These can be thought of as metaphors when it comes to packaging in Python. Books have a smell which sometimes comes from how they were stored, treated, or maintained. While there are pleasant book smells, books might also smell soggy from being left in the rain or stored without maintenance for too long. Just like books, software can have negative code smells that indicate a lack of care or a less sustainable condition. Following good packaging practices helps to avoid unwanted code smells while increasing development velocity, maintainability through understandability, trustworthiness of the content, and connection to other projects.

Note: these techniques can also work just as well for inner source collaboration (private or proprietary development within organizations)! Don’t hesitate to use these on projects which may not be public facing in order to make development and maintenance easier (if only for you).

“Wait, what are Python packages?”

my_package/
│   __init__.py
│   module_a.py
│   module_b.py

A Python package is a collection of modules (.py files) that usually includes an “initialization file”, __init__.py. This post covers the craft of packaging, which can involve one or many packages.
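To make the definition concrete, here is a minimal sketch that writes a tiny my_package directory to a temporary location and imports it. The greet function and the helper names are hypothetical, chosen only for illustration; real packages live in your project directory rather than a temporary one.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package named my_package under root (illustrative only)."""
    package_dir = root / "my_package"
    package_dir.mkdir()
    # __init__.py marks the directory as a package and can re-export names.
    (package_dir / "__init__.py").write_text("from .module_a import greet\n")
    # module_a.py is an ordinary module holding a hypothetical function.
    (package_dir / "module_a.py").write_text(
        "def greet(name):\n    return f'Hello, {name}!'\n"
    )


def load_greeting(name: str) -> str:
    """Create the demo package in a temporary directory, import it, and call it."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            package = importlib.import_module("my_package")
            return package.greet(name)
        finally:
            sys.path.remove(tmp)
            # Drop cached modules so repeated calls start fresh.
            cached = [
                m for m in sys.modules
                if m == "my_package" or m.startswith("my_package.")
            ]
            for module_name in cached:
                del sys.modules[module_name]


if __name__ == "__main__":
    print(load_greeting("packaging"))
```

Because __init__.py re-exports greet, callers can write my_package.greet(...) without knowing which module defines it.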

Understanding: common directory structures

project_directory
├── README.md
├── LICENSE.txt
├── pyproject.toml
├── docs
│   └── source
│       └── index.md
├── src
│   └── package_name
│       ├── __init__.py
│       └── module_a.py
└── tests
    ├── __init__.py
    └── test_module_a.py

Python packaging today generally assumes a specific directory design. Following this convention makes your code easier for others to understand. We’ll cover each of these below.

Project root files

project_directory
├── README.md
├── LICENSE.txt
├── pyproject.toml
│ ...

Project sub-directories

project_directory
│ ...
├── docs
│   └── source
│       └── index.md
├── src
│   └── package_name
│       ├── __init__.py
│       └── module_a.py
└── tests
    ├── __init__.py
    └── test_module_a.py
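
If you would like to check an existing project against this layout, a small script can report what is missing. This is a rough sketch, assuming the top-level names from the example tree above; the function name and the exact list of expected paths are our own choices, not a packaging standard.

```python
import tempfile
from pathlib import Path

# Top-level paths from this post's example layout, relative to the project root.
EXPECTED_PATHS = ("README.md", "LICENSE.txt", "pyproject.toml", "docs", "src", "tests")


def missing_layout_paths(project_root: Path) -> list[str]:
    """Return the expected paths that do not exist under project_root."""
    return [name for name in EXPECTED_PATHS if not (project_root / name).exists()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that only has a README and src/.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "README.md").write_text("# demo\n")
        (root / "src").mkdir()
        print(missing_layout_paths(root))
```

An empty list means every expected path is present; anything else is a prompt to add the file or directory (or to decide deliberately that the project does not need it).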

Common directory structure examples

The directory structure described above can be seen in the wild in projects such as pypa/sampleproject and scientific-python/cookie. These can serve as great references for starting or adjusting your own work.

Trust: building audience confidence

How much does your audience trust your work?

Building an understandable body of content helps tremendously with audience trust. What else can we do to enhance project trust? The following elements can help improve an audience’s trust in packaged Python work.

Source control authenticity

Comparing the difference between a generic or anonymous user and one with greater authenticity.

Be authentic! Fill out your profile to help your audience know who the author is and why you do what you do. GitHub’s documentation on customizing your profile covers how to do this. It may seem irrelevant, but it can go a long way toward making technical work more relatable.

Staying up to date with supported Python releases

Major Python releases and their support status.

Use Python versions which are supported (this changes over time). End-of-life Python versions may be difficult to support and are a sign of code decay for projects. Specify the Python versions which are compatible with your project by using environment specifications such as pyproject.toml files and related packaging tools (more on this below).
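
Alongside a declared version in pyproject.toml, a project can also check the running interpreter at startup. The sketch below assumes a hypothetical minimum of Python 3.9; the constant and function names are ours, and the right minimum depends on which releases are supported when you read this.

```python
import sys

# Hypothetical minimum; choose the oldest release your project actually supports.
MINIMUM_PYTHON = (3, 9)


def is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not is_supported():
        required = ".".join(str(part) for part in MINIMUM_PYTHON)
        raise SystemExit(f"This project expects Python {required} or newer.")
    print("Python version OK")
```

Comparing tuples rather than version strings avoids the classic mistake where "3.10" sorts before "3.9" alphabetically.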

Security linting and visible checks with GitHub Actions

Make an effort to inspect your package for known security issues.

Use security vulnerability linters to help prevent undesirable or risky processing for your audience. Doing this is both practical for avoiding issues and a signal that you care about those using your package!
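
As one illustration of the kind of pattern a security linter such as bandit reports, consider calling the built-in eval on text that comes from outside your program. The sketch below shows a safer alternative for the common case of parsing simple values; parse_setting is a hypothetical helper, and the exact findings you see will depend on your linter version and configuration.

```python
import ast


def parse_setting(raw: str):
    """Parse a Python literal (number, string, list, dict, ...) from text.

    ast.literal_eval only accepts literal syntax, so text containing a
    function call raises ValueError instead of being executed, which is
    the risk when eval() is used on the same text.
    """
    return ast.literal_eval(raw)


if __name__ == "__main__":
    print(parse_setting("[1, 2, 3]"))
    try:
        parse_setting("__import__('os').getcwd()")
    except ValueError:
        print("rejected non-literal input")
```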

The green checkmark from successful GitHub Actions runs can offer a sense of reassurance to your audience.

Combining GitHub Actions with security linters and tests from your software validation suite adds an observable ✅ to your project. This shows your audience that you’re transparently testing and sharing the results of those tests.

Connection: personal and inter-package relationships

How does your package connect with other work and people?

Understandability and trust set the stage for your project’s connection to other people and projects. What can we do to facilitate connection with our project? Use the following techniques to help enhance your project’s connection to others and their work.

Acknowledging authors and referenced work with CITATION.cff

Add a CITATION.cff file to your project root in order to describe project relationships and acknowledgements in a standardized way. The CFF format is also GitHub compatible, making it easier to cite your project.
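
CITATION.cff files are plain YAML text, so it can help to see how little is needed to get started. The sketch below renders a minimal file with the keys we understand version 1.2.0 of the format to require (cff-version, message, title, and authors). The function name and the example values are placeholders, and a dedicated CFF tool is a better choice for anything beyond a first draft.

```python
def render_citation_cff(title: str, family_names: str, given_names: str) -> str:
    """Return the text of a minimal CITATION.cff file.

    Values are written as double-quoted YAML strings; this sketch assumes
    they contain no double quotes or backslashes.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        "authors:",
        f'  - family-names: "{family_names}"',
        f'    given-names: "{given_names}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; replace them with your project's details.
    print(render_citation_cff("package-name", "Doe", "Jane"), end="")
```

Saving the returned text as CITATION.cff in the project root is all that is needed for the file to sit alongside README.md and LICENSE.txt.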

Reaching collaborators using CONTRIBUTING.md

CONTRIBUTING.md documents can help you collaborate with others.

Add a CONTRIBUTING.md file to your project root to make clear the project’s support details, development guidance, code of conduct, and overall documentation on how the project is governed.

Environment management reproducibility as connected project reality

Environment and packaging managers can help you connect with your audience.

Code without an environment specification is difficult to run in a consistent way. This can lead to “works on my machine” scenarios where different things happen for different people, reducing the chance that people can connect with a shared reality for how your code should be used.

“But why do we have to switch the way we do things?” We’ve always been switching approaches (software approaches evolve over time)! A brief history of Python environment and packaging tooling:

  1. distutils, easy_install + setup.py
    (primarily used during the 1990s to early 2000s)
  2. pip, setup.py + requirements.txt
    (primarily used during the late 2000s to early 2010s)
  3. poetry + pyproject.toml
    (in use from around the late 2010s, ongoing)

Using Python poetry for environment and packaging management

Poetry is one Pythonic environment and packaging manager which can help increase reproducibility using pyproject.toml files. It is one of several options; alternatives include hatch and pipenv.

poetry directory structure template use
user@machine % poetry new --name=package_name --src .
Created package package_name in .

user@machine % tree .
.
├── README.md
├── pyproject.toml
├── src
│   └── package_name
│       └── __init__.py
└── tests
    └── __init__.py

After installation, Poetry gives us the ability to initialize a directory structure similar to what we presented earlier by using the poetry new ... command. If you’d like a more interactive version of the same, use the poetry init command to fill out various sections of your project with detailed information.

poetry format for project pyproject.toml
# pyproject.toml
[tool.poetry]
name = "package-name"
version = "0.1.0"
description = ""
authors = ["username <email@address>"]
readme = "README.md"
packages = [{include = "package_name", from = "src"}]

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Using the poetry new ... command also initializes the content of our pyproject.toml file with opinionated details (following the recommendation from earlier in the article regarding declared Python version specification).
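
Because pyproject.toml is ordinary TOML, its contents can be read programmatically, which is handy for scripts that need the package name or the declared Python constraint. The sketch below parses a trimmed copy of the file shown above; summarize_pyproject is a hypothetical helper, and the fallback import assumes the third-party tomli package is installed on interpreters older than Python 3.11.

```python
try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    # Assumption: the tomli backport is installed on older interpreters.
    import tomli as tomllib

# A trimmed copy of the pyproject.toml shown above.
PYPROJECT_TEXT = """
[tool.poetry]
name = "package-name"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
"""


def summarize_pyproject(text: str) -> dict:
    """Pull a few commonly needed fields out of Poetry-style pyproject text."""
    config = tomllib.loads(text)
    poetry = config["tool"]["poetry"]
    return {
        "name": poetry["name"],
        "version": poetry["version"],
        "python": poetry["dependencies"]["python"],
        "build_backend": config["build-system"]["build-backend"],
    }


if __name__ == "__main__":
    print(summarize_pyproject(PYPROJECT_TEXT))
```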

poetry dependency management
user@machine % poetry add pandas

Creating virtualenv package-name-1STl06GY-py3.9 in /pypoetry/virtualenvs
Using version ^2.1.0 for pandas

...

Writing lock file

We can add dependencies directly using the poetry add ... command. This command also provides the possibility of using a group flag (for example poetry add pytest --group testing) to help organize and distinguish multiple sets of dependencies.

Running Python from the context of poetry environments
% poetry run python -c "import pandas; print(pandas.__version__)"

2.1.0

We can run commands inside the project’s virtual environment using the poetry run ... command.
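
The same kind of check can be done without importing the package itself by asking the standard library what is installed in the active environment. This is a small sketch; installed_version is a hypothetical helper name, and running it through poetry run makes it report on the project’s environment rather than your system Python.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("pandas", "pytest"):
        print(name, installed_version(name))
```

Returning None instead of raising keeps the script usable as a quick report even when some dependencies have not been installed yet.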

Building source code with poetry
% pip install git+https://github.com/project/package_name

Even if we don’t reach wider distribution on PyPI or elsewhere, source code managed by pyproject.toml and poetry can be used for “manual” distribution (with reproducible results) from GitHub repositories. When we’re ready to distribute pre-built packages on other networks we can also use the following:

% poetry build

Building package-name (0.1.0)
  - Building sdist
  - Built package_name-0.1.0.tar.gz
  - Building wheel
  - Built package_name-0.1.0-py3-none-any.whl

Poetry readies source (sdist) and pre-built (wheel) versions of our code for distribution platforms like PyPI by using the poetry build ... command. We’ll cover more on these files and distribution steps in a later post!
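
The wheel filename in the build output encodes compatibility information: package_name-0.1.0-py3-none-any.whl reads as distribution name, version, Python tag, ABI tag, and platform tag. The sketch below splits a filename into those parts; it is a simplified illustration with names of our own choosing, and it only handles the standard five-part form plus the optional build tag.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)


if __name__ == "__main__":
    print(parse_wheel_filename("package_name-0.1.0-py3-none-any.whl"))
```

For the wheel built above, the tags py3, none, and any indicate a pure-Python wheel that is not tied to a specific interpreter ABI or operating system.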
