Tip of the Week: Python Packaging as Publishing
Each week we seek to provide a software tip of the week geared towards helping you achieve your software goals. Views expressed in the content belong to the content creators and not the organization, its affiliates, or employees. If you have any software questions or suggestions for an upcoming tip of the week, please don’t hesitate to reach out to #software-engineering on Slack or email DBMISoftwareEngineering at olucdenver.onmicrosoft.com
Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following packaging conventions helps your software become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post covers the preparation side of packaging: readying software work for wider distribution.
TLDR (too long, didn’t read);
Use Pythonic packaging tools and techniques to help avoid code decay and unwanted code smells and increase your development velocity. Increase understanding with unsurprising directory structures like those exhibited in pypa/sampleproject or scientific-python/cookie. Enhance trust by being authentic on source control systems like GitHub (by customizing your profile), staying up to date with the latest supported versions of Python, and using security linting tools like PyCQA/bandit through visible + automated GitHub Actions ✅ checks. Connect your projects to others using CITATION.cff files, CONTRIBUTING.md files, and environment + packaging tools like poetry to help others reproduce the same results from your code.
Why practice packaging?
The practice of Python packaging is similar to that of publishing a book. Consider how a bag of text is different from a book. How and why are these things different?
- A book has commonly understood sequencing of content (i.e. copyright page, then title page, then body content pages…).
- A book often cites references and acknowledges other work explicitly.
- A book undergoes a manufacturing process which allows the text to be received in many places the same way.
These can be thought of as metaphors when it comes to packaging in Python. Books have a smell which sometimes comes from how they were stored, treated, or maintained. While there are pleasant book smells, a book might also smell soggy from being left in the rain or stored without maintenance for too long. Just like books, software can sometimes have negative code smells indicating a lack of care or a less sustainable condition. Following good packaging practices helps to avoid unwanted code smells while increasing development velocity, maintainability of software through understandability, trustworthiness of the content, and connection to other projects.
Note: these techniques can also work just as well for inner source collaboration (private or proprietary development within organizations)! Don’t hesitate to use these on projects which may not be public facing in order to make development and maintenance easier (if only for you).
“Wait, what are Python packages?”
my_package/
│ __init__.py
│ module_a.py
│ module_b.py
A Python package is a collection of modules (.py files) that usually includes an “initialization file”, __init__.py. This post will cover the craft of packaging, which can include one or many packages.
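To make the idea concrete, here is a minimal sketch that builds a tiny package in a temporary directory and imports a module from it. The module content and the greet function are hypothetical, chosen only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create my_package/ with an __init__.py and one module (hypothetical content)."""
    package_dir = root / "my_package"
    package_dir.mkdir()
    # __init__.py marks the directory as a regular package; it may be empty.
    (package_dir / "__init__.py").write_text("")
    (package_dir / "module_a.py").write_text(
        "def greet(name):\n    return f'Hello, {name}!'\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)  # make the temporary directory importable
    importlib.invalidate_caches()
    module_a = importlib.import_module("my_package.module_a")
    greeting = module_a.greet("reader")
    sys.path.remove(tmp)

print(greeting)  # Hello, reader!
```

The dotted name my_package.module_a mirrors the directory layout: the package is the directory and the module is the file inside it.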
Understanding: common directory structures
project_directory
├── README.md
├── LICENSE.txt
├── pyproject.toml
├── docs
│   └── source
│       └── index.md
├── src
│   └── package_name
│       ├── __init__.py
│       └── module_a.py
└── tests
    ├── __init__.py
    └── test_module_a.py
Python packaging today generally assumes a specific directory design. Following this convention makes your code easier for others to understand. We’ll cover each part of it below.
Project root files
project_directory
├── README.md
├── LICENSE.txt
├── pyproject.toml
│ ...
- The README.md file is a markdown file with documentation including project goals and other short notes about installation, development, or usage. The README.md file is akin to a book jacket blurb which quickly tells the audience what the book will be about.
- The LICENSE.txt file is a text file which indicates licensing details for the project. It often includes information about how it may be used and protects the authors in disputes. The LICENSE.txt file can be thought of like a book’s copyright page. See https://choosealicense.com/ for more details on selecting an open source license.
- The pyproject.toml file is a Python-specific TOML file which helps organize how the project is used and built for wider distribution. The pyproject.toml file is similar to a book’s table of contents, index, and printing or production specification.
Project sub-directories
project_directory
│ ...
├── docs
│   └── source
│       └── index.md
├── src
│   └── package_name
│       ├── __init__.py
│       └── module_a.py
└── tests
    ├── __init__.py
    └── test_module_a.py
- The docs directory is used for in-depth documentation and related documentation build code (for example, when building documentation websites, aka “docsites”). The docs directory includes information similar to a book’s “study guide”, providing content surrounding how to best make use of and understand the content found within.
- The src directory includes primary source code for use in the project. Python projects generally use a nested package directory with modules and sub-packages. The src directory is like a book’s body or general content (perhaps thinking of modules as chapters or sections of related ideas).
- The tests directory includes testing code for validating functionality of code found in the src directory. The above follows pytest conventions. The tests directory is for code which acts like a book’s early reviewers or editors, making sure that if you change things in src the impacts remain as expected.
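As a rough sketch of what a file like tests/test_module_a.py might hold, the example below defines a hypothetical add_numbers function inline so that it runs on its own. In a real project the function would live in src/package_name/module_a.py and be imported by the test file.

```python
# Hypothetical stand-in for a function defined in src/package_name/module_a.py.
def add_numbers(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b


# Hypothetical content for tests/test_module_a.py. In a real project the test
# file would import the function instead of defining it:
#     from package_name.module_a import add_numbers
def test_add_numbers() -> None:
    """pytest collects functions whose names start with test_."""
    assert add_numbers(1, 2) == 3
    assert add_numbers(-1, 1) == 0


if __name__ == "__main__":
    # pytest would normally discover and run the test; calling it directly
    # keeps this sketch runnable without pytest installed.
    test_add_numbers()
    print("tests passed")
```

If add_numbers later changes behavior, the assertions fail and flag the change before it reaches your audience.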
Common directory structure examples
The Python directory structure described above can be seen in the wild in projects such as pypa/sampleproject and scientific-python/cookie. These can serve as a great resource for starting or adjusting your own work.
Trust: building audience confidence
Building an understandable body of content helps tremendously with audience trust. What else can we do to enhance project trust? The following elements can help improve an audience’s trust in packaged Python work.
Source control authenticity
Be authentic! Fill out your profile to help your audience know the author and why you do what you do. See GitHub’s documentation on filling out your profile. Doing this may seem irrelevant but can go a long way toward making technical work more relatable.
- Add a profile picture of yourself or something fun.
- Set your profile description to information which is both professionally accurate and unique to you.
- Show or link to work which you feel may be relevant or exciting to those in your audience.
Staying up to date with supported Python releases
Use Python versions which are supported (this changes over time).
Python versions which are end-of-life may be difficult to support and are a sign of code decay for projects. Specify the version of Python which is compatible with your project by using environment specifications such as pyproject.toml files and related packaging tools (more on this below).
- See here for updated information on Python version status.
- Staying up to date with supported releases oftentimes can result in performance or other similar benefits (later versions usually include improvements!).
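Alongside a declared version in pyproject.toml, a project can also check the running interpreter at runtime. The sketch below assumes a hypothetical minimum of Python 3.9; the MINIMUM_PYTHON constant and the is_supported helper are illustrative names, not part of any packaging tool.

```python
import sys

# Hypothetical minimum version for this project (an assumption for illustration).
MINIMUM_PYTHON = (3, 9)


def is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True when the (major, minor) version meets the declared minimum."""
    return tuple(version_info[:2]) >= minimum


if not is_supported():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        f"is older than the minimum {MINIMUM_PYTHON[0]}.{MINIMUM_PYTHON[1]}"
    )
```

Comparing tuples rather than strings avoids the classic mistake of treating "3.10" as smaller than "3.9".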
Security linting and visible checks with GitHub Actions
Use security vulnerability linters to help prevent undesirable or risky processing for your audience. Doing this is both a practical way to avoid issues and a signal that you care about those using your package!
- PyCQA/bandit: checks Python code
- pyupio/safety: checks Python dependencies
- gitleaks: checks for sensitive passwords, keys, or tokens
Combining GitHub Actions with security linters and tests from your software validation suite can add an observable ✅ for your project. This provides the audience with a sense that you’re transparently testing and sharing results of those tests.
- See GitHub’s documentation on this topic for more information.
- See also the DBMI Software Engineering Team’s blog post: “Automate Software Workflows with Github Actions”
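As an illustration of the kind of code these linters encourage, the sketch below reads a token from the environment instead of hard-coding it, and uses the secrets module for unpredictable identifiers. The environment variable name and both function names are hypothetical; exactly which patterns a given linter flags depends on its version and configuration.

```python
import os
import secrets


def get_api_token(env_var: str = "MY_PROJECT_API_TOKEN") -> str:
    """Read a token from the environment rather than committing it to source."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return token


def new_session_id() -> str:
    """Generate an unpredictable identifier (32 hexadecimal characters)."""
    # secrets is designed for security-sensitive values, unlike random.
    return secrets.token_hex(16)
```

Keeping secrets out of the repository means there is nothing for a scanner to find, and nothing for a stranger to copy.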
Connection: personal and inter-package relationships
Understandability and trust set the stage for your project’s connection to other people and projects. What can we do to facilitate connection with our project? Use the following techniques to help enhance your project’s connection to others and their work.
Acknowledging authors and referenced work with CITATION.cff
Add a CITATION.cff file to your project root in order to describe project relationships and acknowledgements in a standardized way. The CFF format is also GitHub compatible, making it easier to cite your project.
- This is similar to a book’s credits, acknowledgements, dedication, and author information sections.
- See here for a CITATION.cff file generator (and updater).
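For a sense of what such a file contains, here is a minimal sketch that renders the text of a small CITATION.cff. The render_citation_cff helper, the title, and the author are hypothetical, and the quoting is deliberately naive; a generator tool or a YAML library is a better fit for real projects.

```python
def render_citation_cff(
    title: str, given_names: str, family_names: str, version: str
) -> str:
    """Render a small CITATION.cff document as text (illustrative only)."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        "authors:",
        f'  - given-names: "{given_names}"',
        f'    family-names: "{family_names}"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical project and author details.
citation_text = render_citation_cff("package-name", "Ada", "Lovelace", "0.1.0")
print(citation_text)
```

Writing the returned text to a file named CITATION.cff in the project root is all that remains.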
Reaching collaborators using CONTRIBUTING.md
Add a CONTRIBUTING.md file to your project root to make clear the support details, development guidance, code of conduct, and overall documentation surrounding how the project is governed.
- See GitHub’s documentation on “Setting guidelines for repository contributors”
- See opensource.guide’s section on “Writing your contributing guidelines”
Environment management reproducibility as connected project reality
Code without an environment specification is difficult to run in a consistent way. This can lead to “works on my machine” scenarios where different things happen for different people, reducing the chance that people can connect with a shared reality for how your code should be used.
“But why do we have to switch the way we do things?” We’ve always been switching approaches (software approaches evolve over time)! A brief history of Python environment and packaging tooling:
- distutils, easy_install + setup.py (primarily used during 1990’s - early 2000’s)
- pip, setup.py + requirements.txt (primarily used during late 2000’s - early 2010’s)
- poetry + pyproject.toml (began use around late 2010’s - ongoing)
Using Python poetry for environment and packaging management
Poetry is one Pythonic environment and packaging manager which can help increase reproducibility using pyproject.toml files. It’s one of several options; alternatives include hatch and pipenv.
poetry directory structure template use
user@machine % poetry new --name=package_name --src .
Created package package_name in .
user@machine % tree .
.
├── README.md
├── pyproject.toml
├── src
│   └── package_name
│       └── __init__.py
└── tests
    └── __init__.py
After installation, Poetry gives us the ability to initialize a directory structure similar to what we presented earlier by using the poetry new ... command. If you’d like a more interactive version of the same, use the poetry init command to fill out various sections of your project with detailed information.
poetry format for project pyproject.toml
# pyproject.toml
[tool.poetry]
name = "package-name"
version = "0.1.0"
description = ""
authors = ["username <email@address>"]
readme = "README.md"
packages = [{include = "package_name", from = "src"}]
[tool.poetry.dependencies]
python = "^3.9"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Using the poetry new ... command also initializes the content of our pyproject.toml file with opinionated details (following the recommendation from earlier in the article regarding declared Python version specification).
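Because pyproject.toml is plain TOML, other tools (and our own scripts) can read it. The sketch below parses a trimmed copy of the file shown above using the standard library tomllib module, which assumes Python 3.11 or later; the fallback to the third-party tomli package is an assumption for older interpreters.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumed third-party fallback for older versions

# A trimmed copy of the pyproject.toml content shown above.
PYPROJECT_TEXT = """
[tool.poetry]
name = "package-name"
version = "0.1.0"

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
"""

config = tomllib.loads(PYPROJECT_TEXT)
project_name = config["tool"]["poetry"]["name"]
python_constraint = config["tool"]["poetry"]["dependencies"]["python"]
build_backend = config["build-system"]["build-backend"]

print(project_name, python_constraint, build_backend)
# package-name ^3.9 poetry.core.masonry.api
```

The build-system table is what tells installers such as pip which backend to use when building the project.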
poetry dependency management
user@machine % poetry add pandas
Creating virtualenv package-name-1STl06GY-py3.9 in /pypoetry/virtualenvs
Using version ^2.1.0 for pandas
...
Writing lock file
We can add dependencies directly using the poetry add ... command. This command also accepts a --group flag (for example poetry add pytest --group testing) to help organize and distinguish multiple sets of dependencies.
- A local virtual environment is managed for us automatically.
- A poetry.lock file is written when the dependencies are installed to help ensure the version you installed today will be what’s used on other machines.
- The poetry.lock file helps ensure reproducibility when dealing with dependency version ranges (where otherwise we may end up using different versions which match the dependency ranges but observe different results).
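To see why a lock file matters, consider how many versions a range like ^2.1.0 can match. The sketch below is a simplified, hand-written approximation of caret matching for plain three-part versions only; it is not Poetry’s implementation, and the function names and candidate versions are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Split a plain 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def caret_upper_bound(base: tuple) -> tuple:
    """Return the exclusive upper bound: bump the left-most non-zero part."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_caret(candidate: str, constraint: str) -> bool:
    """Check whether a candidate version falls inside a caret range."""
    base = parse_version(constraint.lstrip("^"))
    return base <= parse_version(candidate) < caret_upper_bound(base)


# Hypothetical candidate versions, for illustration only.
candidates = ["2.0.3", "2.1.0", "2.1.4", "2.2.0", "3.0.0"]
matching = [v for v in candidates if satisfies_caret(v, "^2.1.0")]
print(matching)  # ['2.1.0', '2.1.4', '2.2.0']
```

Three of the five candidates satisfy the range, so two machines resolving it at different times could end up with different versions. The lock file pins one of them.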
Running Python from the context of poetry environments
% poetry run python -c "import pandas; print(pandas.__version__)"
2.1.0
We can invoke the virtual environment directly using the poetry run ... command.
- This allows us to quickly run code through the context of the project’s environment.
- Poetry can automatically switch between multiple environments based on the local directory structure.
- We can also use the environment as a “shell” (similar to virtualenv’s activate) with the poetry shell command, which enables us to leverage a dynamic session in the context of the poetry environment.
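One way to confirm which environment is in use is a short script like the sketch below, run for example as poetry run python check_env.py (the file name and the in_virtual_environment helper are hypothetical). It relies on the standard library convention that sys.prefix differs from sys.base_prefix inside a virtual environment.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment while
    # sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print(f"interpreter: {sys.executable}")
print(f"virtual environment active: {in_virtual_environment()}")
```

Running the same script with and without poetry run shows the interpreter path change.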
Building source code with poetry
% pip install git+https://github.com/project/package_name
Even if we don’t reach wider distribution on PyPI or elsewhere, source code managed by pyproject.toml and poetry can be used for “manual” distribution (with reproducible results) from GitHub repositories. When we’re ready to distribute pre-built packages on other networks we can also use the following:
% poetry build
Building package-name (0.1.0)
- Building sdist
- Built package_name-0.1.0.tar.gz
- Building wheel
- Built package_name-0.1.0-py3-none-any.whl
Poetry readies source (sdist) and pre-built (wheel) versions of our code for distribution platforms like PyPI by using the poetry build command. We’ll cover more on these files and distribution steps with a later post!
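In the meantime, the wheel file name itself carries useful information. The sketch below splits the name from the build output above into its parts; the WheelTags class and parse_wheel_filename function are hypothetical helpers, and names that include the optional build tag are deliberately not handled.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    """The five dash-separated parts of a simple wheel file name."""

    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its parts (no optional build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("file names with a build tag are not handled here")
    return WheelTags(*parts)


tags = parse_wheel_filename("package_name-0.1.0-py3-none-any.whl")
print(tags.python_tag, tags.abi_tag, tags.platform_tag)  # py3 none any
```

Here py3-none-any indicates a pure Python wheel: it targets Python 3, needs no particular ABI, and can be installed on any platform.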