While working to develop a computer vision model that detects redactions in documents obtained as a result of FOIA requests, I have encountered some tasks that I end up repeating over and over again. Most of the raw data in the problem domain exists in the form of PDFs. These PDF files contain scanned images of various government documents. I use these images as the training data for my model.
The things I have to do as part of the data acquisition and transformation process include the following:
- downloading all the PDF files linked to from a particular website, or series of web pages
- converting and splitting all the downloaded PDF files into appropriately sized individual image files suitable for use in a computer vision model
- generating statistics on the data being downloaded and processed, as well as (further down the line) things like detecting data drift for incoming training data
- splitting up data as appropriate for train / validation / test data sets
- extracting text data from the images via an OCR process
- versioning, syncing and uploading those images to an S3 bucket or some other cloud equivalent for use in the overall workflow
It’s not hard to see that many of these things likely apply to multiple machine learning data acquisition scenarios. While writing the code to handle these elements in my specific use case, I realised it might be worth gathering this functionality together in an agnostic tool that can handle some of these scenarios.
I had wanted to try out
nbdev ever since
it was announced back in 2019. The
concept was different to what I was used, but there were lots of benefits to be
had. I chose this small project to give it an initial trial run. I didn’t
implement all of the above features. The two notable missing parts are text
extraction and data versioning and/or synchronisation.
pdfsplitter is the
package I created to scratch that itch. It’s still very much a work in progress,
but I think I did enough with
nbdev to have an initial opinion.
I think I had postponed trying it out because I was worried about a steep learning curve. It turned out that an hour or two was all it took before I was basically up and running, with an understanding of all the relevant pieces that you generally use during the development lifecycle.
Built in to
nbdev in general is the ability to iterate quickly and driven by
short, small experiments. This is powered by Jupyter notebooks, which are sort
of the core of everything that
nbdev is about. If you don’t like notebooks,
you won’t like
nbdev. It’s a few years since it first saw the light of day as
a tool, and as such it felt like a polished way of working, and most of the
pieces of a typical development workflow were well accounted for. In fact, a lot
of the advantages come from convenience helpers of various kinds. Automatic
parallelised testing, easy submission to
Anaconda and PyPi
package repositories, automatic building of documentation and standardising
locations for making configuration changes. All these parts were great.
Perhaps the most sneakily pleasant part of using
nbdev was how it encouraged
best practices. There’s no concept of keeping test and documentation code in
separate silos away from the source notebooks. Following the best traditions of
nbdev encourages you
to do that as you develop. Write a bit of code here, write some narrative
explanation and documentation there, and write some tests over there to confirm
that it’s working in the way you expected. When Jeremy speaks of the significant
boost in productivity, I believe that a lot of it comes from the fact that so
much is happening in one place.
While working on
pdfsplitter, I had the feeling that I could just focus on the
problem at hand, building something to help speed up the process of importing
and generating images from PDF data for machine learning projects.
Not everything was peaches and roses, however. I ran into a weird mismatch with
the documentation pages generated and my GitHub fork of
nbdev since I was
main as the default branch but
nbdev still uses
master. I will be
submitting an issue to their repository, and it was an easy fix, but it was
confusing to struggle with that early on in my process. I’m also not sure how
nbdev will gel with large teams of developers, especially when they’re
working on the same notebooks / modules. I know
reviewnb exists now and even is used within
fastai for code reviews, but I would imagine an uphill battle trying to
persuade a team to take a chance with that.
I’ve been using VSCode at work, supercharged with GitHub Copilot and various other goodies, so it honestly felt like a bit of a step back to be forced to develop inside the Jupyter notebook interface, absent all of my tools. I also found the pre-made CLI functions a little fiddly to use — fiddly in the sense that I wish I’d set up some aliases for them early on as you end up calling them all the time. In fact, any time I made a change I would find myself making all these calls to build the library and then the documentation, not forgetting to run the tests and so on. That part felt a bit like busy work and I wish some of those steps could be combined together. Maybe I’m using it wrong.
All in all, I enjoyed this first few hours of contact with
nbdev and I will
continue to use it while developing
pdfsplitter. The experience was
also useful to reflect back into my current development workflow and
environment, especially when it comes to keeping that close relationship between
the code, documentation and tests.