Around one year ago, on September 7, I released the first stable version of my first “real” open source project: json_repair.
If you are not familiar with it, it’s a simple Python package that takes a broken JSON string and returns a version of it that is formally correct and usable by code.
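A minimal sketch of the basic usage (see the README for the full API):

```python
from json_repair import repair_json

# Typical LLM output: truncated before the closing brace, with a trailing comma
broken = '{"name": "json_repair", "stars": 1000,'

# repair_json returns a syntactically valid JSON string
fixed = repair_json(broken)
print(fixed)  # {"name": "json_repair", "stars": 1000}
```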
Why did I create it? I was playing around with GPT and needed some level of structure. I knew that Large Language Models were trained on plenty of JSON data, and JSON was the standard way of exposing structured data. I documented that project in another article.
Except…
Except that an LLM isn’t really guaranteed to return anything remotely valid and usable: it will most of the time, but not always.
So I wrote this library and open sourced it.
Since then, lots of companies and researchers have adopted it: I have seen referrals from Alibaba, Adobe, Google, Microsoft[1], and a very vibrant Chinese research community.
Today the package is integrated into quite a few open source projects and into some demo code from Microsoft and AWS.
You can see all the public open source projects using it here, and an overview of the download numbers here.
I didn’t write this to celebrate that achievement, but rather to talk about the things I learned this year as a solo developer and open source maintainer.
Hopefully it will be useful to you.
What I learned about testing
Test Driven Development
How useful it is in the long term: 5/5
How hard it is to implement: 3/5
When I started writing the library, I didn’t think about the implications of supporting multiple use cases, nor did I have any structured approach to ensuring that my library was working as expected.
I had to come up with something really, really fast because I was going insane.
So I chose an approach I knew but had never used extensively: Test-Driven Development, or TDD.
It was useful in two ways:
To make sure the library always works: every reported bug gets an associated test, backward compatibility doesn’t break, and so on.
As documentation: if you want to know whether your use case is supported by the library, you can check the tests.
It has been extremely useful so far; the main downside is that it’s extremely easy to lose control of your tests and create a ton of redundancy.
I have already refactored the test suite once, and I expect to have to do it again if development continues at this pace.
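To give an idea of what “every bug gets a test” looks like in practice, here are two hypothetical test cases in the style of the actual suite (not copied from it):

```python
from json_repair import repair_json

# Hypothetical regression tests: every reported bug becomes a test case,
# so the fix can never silently regress.

def test_missing_closing_brace():
    # e.g. an LLM response cut off by the token limit
    assert repair_json('{"key": "value"') == '{"key": "value"}'

def test_trailing_comma():
    # e.g. a model imitating JavaScript object literals
    assert repair_json('{"key": "value",}') == '{"key": "value"}'
```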
Code Coverage
How useful it is in the long term: 1/5
How hard it is to implement: 1/5
Alongside the testing suite I also implemented code coverage tracking and achieved 100% coverage.
Code coverage in Python is trivial to implement, but it wasn’t really useful for me: I found some dead code and adjusted some test cases to cover unlikely edge cases.
I would have skipped this if it weren’t a hobbyist project I do for passion and learning.
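For the record, with a pytest suite this is roughly all it takes (assuming the pytest-cov plugin):

```bash
pip install pytest-cov
pytest --cov=json_repair --cov-report=term-missing
```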
Performance testing
How useful it is in the long term: 3/5
How hard it is to implement: 1/5
At the very beginning of the adoption of this library, I found that some users were very concerned about the performance impact of a library that potentially needs to scan a lot of text on every call, so I implemented some basic performance testing:
At first I optimized the code to improve performance by over 50%
Then I introduced performance regression tests, to ensure code changes don’t dramatically slow things down.
It was fun!
And I learned a lot about optimizing Python code: for example, an if check and a try/except block can cover the same case, and which one is faster depends on how likely the exception is to be thrown.
In my most used helper functions, picking the right one made a huge difference in performance.
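A toy illustration of that trade-off (not the actual library code): try/except is nearly free when the exception doesn’t fire, while raising one is expensive, so the explicit check only wins when failures are frequent.

```python
import timeit

s = "some text being parsed"
i = 2  # in bounds most of the time; set i = 100 and with_check() wins

def with_check():
    # Explicit bounds check: pays a len() call on every invocation
    return s[i] if i < len(s) else ""

def with_try():
    # try/except: almost free on the happy path, expensive when it raises
    try:
        return s[i]
    except IndexError:
        return ""

print("if/then:   ", timeit.timeit(with_check))
print("try/except:", timeit.timeit(with_try))
```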
Was this particularly useful? I don’t think so.
The library was already fast; I could have profiled it once and not spent so much time on it. So I wouldn’t advise adopting this practice unless you are implementing truly performance-critical workloads.
What I learned about coding best practices
Code copilots
How useful it is in the long term: 5/5
How hard it is to implement: 1/5
Let’s be honest, I wouldn’t have done this whole project without having GPT-4 as a copilot.
It wrote the base version of the parser and I used it time and time again (I code with Cursor) to produce performance optimizations and solve some obscure errors and warnings (especially with mypy).
I don’t think AI is coming for developers’ jobs: many times the proposed solutions weren’t correct, but they were directionally right and saved me a ton of time.
You can think of AI copilots as a senior developer who isn’t on your team but can come over to your monitor and give you advice from time to time.
Enforcing coding styles and standards
How useful it is in the long term: 5/5
How hard it is to implement: 1/5
One of the best features of the Python ecosystem is the fact that you can enforce coding styles and standards quite easily.
Even though I was a solo developer, this was extremely useful, and I used pre-commit not only to enforce coding standards but also to run the unit tests and performance tests before a commit is pushed.
An absolute must for everyone using Python.
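For illustration, a .pre-commit-config.yaml in that spirit could look like this (a sketch, not the project’s actual configuration; the ruff hooks are my example, pin revisions to current releases):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.3  # pin to a current release
    hooks:
      - id: ruff         # linting
      - id: ruff-format  # code style
  - repo: local
    hooks:
      - id: tests
        name: run unit tests
        entry: pytest
        language: system
        pass_filenames: false
```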
Use types in Python
How useful it is in the long term: 2/5
How hard it is to implement: 4/5
Python isn’t a statically typed language, but it lets you annotate which types are expected in function definitions (and not only there).
This is useful for catching errors and for downstream clients of the library, but it’s a lot of work. Lots of people using the library like to run mypy, so I put the effort in.
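For example (a hypothetical helper, not the library’s actual code), annotations let mypy flag incorrect calls before they become runtime bugs:

```python
from typing import Optional

# Hypothetical helper in the spirit of the library's internals
def parse_number(json_str: str, index: int) -> Optional[float]:
    """Try to parse a number starting at `index`; return None on failure."""
    ...

parse_number("[1, 2]", index="0")  # mypy error: expected "int", got "str"
```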
Honestly I don’t see much use for this approach if you are not going to distribute your code widely.
CI/CD with GitHub Actions
How useful it is in the long term: 5/5
How hard it is to implement: 1/5
Early in the project I got a report that my library wasn’t working on Python 3.7; while I wasn’t particularly worried about that version, it made me rethink the way I manage my code.
So I created a suite of GitHub Actions to test against multiple Python versions, and a CI/CD pipeline that pushes a new version to PyPI when I release my code.
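The core of the test matrix boils down to something along these lines (a simplified sketch, not the project’s actual workflow):

```yaml
# .github/workflows/tests.yml (simplified sketch)
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -e . pytest
      - run: pytest
```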
A life saver; I will carry this approach with me forever.
What I learned about being a maintainer
Issue reports, Feature Requests and Pull Requests
How useful it is in the long term: 5/5
How hard it is to implement: 2/5
Especially in the beginning I got a lot of feedback and feature requests, some more useful than others.
After a while I learned to cut through the noise and put on my product manager hat:
Never ignore user needs. A need might be obscure, but unless it was too expensive to support, I decided to accommodate everything I could.
Only accept real-world use cases. As it turns out, some people go around opening issues on projects after abstract testing of edge cases, without hitting any issue in actual use. I decided to ignore those: each new feature complicates the code, and each new feature is time I could spend doing something else.
Write proper triaging steps and pull request templates so that users can debug before coming to me.
This is probably one of those cases in which my industry experience was extremely useful in dealing with open source.
Having a good README and demo site
How useful it is in the long term: 1/5
How hard it is to implement: 1/5
This is a nice-to-have. I haven’t invested much in promoting the project; mostly, I claimed the most obvious name for such a library on PyPI, so adoption is organic. But I had fun creating lots of additional content to introduce users to my project.
Sponsorship
How useful it is in the long term: 1/5
How hard it is to implement: 1/5
This spring I added GitHub Sponsors; if you want to sponsor me you can use this link: https://github.com/sponsors/mangiucugna.
Not because I need the money, but because a) I wanted to experiment with it and see how generous people are, and b) if you make money out of someone else’s work, you should be decent and share a few bucks with them.
I did not expect to get even a cent; instead, I have received around $100 already.
Not a sum that will allow me to do this full time, but good for the spirit.
Final thoughts
I invested a lot of time in this library; it’s a hobbyist project that I am doing because my day-to-day isn’t coding anymore.
The amount of time spent refining the edges of this library is frankly unreasonable for a for-profit project, and I think that is fine. Many open source projects are a product of love, and they have been extremely useful for the world.
But if your company doesn’t sink this much time into the code you work on, don’t feel bad: that’s why you get paid for it, while I am doing this for free!
[1] How do I know? It turns out that many big companies linked my GitHub repository in their internal pull requests for code review, so I could see the referrals from their internal code repositories.