Miscellaneous questions related to recent technical committee meeting

I attended the recent OpenCRVS technical committee meeting. It was interesting to hear where the project is and where it’s heading. Unfortunately I couldn’t stick around longer to ask some questions that I was wondering about during the presentation. So here are some things I wanted to ask related to the presentation.

Elasticsearch / Kibana usage. Is the project using the free licenses for these? Have you hit any limitations of the free license? If so, have you considered any alternatives, or is ES / Kibana serving the needs of the project for now?

Infrastructure migrations. During the presentation some smaller and larger migration needs were discussed (alternative orchestrators, updating dependencies etc.). This got me thinking, what exactly is the responsibility of the OpenCRVS project itself? Is it just the software? Or are you also responsible for deployments and managing them?

If OpenCRVS is also responsible for (some / all?) deployments, do you have service level agreements for those? Are zero downtime deployments necessary, or can these migrations be done with downtime? Zero downtime deployments / migrations can be quite complicated to achieve depending on the changes that need to happen.

Quality gates and performance testing. I noticed performance testing was only mentioned with the release quality gate. Is there no performance testing (automated or otherwise) being executed in the earlier quality checks? Is this testing automated or manual? Could a subset of it be automated for some sort of continuous testing on the development branch?

My reasoning for continuous performance testing is that often release phase performance testing makes it harder to figure out when or where the performance degradation has happened, at least if there are a lot of changes between releases. There’s also usually a lot of stuff going on when new releases are being prepared, so having unexpected performance degradations to figure out can also add unnecessary stress to that process.
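A minimal sketch of what such a continuous check could look like, assuming latency samples (in milliseconds) are already collected from a small load test on the development branch and compared against a stored baseline. The function names, the 95th percentile metric and the 10% tolerance are illustrative assumptions, not anything OpenCRVS currently uses.

```python
def percentile_95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding issues
    rank = -(-95 * len(ordered) // 100)
    return ordered[rank - 1]


def has_regressed(baseline_ms, current_ms, tolerance=0.10):
    """True if the current p95 exceeds the baseline p95 by more than `tolerance`.

    `tolerance` is a hypothetical threshold (10% by default); a real pipeline
    would tune it to the noise level of its test environment.
    """
    return percentile_95(current_ms) > percentile_95(baseline_ms) * (1 + tolerance)


if __name__ == "__main__":
    # Made-up sample data standing in for real load-test measurements.
    baseline = list(range(1, 101))
    slower_build = [value * 1.2 for value in baseline]
    if has_regressed(baseline, slower_build):
        print("p95 latency regressed beyond tolerance; fail the check")
```

Running a check like this on every merge narrows a regression down to a handful of commits instead of a whole release.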

There was also an open question related to release cadence during the meeting. My personal preference in most projects has been to do releases as often as possible. This has the benefit of making the process more familiar and makes it easier to recognize parts of the process that are always repeated in the same way. These parts can then hopefully be automated, making future releases even easier and removing some chances of human error.

Kubernetes proof of concept. When an alternative to Docker Swarm was investigated, were there any other container orchestration methods being looked at? Is the plan to use / support the base distribution of Kubernetes or something else? I think offering an alternative to Docker Swarm is a good idea, and Kubernetes is most likely the best known orchestration solution out there. It’s also a fairly complex system that offers fairly low level building blocks. You can build neat systems on top of it, but it does require work to get going AND to keep it going. Kubernetes version updates are pushed at a fairly steady pace, which leads to API deprecations etc. Lots of stuff to track to keep your deployments up to date.

Kafka / RabbitMQ and queues in general were mentioned as one solution for asynchronous event handling between the microservices of OpenCRVS. Is the team familiar with Kafka? I have at best been a Kafka consumer and have very little experience with it, but my understanding is that it’s another fairly complex system to implement and maintain properly. I was mainly wondering whether the added complexity of Kafka is worth it. What are the main issues Kafka would be deployed to solve? Obviously my understanding of the OpenCRVS project is fairly rudimentary. My personal preference just tends to be for simple solutions as long as those work, switching to more complex ones only when necessary.

These are the questions I wrote down during the meeting. I’m sorry I couldn’t stay and discuss these during the meeting, but I had another one already ongoing when we got to the Q&A portion.

Hi Juuso,

Thank you very much for attending the meeting and for sharing your questions. I’ll do my best to reply to each of them in a brief way and will follow up with more detail if needed later.

  1. Elasticsearch / Kibana

We are committed to open source, so we are using the free licences in the core product. As far as search is concerned, ES gives us everything we require. Kibana is where a number of monitoring and analytics features could be taken advantage of in the paid licences. External, third-party monitoring tools in the cloud are particularly useful to implementers. Implementers can choose to pay for a licence themselves if they want that, but we also supply hooks into SolarWinds Loggly or Pingdom, so at least there are options in the existing stack. We are also looking at Grafana / Prometheus, ClickHouse and Metabase and intend to offer as much monitoring choice as possible over time.

  2. Infrastructure Migration

This is a great question and one we often ask ourselves. We are not responsible for deployments and do not manage them directly, but we intend to offer as much support as we can to implementers. In reality we will be asked to investigate incidents and help implementers resolve them in order for OpenCRVS to be successful. We would no doubt provide SLA support to implementing governments that outlined our commitment to respond to bugs in Core in a timely fashion. For that reason, we must ensure that migrating between versions is as seamless and easy to do as possible.

  3. Zero downtime deployments

This is an area of DevOps we would like to invest in. With the way our DevOps is set up in Docker Swarm, any update takes about 10 minutes to deploy. We may invest time in zero downtime deployments in Docker Swarm, but in the meantime we would encourage migrations to happen after office hours. I think that our Kubernetes POC will make it easier to design and develop DevOps tooling to manage zero downtime deployments.

  4. Quality gates

Perhaps I didn’t explain this very well, but the vast majority of our tests (unit / end-to-end / build / deploy) are automated in GitHub Actions and run on different branches (develop / release / hotfix) depending on the gate. It’s a great idea, though, for us to run smaller-scope performance tests on minor and hotfix releases. We will look into that and get back to you in this discussion group with what we propose. We are still quite early in our performance testing journey and need many more performance tests written for different business functions: not just for the write-heavy registration functions, but also for the read-heavy searching and deduplication functions, and perhaps also a hybrid.

  5. Release cadence

My preference would also be to release often and semantically. However, there are many complexities in the Digital Public Good space that make that harder, and we additionally have QA and development capacity issues in the core team. We have spent some time in internal discussions since the meeting, and I will be publishing a blog post this month outlining our approach. I will share a link to it in this thread and would appreciate your thoughts when it is published.

  6. Kubernetes

We haven’t looked at methods other than Kubernetes, to be honest. I think that we would have to support the base distribution, along with utilising Helm. We are looking into zero-ops approaches like https://microk8s.io/. If you have any other ideas, please let us know, as we would really appreciate contributions.

  7. Kafka

Regarding Kafka, the problem we are trying to solve is twofold. First, we want to refactor asynchronous requests out into an independent system that can process them in periods of low traffic, thus decreasing stress on modules. Second, we want end-to-end transaction visibility and message storage. We want to be able to respond to beneficiary incidents, and to track and re-run a transaction through the entire system in case of any failures.
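To make the second goal a little more concrete, here is a minimal in-memory sketch of the idea: events are appended to a stored log, failures are recorded with their position, and a single transaction can be traced or re-run later. This is not the Kafka API and not OpenCRVS code; `Event`, `EventLog` and every field name are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    transaction_id: str  # hypothetical identifier tying steps of one transaction together
    step: str            # hypothetical name of the processing step


@dataclass
class EventLog:
    """Append-only store of events plus a record of handler failures."""
    events: list = field(default_factory=list)
    failures: list = field(default_factory=list)

    def append(self, event: Event) -> int:
        """Store an event and return its offset in the log."""
        self.events.append(event)
        return len(self.events) - 1

    def run(self, handler: Callable[[Event], None],
            transaction_id: Optional[str] = None) -> int:
        """Feed stored events to `handler`, optionally for one transaction only.

        Failures are recorded instead of aborting the run, so the affected
        transaction can be re-run later. Returns the number of events handled.
        """
        handled = 0
        for offset, event in enumerate(self.events):
            if transaction_id is not None and event.transaction_id != transaction_id:
                continue
            try:
                handler(event)
                handled += 1
            except Exception as exc:
                self.failures.append((offset, event.transaction_id, str(exc)))
        return handled

    def trace(self, transaction_id: str) -> list:
        """Return the steps recorded for one transaction, in log order."""
        return [e.step for e in self.events if e.transaction_id == transaction_id]
```

Because the events stay in the log after processing, a failed transaction can be replayed by calling `run` again with its `transaction_id` once the underlying problem is fixed. A real broker would add durability, partitioning and consumer offsets on top of this idea.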

Thanks again for your questions Juuso. We’d love to involve you as we take these forward. Please keep in touch!

All the best,

Euan

Thank you for taking the time to reply, looking forward to seeing that blog post about releases!

As far as alternative Kubernetes distributions go, I unfortunately only have experience with managed Kubernetes, more specifically Azure Kubernetes Service. I think that’s basically the base distribution plus some custom Azure tooling on top. Even as a managed service, it seems to have a lot of maintenance overhead the end user is responsible for. Wouldn’t really call it a fire and forget kind of application runtime. But like I said, offering an alternative to Docker Swarm makes a lot of sense, and Kubernetes seems to be the biggest thing out there in terms of container orchestration, so supporting that makes a lot of sense too.
