Architecture

Prelude

The initial scope of proofbuddy + ISM was to support up to 20 users, each with one or more isabelle server sessions.

ISM was introduced as a middleware between the proofbuddy backend and isabelle servers to address three main goals:

  • Scalability1: Ensure we can handle all users, i.e., fully utilize the available resources
  • Availability: Ensure the backend can always get a connection and a session2
  • Security: Ensure users cannot break our system through the freedom that isabelle server provides

ISM

ISM consists of 2 main components:

  • agent: Handles metric collection and server pool management
  • gateway: Handles connection management and proxies messages between clients and servers

Both work together and are mostly event-oriented, with some components borrowing parts of the actor model.

The diagram below illustrates their relationship and responsibilities at an abstract level.

ISM Arch overview
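
As a rough illustration of this event-oriented style with actor-like components, the sketch below shows a minimal mailbox-driven event loop in Go; `Event`, `actor`, and the event kinds are illustrative names, not ISM's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical envelope; the real ISM event types differ.
type Event struct {
	Kind    string // e.g. "client.connect", "server.status"
	Payload any
}

// actor owns its state and processes events one at a time from its mailbox,
// which is the part of the actor model that some components borrow.
type actor struct {
	mailbox chan Event
}

// run is the event loop: it drains the mailbox until it is closed.
func (a *actor) run(wg *sync.WaitGroup) {
	defer wg.Done()
	for ev := range a.mailbox {
		// Dispatch on the event kind; real handlers would update pool
		// state, proxy messages, or record metrics.
		fmt.Println("handling", ev.Kind)
	}
}

func main() {
	a := &actor{mailbox: make(chan Event, 16)}
	var wg sync.WaitGroup
	wg.Add(1)
	go a.run(&wg)

	a.mailbox <- Event{Kind: "client.connect"}
	a.mailbox <- Event{Kind: "server.status"}
	close(a.mailbox)
	wg.Wait()
}
```

A per-component mailbox keeps each component's state effectively single-threaded, which is presumably what makes borrowing this part of the actor model attractive here.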

The events for both can be roughly divided into the following categories (a short sketch follows this list):

  • client connections (accept, close)
  • messages, i.e., network writes
    • from client connections
    • from server connections, including derived events like (a)sync tasks
  • prometheus metrics collection
  • server status events: shutdown, start, startup done3
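
To make these categories concrete, the hypothetical Go encoding below gives each category its own event type and dispatches on it; all names are illustrative, not ISM's real definitions.

```go
package main

import (
	"fmt"
	"time"
)

// event is a marker interface; each category above becomes one concrete type.
type event interface{ isEvent() }

// clientConn covers accepting and closing client connections.
type clientConn struct {
	connID string
	closed bool // false = accepted, true = closed
}

// message covers network writes from either side of the proxy,
// including derived (a)sync task messages.
type message struct {
	fromServer bool
	async      bool
	payload    []byte
}

// metricsScrape marks a prometheus metrics collection pass.
type metricsScrape struct{ at time.Time }

// serverStatus covers shutdown, start, and startup-done notifications.
type serverStatus struct {
	instance string
	state    string // "start" | "startup done" | "shutdown"
}

func (clientConn) isEvent()    {}
func (message) isEvent()       {}
func (metricsScrape) isEvent() {}
func (serverStatus) isEvent()  {}

// handle shows how a component could dispatch on the event category.
func handle(ev event) {
	switch e := ev.(type) {
	case clientConn:
		fmt.Println("client connection event, closed:", e.closed)
	case message:
		fmt.Println("message event, from server:", e.fromServer)
	case metricsScrape:
		fmt.Println("metrics scrape at", e.at)
	case serverStatus:
		fmt.Println("server", e.instance, "is now", e.state)
	}
}

func main() {
	handle(serverStatus{instance: "isabelle-0", state: "startup done"})
	handle(clientConn{connID: "c1", closed: false})
}
```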

The main background tasks are:

  • (agent) server pool management: Ensuring isabelle server instances are available (a minimal sketch follows this list)
    • Replenishing the pool: Starting new instances when required4
    • Stopping unhealthy instances: Stopping instances that become unresponsive or show other signs of unhealthy behavior
  • (agent + gateway) task tracking: The agent tracks the duration of all commands
    • This is done in cooperation with the gateway
    • This component is crucial for enabling, e.g., orphaned session cleanup
  • (agent) service health tracking: Track systemd service health and events via dbus
  • (executor):
    • server instance management
    • management of server resources (storage, networking)
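
As a minimal sketch of the pool replenishment task mentioned above, the Go snippet below assumes a hypothetical `pool` with a `startInstance` hook standing in for the executor; none of these names are the actual agent API.

```go
package main

import (
	"context"
	"fmt"
)

// pool keeps a desired number of isabelle server instances available.
// desired, healthy, and startInstance are illustrative stand-ins for the
// agent's real pool state and the executor's instance management calls.
type pool struct {
	desired       int
	healthy       int
	startInstance func(ctx context.Context) error
}

// replenish starts new instances until the healthy count reaches the
// desired count, which is roughly what the agent's background task does
// whenever an instance dies or is stopped as unhealthy.
func (p *pool) replenish(ctx context.Context) {
	for p.healthy < p.desired {
		if err := p.startInstance(ctx); err != nil {
			// A failed start is left for the next replenishment pass.
			fmt.Println("start failed, retrying later:", err)
			return
		}
		p.healthy++
	}
}

func main() {
	p := &pool{
		desired: 3,
		healthy: 1,
		startInstance: func(ctx context.Context) error {
			fmt.Println("starting instance")
			return nil
		},
	}

	// In the agent this is driven by server status events and a periodic
	// check; a single direct call stands in for that here.
	p.replenish(context.Background())
	fmt.Println("healthy instances:", p.healthy)
}
```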

Info

The agent also exposes, when configured, all the profiles listed in the runtime/pprof.Profile interface via HTTP. This is essential for profiling.
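
A minimal sketch of how such an endpoint can be wired up in Go using the standard net/http/pprof handlers; the listen address and the conditional start are assumptions, not ISM's actual configuration.

```go
package main

import (
	"log"
	"net/http"

	// Importing net/http/pprof for its side effect registers the
	// /debug/pprof/* handlers on the default mux; the index page lists
	// all runtime/pprof profiles (goroutine, heap, block, mutex, ...).
	_ "net/http/pprof"
)

func main() {
	// The agent would only start this listener when profiling is
	// configured; the address here is purely illustrative.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```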

1

There was no initial knowledge about the scalability of isabelle server. It has proven to scale easily to the required number of clients. The current service instance pooling mainly provides better availability, since instances can die. However, ISM will eventually expand on this by supporting clustered operation.

2

Provided there are enough resources, of course.

3

An isabelle server takes a while to start up; a patch allows us to know when it is available via sd_notify.

4

The server pool has a desired number of instances that should always be available. More instances are okay, but at least a certain static number is desired to ensure availability on server failure.