After deploying a self-sovereign identity (SSI) solution for your organization, you may be interested in digging into the lower layers of the stack to see how SSI really works.
Recently, we published a technical guide that details how to troubleshoot a Hyperledger Indy network. This work is based on our experience of debugging problems both on public Sovrin networks and in internal test pools during the last three years. This blog post is a less technical version of that guide.
In order to understand how to troubleshoot a system, one needs some background knowledge on how it works. It starts with how Indy Node implements a distributed ledger, in other words a replicated and highly available database that is capable of withstanding crashes and the malicious behavior of individual nodes. This is achieved using the RBFT (Redundant Byzantine Fault Tolerance) consensus protocol, which is a leader-based protocol for agreeing upon the global order of transactions across all non-faulty nodes in a pool. This means that all nodes should always end up with equal transaction logs (which are called ledgers and have a Merkle tree on top of them for quick consistency checks) and equal state (which is basically a key-value storage that also supports quick consistency checks thanks to a Merkle Patricia trie under the hood). Indy Node can tolerate fewer than 1/3 of the nodes in its pool being faulty (at most f faulty nodes in a pool of n ≥ 3f + 1 nodes), and pool membership is permissioned.
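To make the fault-tolerance arithmetic concrete, here is a minimal sketch. It assumes the standard BFT bound n ≥ 3f + 1 and a quorum of n - f nodes; the function names are ours for illustration and are not taken from the Indy Node code base.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 (standard BFT bound)."""
    if n < 1:
        raise ValueError("a pool must contain at least one node")
    return (n - 1) // 3


def write_quorum(n: int) -> int:
    """Assumed quorum size: any two groups of this size share an honest node."""
    return n - max_faulty(n)


# Print the tolerated faults and quorum size for a few pool sizes.
for n in (4, 7, 10, 25):
    print(f"n={n:2d}  tolerated faults={max_faulty(n)}  quorum={write_quorum(n)}")
```

Note that a pool of 4 nodes tolerates one faulty node, while growing it to 5 or 6 nodes does not increase that number; only the 7th node does.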
Writes to the ledger are done in batches that are proposed by the leader node, and then agreed upon by the rest of the pool, following the RBFT protocol. This means that performance of an Indy network is limited by the performance of the leader node, so there is a sub-protocol for changing it called a view change. A view change starts when enough nodes detect and agree (through exchanging votes) that the current leader is misbehaving, down/offline, proposing batches too slowly, or censoring some transactions.
Detecting intentional performance degradation is a tricky task. In RBFT, it is solved by using a number of backup protocol instances, each with its own leader distinct from the leader of the master instance. Each protocol instance implements the three-phase commit from PBFT (Practical Byzantine Fault Tolerance) and spans all nodes (meaning all nodes run all protocol instances at the same time). If the master protocol instance starts consistently lagging behind some of the backups, it means that it should be possible to improve network performance by changing the master leader through a view change. Note that protocol instances are fully independent of each other, and only transactions ordered by the master instance are actually executed. Backup instances are used only for performance comparison.
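The comparison between the master and the backup instances can be sketched roughly as follows. This is only an illustration of the idea: the function name, the throughput units, and the threshold value `delta` are hypothetical, and the real monitoring logic in Indy Node takes more signals into account.

```python
from statistics import mean


def master_is_degraded(master_tps, backup_tps, delta=0.4):
    """Return True when the master orders noticeably slower than the backups.

    master_tps: throughput measured on the master instance
    backup_tps: list of throughputs measured on the backup instances
    delta:      hypothetical threshold for the master/backup ratio
    """
    if not backup_tps:
        return False  # nothing to compare against
    average_backup = mean(backup_tps)
    if average_backup == 0:
        return False  # backups are idle too, so the leader is not the bottleneck
    return master_tps / average_backup < delta
```

A node that sees `master_is_degraded(...)` return True for long enough would cast its vote for a view change; a single node's vote is not enough to trigger one.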
Sometimes individual nodes can lag behind the pool—after being offline for maintenance, for example. In this case they can use a sub-protocol called a “catch up,” which is much faster than normal ordering through a three-phase commit, since it just downloads available transaction logs from other nodes. Note that during this process the node doesn’t trust other nodes blindly but performs consistency checks of downloaded chunks.
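The consistency check during catch up can be illustrated with a small sketch. It assumes an RFC 6962 style Merkle tree over SHA-256 and a root hash that the node has already learned from enough other nodes to trust it; `apply_catchup` and its parameters are hypothetical names, and the actual Indy Node implementation works with consistency proofs and is more involved.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962: leaves are hashed with a 0x00 prefix
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962: inner nodes are hashed with a 0x01 prefix
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    if not leaves:
        return hashlib.sha256(b"").digest()
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    # Split at the largest power of two strictly smaller than len(leaves)
    k = 1
    while k * 2 < len(leaves):
        k *= 2
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def apply_catchup(local_txns: List[bytes],
                  downloaded_txns: List[bytes],
                  trusted_root: bytes) -> List[bytes]:
    """Accept downloaded transactions only if they lead to the trusted root."""
    candidate = local_txns + downloaded_txns
    if merkle_root(candidate) != trusted_root:
        raise ValueError("downloaded transactions do not match the agreed root")
    return candidate
```

With this check, a single node that serves a tampered chunk cannot corrupt the ledger of the node that is catching up: the recomputed root will not match and the chunk is rejected.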
Another notable piece of Indy Node functionality is its automatic upgrade process, which is done with the help of a sidecar process. Upon request from the main process (triggered by an upgrade transaction), it can stop the main process, perform an upgrade (in the case of Ubuntu, using the apt package manager), and start the main process again.
Points of failure
One may ask: “How can such a resilient system fail?” This is an interesting question.
Unfortunately, in a real-world network, connections can be flaky, individual nodes can fail, and software can have bugs. While individual failures usually don’t affect the Indy network as a whole (otherwise there would be no point in building such a complicated system), they might add up when left unattended, and this could lead to serious damage. So, rule #1 is to do regular health checks and fix problems as soon as possible, so that they don’t accumulate.
One common example of failures could be firewall misconfigurations, which prevent nodes from connecting to each other. While these failures are easy to detect by just observing connectivity reports (obtained using a VALIDATOR_INFO diagnostic transaction, for instance) and they often don’t look serious (after all it is “just” an environment problem affecting just a couple of nodes), they are still dangerous and need to be treated as soon as possible.
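A very basic connectivity probe, similar in spirit to what one would do with netcat, can be written in a few lines. This is a sketch using only the Python standard library; the node names, hosts, and ports you pass in are up to you, and a successful TCP connect only shows that the port is open from where you run the check, not that the node is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


def unreachable_nodes(nodes: dict, timeout: float = 3.0) -> list:
    """nodes maps a node name to a (host, port) tuple; returns the names that failed."""
    return sorted(
        name for name, (host, port) in nodes.items()
        if not is_reachable(host, port, timeout)
    )
```

For example, `unreachable_nodes({"Node1": ("node1.example.org", 9701)})` would return `["Node1"]` if that placeholder address cannot be reached from your machine. Running the same probe from several validator hosts helps tell a firewall problem on one side from a node that is actually down.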
Another quite common problem is nodes failing to properly perform an upgrade. This is a much more dangerous situation because it can affect all nodes at once. Most often this problem arises from some unhandled edge case in the apt repository and local state management. So before releasing a new version of Indy Node, the upgrade process should always be thoroughly tested using a test pool that mimics the production network.
Due to the complexity of the consensus protocol, some bugs may be lurking in Indy Node itself, even though a lot of them were found and fixed during testing. Usually they are not immediately dangerous and lead just to some transient consensus failures on a small subset of the network, which usually can be fixed just by restarting affected nodes. Note that the implementation of promotion and demotion of nodes is quite complex, which means that normally they shouldn’t be used as an attempt at a “quick fix” of such problems because there is a significant chance that they can make the situation even worse.
Now, the most dangerous types of failures are ledger corruption, or divergent transaction logs on different nodes. The latter case was observed in the past on test pools during a view change while under high load, and it was fixed by implementing a proper PBFT view change protocol.
Such failures have never been observed on the Sovrin MainNet, and we hope that it will stay that way.
What to do when it fails
If the Indy network fails in some way, the two most important things to do are:
- repair it as soon as possible (possibly fixing just consequences, and not the cause)
- analyze the cause of the failure, and fix it
For the first item, we have a separate section in our troubleshooting guide called the emergency checklist. It boils down to quickly finding out the class of failure and attempting a quick fix. Diagnostic measures include:
- checking whether the Indy network is accessible at all using tools ranging from standard Indy CLI to some low-level utilities like netcat
- finding out which transaction types are affected
- using a VALIDATOR_INFO diagnostic transaction to find out each node’s state, such as its connectivity, ledger sizes and root hashes, and whether the node is participating in consensus or is in the middle of a catch up or view change
- searching through journalctl logs for signs of crashes
- looking through node logs in order to ascertain the sequence of events that led to failure, and searching for signs of data corruption
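Once per-node data has been collected, comparing ledger sizes and root hashes across nodes is easy to automate. The sketch below assumes the reports have already been reduced to a simple dictionary per node; the keys `ledger_size` and `root_hash` are our own simplification and not the actual layout of a VALIDATOR_INFO reply.

```python
from collections import defaultdict


def group_by_ledger_state(reports: dict) -> dict:
    """Group node names by their (ledger_size, root_hash) pair."""
    groups = defaultdict(list)
    for name, report in sorted(reports.items()):
        groups[(report["ledger_size"], report["root_hash"])].append(name)
    return dict(groups)


def classify(reports: dict) -> dict:
    """Compare every node against the state reported by the largest group."""
    groups = group_by_ledger_state(reports)
    majority = max(groups, key=lambda key: len(groups[key]))
    majority_size = majority[0]
    out_of_sync, divergent = [], []
    for (size, root), names in groups.items():
        if (size, root) == majority:
            continue
        if size == majority_size:
            # Same number of transactions but a different root: content differs
            divergent.extend(names)
        else:
            # Different ledger size: may simply need to catch up
            out_of_sync.extend(names)
    return {"out_of_sync": sorted(out_of_sync), "divergent": sorted(divergent)}
```

Nodes in the `out_of_sync` list are often just behind and recover through catch up, while any entry in `divergent` deserves immediate attention, since equal ledger sizes with different root hashes point to divergent transaction logs.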
As already mentioned, quick fixes often boil down either to fixing environmental problems or restarting some or all nodes. Note that if a failure happened after an upgrade, you should first check whether the upgrade itself was successful. We’d like to also reiterate how important it is to analyze and fix the cause of the failure, even if a quick fix already helped. Of course, this also applies to failures of individual nodes found during regular health checks.
To give a sense of what troubleshooting could look like, let’s discuss one hypothetical case (which is based on a real incident that happened more than a year ago, described here). Let’s pretend that we find out our Indy network rejects all write transactions, which is a serious problem. A sequence of troubleshooting actions (each of which is described in greater detail in the troubleshooting markdown doc checked in next to the code) could be:
- Do a very quick check to ensure that the network is reachable, and that more than 2/3 of nodes are online – which, let’s say, passes
- Check the consensus state of all online nodes – which shows that half of them are doing a view change, while the other half is trying to order transactions but failing, because more than 2/3 of the nodes must participate for normal operation
- Give it 5-10 minutes to finish, as we know that a view change can take some time – if the view change doesn’t finish, meaning that the nodes doing it are stuck, ask the stewards of the stuck nodes to restart them. Let’s say that helps: the restarted nodes successfully catch up and start participating in ordering. At some point more than 2/3 of the nodes are participating, and write consensus is restored
- At this point we restored the Indy network to a working state, but the cause of such behavior is still unknown, so it is time to dig deeper into the logs to find what the real cause of the failure is
- After an in-depth analysis of the logs and source code, it was determined that there is a really insidious bug in the Indy Node code which was triggered by a very long and unfortunate sequence of events spanning several weeks.
- So the next step is to actually fix that bug, prepare a new Indy Node release, thoroughly test it, and apply it to the Indy network, which was done.
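The triage part of this scenario can be expressed as a small sketch. It assumes a pool of eight nodes, a quorum of n - f participating nodes, and two made-up mode labels (`participating` and `view_change`) that stand in for whatever your diagnostic output actually reports.

```python
def triage(modes: dict) -> dict:
    """modes maps a node name to its current mode (hypothetical labels)."""
    n = len(modes)
    f = (n - 1) // 3          # tolerated faulty nodes, from n >= 3f + 1
    quorum = n - f            # assumed number of nodes needed to order writes
    participating = sorted(k for k, v in modes.items() if v == "participating")
    stuck = sorted(k for k, v in modes.items() if v == "view_change")
    return {
        "quorum": quorum,
        "participating": len(participating),
        "can_write": len(participating) >= quorum,
        "restart_candidates": stuck,
    }


# Half of the pool is ordering, the other half is stuck in a view change
pool = {f"Node{i}": "participating" for i in range(1, 5)}
pool.update({f"Node{i}": "view_change" for i in range(5, 9)})
before = triage(pool)

# Two stewards restart their nodes, which then catch up and rejoin ordering
pool["Node5"] = pool["Node6"] = "participating"
after = triage(pool)

print(before["can_write"], after["can_write"])
```

In this example the pool needs six participating nodes, so writes are blocked with four and restored once two of the stuck nodes rejoin. The remaining two nodes still need to be restarted and, more importantly, the root cause still needs to be found.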
In order to prevent failures before they happen, regular health checks should take place, and all new code changes should be thoroughly tested. The earlier a bug is detected and fixed, the less dangerous it is to the Indy network as a whole.
All in all, troubleshooting the Indy network can sometimes be an intimidating task. But with documentation and community collaboration, responding to failures in the Indy network is doable.
Get to production, faster
As the originator and a leading contributor of Hyperledger Indy, we’re dedicated to supporting an open source ecosystem and encourage others to dive into the code. However, if you’re looking for enterprise-grade software, production support, and future-proofed solutions built on open source and open standards, we’d love to introduce you to Verity.