A Conversation about Blockchain Data

I first met Bob Summerwill two years ago at DevCon in Shanghai when I bought him a beer at a rooftop party overlooking the river. I had heard of him previously through his work on the cpp-ethereum client. Bob impressed me because (among other things) he created the amazing graphic showing the inter-relationships between the components of the cpp client.

Let’s just say Bob knows how to pay attention to detail!

So, I was pleased, recently, when I was able to engage him in a short email conversation about the state of (and possible futures for) the Ethereum data. With Bob’s permission, I present a (somewhat) edited version of our conversation below.

[TJR]
Hi Bob…We met in Shanghai in 2016 and then again this year in Toronto. I wanted to share a few thoughts about problems with the Ethereum data…[I went into detail about my work with QuickBlocks].

[Bob]
No response. [He’s very busy, so this was not surprising.]

[TJR]
<cricket, cricket>

[Many weeks later…]

[Bob]
Hi Jay. I remember you. How’ve you been? Sorry for taking so long to respond.

[If there’s one thing to say about this entire community it is that everyone, every single one, is busy. I’ll skip the rest of the preliminaries.]

[TJR]
In my work, I try to figure out better ways to extract the Ethereum data from the chain. I’m concerned about the enormous central role certain small groups of people play in the ecosystem, particularly as it relates to accessing and delivering off-chain data. I think this will become a problem in the long term, and that it’s better to think about these issues now rather than later.

[Bob]
I remember you. Thanks for the beer.

[Bob went on to introduce me to a number of different people in the space. Another wonderful thing about this community, everyone is happy to introduce you to other people.]

[We discussed the Open Source Blockchain Explorer, Now! group among other things.]

[TJR]
I’m interested in the delivery of data from the nodes and how people use that data off-chain (for example, from blockchain explorers like EtherScan). I have two main concerns: (1) the currently available “off-chain” data is totally centralized (it can be easily censored), and (2) the data is delivered without the “halo” of being consented-to (how does the user know that EtherScan is not lying or making simple mistakes?). I want to use only “provably-true” data. That is, off-chain data that is as trustworthy as on-chain data. Data that cannot be denied and is accompanied by a proof of its derivation all the way back to its origin.
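
[A crude illustration of the concern, added here by me and not part of the original exchange: the sketch below cross-checks a value reported by a third-party explorer against the same value fetched directly from a node you run yourself, using the standard JSON-RPC interface. It is nowhere near a full proof of derivation back to origin, just the simplest possible consistency check. The node URL, block number, and claimed balance are assumptions for illustration.]

```python
# Minimal sketch: cross-check an explorer's claim against your own node.
# Assumptions: a local node at NODE_URL, a hypothetical block number, and a
# hypothetical balance claimed by some third-party explorer.
import requests

NODE_URL = "http://localhost:8545"               # assumed local node endpoint
ACCOUNT = "0xB97073B754660BB356DfE12f78aE366D77DBc80f"
BLOCK = "0x5f5e10"                               # hypothetical block number (hex)
CLAIMED_WEI = 123456789000000000                 # hypothetical value from an explorer

def rpc(method, params):
    """Call a standard Ethereum JSON-RPC method and return its result."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    resp = requests.post(NODE_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

# eth_getBalance is part of the standard JSON-RPC interface every node exposes.
node_wei = int(rpc("eth_getBalance", [ACCOUNT, BLOCK]), 16)

if node_wei == CLAIMED_WEI:
    print("Explorer agrees with the node for this account and block.")
else:
    print(f"Mismatch: node says {node_wei}, explorer claimed {CLAIMED_WEI}.")
```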

[Bob]
“Provably-true data” — I hear you. I’m interested in watching how TheGraph and EthQL play out.

[TJR]
Those projects concern me. As others have pointed out, there is little incentive for those types of groups to relinquish their position in the ecosystem once they become the “holders of the data”. If they capture a significant portion of the market prior to decentralizing, they won’t be inclined to give it away.

The same is true of EtherScan, BlockScout, Web3Scan, Google Big Table, TheGraph, etc. They are proposing to take on huge and continuing costs to provide Ethereum data. Giving it back to the world in a way that is both censorship-resistant and provably-true seems nearly impossible. If the data cannot be censored, and it’s immutable, and it’s provably-true, people may pay for it, but only once. After that, they will simply cache the parts of the data they’re interested in. There’s no business model that can justify the ever-growing costs.

QuickBlocks goes in the exact opposite direction. We’re trying to think of ways to make the nodes respond better to users’ needs. There’s no reason the nodes can’t be made more flexible. I don’t think there is, nor should there be, a sustainable business model that delivers immutable data that’s already been consented to. That data is super valuable, but it’s already fully owned by the community. We shouldn’t have to pay for it twice. Plus, more people can afford to run software than can ever afford to buy data.

[Bob]
There are two different dimensions to be considered here. The software itself (which can and should be open source — problem #1) and the running instances of that infrastructure (problem #2, where the incentive models really bite). My hope would be that we can run instances collectively for infrastructure too.

[I love this distinction and the suggestion that there might be collectively run instances of the infrastructure — another argument against sustainable business models. Why would anyone pay for this data if they can band together and provide it to each other for free? My friend, Ed Mazurek, calls this the ‘public library’ model of delivering the data.]

[In answer to my comment that nodes can be made better…]

I would agree, though changing the way the data works might impact the actual blockchain itself. As Rick has said, Ethereum has made some fundamentally poor design decisions, which may or may not be directly fixable. We also have the issue of different needs for users versus developers — from a light client through to full archival nodes. In SF the other week I was talking to Kumavis about the in-browser client for Metamask over LIBP2P. Their conclusion was that their initial “naive approach” of dumping the data straight into IPFS just wasn’t workable because of the complexity of the data structures. Even doing something different, there is the incentive issue. Metamask has 1M+ in-browser light client nodes which they could add to the network, but then what is the incentive for all of those nodes to be served? Who will run the full nodes they leech off? These are big issues.

[TJR]
This is an excellent distinction. You’re right. Even if the software is open source, the cost of standing up that software (and keeping it running) is prohibitive. Running collective instances is one solution. Another is to build, much lower down in the pipeline, better techniques that allow more fine-grained access to the data directly from the node. All current explorers anticipate a full extraction from the node into a copy of the data in a multi-terabyte database. The mentality here is to solve all problems for all people. This is inherently centralizing. If, instead, the node made accessing particular data for particular users easier, the round trip through a multi-terabyte, fully-centralized solution would be needed far less often.
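
[To make “fine-grained access directly from the node” concrete, here is another small sketch of mine, under the assumption of a standard JSON-RPC endpoint: rather than extracting every block into a multi-terabyte copy, ask the node only for the log entries emitted by the one contract a particular user cares about. The endpoint and address are placeholders.]

```python
# Minimal sketch: fine-grained extraction straight from the node.
# Assumptions: a standard JSON-RPC endpoint at NODE_URL and a single contract
# address of interest; nothing else on the chain is ever fetched or copied.
import requests

NODE_URL = "http://localhost:8545"               # assumed node endpoint
ADDRESS = "0xB97073B754660BB356DfE12f78aE366D77DBc80f"  # the one contract we care about

def rpc(method, params):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    resp = requests.post(NODE_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["result"]

# eth_getLogs lets the node do the filtering for us. (A real tool would scan
# in smaller block ranges; some nodes reject a single 0-to-latest query.)
logs = rpc("eth_getLogs", [{
    "fromBlock": "0x0",
    "toBlock": "latest",
    "address": ADDRESS,
}])

for log in logs:
    first_topic = log["topics"][0] if log["topics"] else ""
    print(log["blockNumber"], log["transactionHash"], first_topic)
```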

Why can’t the nodes write their data to a fully indexed database after the verification is completed? (Yes. I know why. Forks!) Can we build specialized nodes that are half light-node, half archive node, optimized for auditing or accounting? This would be a full verifying node, but instead of (or in addition to) storing the previously verified data in an append-only log, it could store it in an indexed database. The log could be regenerated if needed. Perhaps it’s not practical, but the current situation is not practical either.
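
[And a sketch of the “indexed database” idea, again mine and purely illustrative: a process that follows a node over JSON-RPC, writes each block’s transactions into an indexed SQLite table, and deals with the fork problem by deleting and re-writing any height whose contents change on a reorg. The endpoint and schema are assumptions, not anything QuickBlocks or the clients actually do.]

```python
# Minimal sketch of the half light-node / half archive-node idea:
# follow a node over standard JSON-RPC and keep block data in an indexed
# SQLite table instead of (or in addition to) an append-only log.
# Assumptions: NODE_URL is a standard JSON-RPC endpoint; the schema is illustrative.
import sqlite3
import requests

NODE_URL = "http://localhost:8545"

def rpc(method, params):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    resp = requests.post(NODE_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

db = sqlite3.connect("blocks.db")
db.execute("""CREATE TABLE IF NOT EXISTS txs (
    block_number INTEGER, block_hash TEXT, tx_hash TEXT,
    sender TEXT, recipient TEXT, value_wei TEXT)""")
db.execute("CREATE INDEX IF NOT EXISTS idx_sender ON txs(sender)")
db.execute("CREATE INDEX IF NOT EXISTS idx_recipient ON txs(recipient)")

def index_block(number):
    """Fetch one block (with full transactions) and (re)index it. Re-writing
    the whole height is a crude but simple answer to the fork problem."""
    block = rpc("eth_getBlockByNumber", [hex(number), True])
    db.execute("DELETE FROM txs WHERE block_number = ?", (number,))
    for tx in block["transactions"]:
        db.execute("INSERT INTO txs VALUES (?, ?, ?, ?, ?, ?)",
                   (number, block["hash"], tx["hash"],
                    tx["from"], tx.get("to"), tx["value"]))
    db.commit()

# Example: index the most recent ten blocks.
latest = int(rpc("eth_blockNumber", []), 16)
for n in range(latest - 9, latest + 1):
    index_block(n)
```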

Also — about the incentive issues. Do the incentives always have to be currency or tokens? Might it not be the case that the data itself, because it’s so valuable, is incentive enough?

[Our conversation ended here with a promise from me to buy Bob another beer in Prague. Cheers.]

Support My Work

I’m interested in your thoughts. Please clap for this article and post your comments below. Consider supporting our work. Send us a small (or large) contribution to 0xB97073B754660BB356DfE12f78aE366D77DBc80f.

Thomas Jay Rush owns the software company QBlocks, whose primary project, also called QBlocks, is a collection of software libraries and applications enabling real-time, per-block smart contract monitoring and analytics for the Ethereum blockchain. Contact him through the website.