Author: Kerman Kohli | Source: Substack | Translation: Shan Ouba, Golden Finance
It's 2024, and you would think that getting crypto data would be easy: with Etherscan, Dune, and Nansen, you can look up whatever data you want at any time. On the surface, it does seem that way. In practice, though, it is much harder than it looks, for a few reasons.
Scale
You see, in the normal web2 world, when your company has 10 employees and 100,000 customers, the amount of data you generate probably doesn't exceed 100 GB (at the upper end). That scale is small enough that a single iPhone could store everything and answer any question you have about it. However, once you have 1,000 employees and 100,000,000 customers, the amount of data you process may be hundreds of terabytes, or even petabytes.
This is a fundamentally different challenge, because at that scale there is far more to think about. To process hundreds of terabytes of data, you need a distributed cluster of machines to send jobs to. When sending those jobs, you have to consider the following (sketched in code after this list):
What happens if a worker fails partway through its job?
What happens if one worker takes much longer than the others?
How do you decide which job to assign to which worker?
How do you merge all the results together and make sure the calculations are correct?
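To make these concerns concrete, here is a minimal, hypothetical sketch in Python. The "job" (summing transfer values per chunk), the failure rate, and the retry policy are all invented for illustration; real pipelines use frameworks like Spark, but they wrestle with exactly the same questions.

```python
import random
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def sum_chunk(chunk):
    """One 'worker': aggregate a chunk of transfer values."""
    if random.random() < 0.1:
        raise RuntimeError("worker died")        # concern 1: a worker fails
    time.sleep(random.choice([0.1, 0.1, 1.5]))   # concern 2: the occasional straggler
    return sum(chunk)

def run_job(chunks, retries=3):
    results = []
    attempts = {i: 0 for i in range(len(chunks))}
    with ProcessPoolExecutor(max_workers=4) as pool:          # concern 3: scheduling
        futures = {pool.submit(sum_chunk, c): i for i, c in enumerate(chunks)}
        while futures:
            for fut in as_completed(list(futures)):
                i = futures.pop(fut)
                try:
                    results.append(fut.result())
                except RuntimeError:
                    attempts[i] += 1
                    if attempts[i] >= retries:
                        raise                                  # the whole job fails
                    # reschedule the failed chunk on another worker
                    futures[pool.submit(sum_chunk, chunks[i])] = i
    # concern 4: merging; only trivially correct because addition is associative
    return sum(results)

if __name__ == "__main__":
    data = [[random.randint(1, 100) for _ in range(1_000)] for _ in range(20)]
    print(run_job(data))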
These are all things you have to think about when running big-data computations across many machines. Scale creates problems that are invisible to anyone who doesn't work at it: the larger the data, the more infrastructure you need just to manage it correctly. To handle this scale, you face further challenges:
Extremely specialized talent that knows how to operate machines at this scale
The cost of storing and computing all that data
Forward planning and architecture to ensure your needs can be supported
Interestingly, in web2 everyone wanted data to be public. In web3 it finally is, but few people know how to do the work needed to make sense of it. The deceptive part is that, with some help, you can pull your own dataset out of the global dataset fairly easily. In other words, "local" data is easy, but "global" data (the stuff about everyone and everything) is hard to get.
Fragmentation
As if the scale you have to deal with weren't challenging enough, there is another dimension that makes crypto data hard: the market's economic incentives constantly fragment it. For example:
The rise of new blockchains. There are nearly 50 L2s live, another 50 known to be coming, and hundreds more in the pipeline. Each L2 is effectively a new database that needs to be indexed and configured. Hopefully they are standardized, but you can't always be sure!
The rise of new kinds of VMs. The EVM is just one corner of the market; the SVM, Move VM, and countless others are coming. Each new kind of VM means a whole new data schema that has to be understood from first principles. How many VMs will there eventually be? Nobody knows, and investors are incentivizing new ones with billions of dollars!
The rise of new account primitives. Smart contract wallets, custodial wallets, and account abstraction introduce new complexity into how you interpret data. The sender address may not be the real user, because the transaction was submitted by a relayer; the actual user is buried somewhere inside it (if you look closely), as the sketch below illustrates.
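Here is a deliberately simplified, hypothetical illustration of that attribution problem for ERC-4337-style relayed transactions. The data structures are toy stand-ins rather than a real decoder; in practice you would decode the EntryPoint's handleOps calldata or its UserOperationEvent logs to recover each operation's sender.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class UserOperation:
    sender: str        # the smart-contract wallet that actually acted
    call_data: bytes   # what the wallet was asked to do

@dataclass
class RelayedTx:
    tx_from: str                    # the bundler/relayer that paid for inclusion
    to: str                         # the EntryPoint contract
    user_ops: list[UserOperation]   # the operations bundled inside

def naive_users(txs):
    """What a naive dashboard counts: one 'user' per tx.from."""
    return {t.tx_from for t in txs}

def actual_users(txs):
    """The wallets that actually initiated activity."""
    return {op.sender for t in txs for op in t.user_ops}

bundle = RelayedTx(
    tx_from="0xBundler",
    to="0xEntryPoint",
    user_ops=[UserOperation("0xAliceWallet", b""), UserOperation("0xBobWallet", b"")],
)
print(naive_users([bundle]))   # {'0xBundler'}: one apparent "user"
print(actual_users([bundle]))  # {'0xAliceWallet', '0xBobWallet'}: the two real users
```

Any dashboard that counts tx.from as the user reports the bundler once instead of the two wallets behind it.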
Fragmentation is especially hard because you can't quantify what you don't know. You will never know every L2 in existence, or how many VMs there will eventually be. Once they reach sufficient scale you can catch up with them, but that's another story.
Open, but not interoperable
The last problem, and I think the one that surprises the most people, is that the data is open but not easily interoperable. You see, all the smart contracts that teams have pieced together are like small databases inside one large database; I like to think of them as schemas. All the data is out there, but generally only the team that developed a smart contract knows how to piece it together. You can spend the time to figure it out yourself if you want, but you would have to do that hundreds of times over for all the potential schemas, and how do you do that without spending a fortune, unless there is a buyer on the other side of the transaction?
If this seems too abstract, let me give you an example. You ask: "How often does this user use a bridge?" While this sounds like one question, there are many questions nested inside it. Let's break it down:
First, you need to know every bridge that exists on the chains you care about. If that means all chains, we have already covered above why that is challenging.
Then, for each bridge, you need to understand how its smart contracts work.
Once you understand all the permutations, you need to reason your way to a model that can unify all of these individual schemas.
Each of these steps is hard to get right and requires a lot of resources. The sketch below shows roughly what the unification step looks like for just two bridges.
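A hypothetical Python sketch, assuming two made-up bridges with invented event fields. Every real bridge has its own contracts and event schemas that you would have to study and decode separately before it could be mapped into a unified model like this one.

```python
from dataclasses import dataclass

@dataclass
class BridgeTransfer:        # the unified model that every bridge must map into
    user: str
    src_chain: str
    dst_chain: str
    token: str
    amount: int

def parse_bridge_a(log):
    # Bridge A emits something like Deposit(user, token, amount, destinationChain)
    return BridgeTransfer(log["user"], "ethereum", log["destinationChain"],
                          log["token"], log["amount"])

def parse_bridge_b(log):
    # Bridge B emits something like TokensSent(sender, asset, value) to a fixed L2
    return BridgeTransfer(log["sender"], "ethereum", "some-l2",
                          log["asset"], log["value"])

# one hand-written parser per bridge contract you know about
PARSERS = {"0xBridgeA": parse_bridge_a, "0xBridgeB": parse_bridge_b}

def bridge_uses(logs, user):
    """Answer 'how often does this user use a bridge?' across the bridges we know."""
    transfers = [PARSERS[log["address"]](log) for log in logs if log["address"] in PARSERS]
    return sum(1 for t in transfers if t.user == user)

logs = [
    {"address": "0xBridgeA", "user": "0xAlice", "token": "USDC",
     "amount": 500, "destinationChain": "arbitrum"},
    {"address": "0xBridgeB", "sender": "0xAlice", "asset": "ETH", "value": 1},
]
print(bridge_uses(logs, "0xAlice"))  # 2
```

Multiply the per-bridge parser by every bridge on every chain, and the cost of answering the original question becomes clear.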
The Result
So where does all this lead? Well, the state of our ecosystem today is…
No one in the ecosystem truly knows what is going on. There are only vague notions of activity that are hard to quantify properly.
User numbers are inflated and Sybil attacks are hard to detect. Metrics become irrelevant and untrustworthy! Real and fake don't even matter to market participants, because the two look the same.
This is the main obstacle to making on-chain identity real. Accurate data is essential if you want identity to mean anything; otherwise identities will be misrepresented!
I hope this article helped you understand the realities of the crypto data landscape.