From the first wave of dApps in 2017 (Etheroll, ETHLend and CryptoKitties) to today's flourishing financial, gaming and social dApps built on different blockchains, have we ever stopped to ask where the data these decentralized on-chain applications consume in their interactions actually comes from?
In 2024, the spotlight is on AI and Web3. In the world of artificial intelligence, data is the lifeblood of its growth and evolution. Just as plants rely on sunlight and water to thrive, AI systems rely on massive amounts of data to continuously "learn" and "think". Without data, even the most sophisticated AI algorithm is a castle in the air, unable to deliver its intended intelligence and efficiency.
From the perspective of blockchain data accessibility, this article analyzes the evolution of blockchain data indexing as the industry has developed, and compares the established data indexing protocol The Graph with the emerging blockchain data service protocols Chainbase and Space and Time, focusing on the similarities and differences in how these two newer protocols combine AI technology with their data services and product architectures.
2 Complexity and simplicity of data indexing: from blockchain nodes to full-chain databases
2.1 Data source: blockchain nodes
When first learning what a blockchain is, we often encounter the phrase: a blockchain is a decentralized ledger. Blockchain nodes are the foundation of the entire network, responsible for recording, storing and propagating all on-chain transaction data. Each node holds a complete copy of the blockchain data, preserving the network's decentralized nature. However, for ordinary users, building and maintaining a node is not easy. It requires specialized technical skills and comes with high hardware and bandwidth costs. At the same time, an ordinary node offers limited query capabilities and cannot serve data in the formats developers need. So although in theory everyone can run their own node, in practice users usually prefer to rely on third-party services.
To solve this problem, RPC (Remote Procedure Call) node providers emerged. They bear the cost and management of running nodes and expose data through RPC endpoints, allowing users to access blockchain data without operating nodes themselves. Public RPC endpoints are free but rate-limited, which can degrade a dApp's user experience. Private RPC endpoints reduce congestion and perform better, but even simple data retrieval requires a lot of back-and-forth communication, making them request-heavy and inefficient for complex queries. In addition, private RPC endpoints are often hard to scale and lack compatibility across networks. Still, the standardized API interfaces of node providers lower the threshold for accessing on-chain data, laying the foundation for subsequent data parsing and applications.
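As a concrete illustration, the sketch below builds a JSON-RPC request of the kind a dApp sends to an RPC endpoint and decodes a typical response. The endpoint URL and the response payload are hypothetical placeholders, not a real provider:

```python
import json

# Hypothetical RPC endpoint; real providers expose URLs of the same shape.
RPC_URL = "https://example-rpc-provider.invalid"

def make_rpc_request(method: str, params: list) -> str:
    """Build a JSON-RPC 2.0 request body for an Ethereum node."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": method,
        "params": params,
    })

def parse_hex_quantity(result: str) -> int:
    """Node responses encode numeric quantities as 0x-prefixed hex strings."""
    return int(result, 16)

# Example: a request for the latest block number, and decoding
# a made-up but typically shaped response payload.
req = make_rpc_request("eth_blockNumber", [])
response = {"jsonrpc": "2.0", "id": 1, "result": "0x134e82a"}
block_number = parse_hex_quantity(response["result"])
```

Even this trivial exchange shows why RPC access becomes request-heavy: every piece of data needs its own round trip, and the raw result still has to be decoded by the caller.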
2.2 Data Parsing: From Raw Data to Usable Data
The data obtained from blockchain nodes is typically raw data that has been serialized and encoded. While this preserves the integrity and security of the blockchain, its complexity also makes the data harder to work with. For ordinary users or developers, processing this raw data directly requires substantial technical knowledge and computing resources.
This is where data parsing becomes essential. By decoding complex raw data into formats that are easier to understand and operate on, parsing lets users work with the data far more intuitively. The quality of data parsing directly determines the efficiency and effectiveness of blockchain data applications, making it a key step in the entire data indexing process.
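To make the parsing step concrete, the sketch below decodes a raw ERC-20 Transfer event log, of the kind a node returns from eth_getLogs, into human-readable fields. The addresses and amount are made-up example values; only the Transfer event signature hash is the real, well-known constant:

```python
# Raw log data as returned by a node: every field is 32-byte-padded hex.
log = {
    "topics": [
        # keccak256("Transfer(address,address,uint256)") -- the event signature
        "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
        # "from" and "to" addresses, left-padded to 32 bytes (example values)
        "0x000000000000000000000000" + "a" * 40,
        "0x000000000000000000000000" + "1" * 40,
    ],
    # The transferred amount as a 32-byte uint256 (example value: 1,500,000)
    "data": "0x" + hex(1_500_000)[2:].rjust(64, "0"),
}

def decode_transfer(log: dict) -> dict:
    """Turn a raw Transfer log into human-readable fields."""
    return {
        "from": "0x" + log["topics"][1][-40:],  # last 20 bytes of the topic
        "to": "0x" + log["topics"][2][-40:],
        "value": int(log["data"], 16),          # uint256 amount
    }

decoded = decode_transfer(log)
```

Without this decoding step, the log is just opaque hex; after it, the same bytes read as "address A sent 1,500,000 token units to address B", which is the form applications actually need.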
2.3 Evolution of Data Indexers
With the increase in the amount of blockchain data, the demand for data indexers is also increasing. Indexers play a vital role in organizing on-chain data and sending it to a database for easy querying. Indexers work by indexing blockchain data and making it readily available through a SQL-like query language (APIs such as GraphQL). By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need using a standardized query language, greatly simplifying the process.
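For illustration, here is roughly what querying an indexer looks like in practice: a GraphQL query goes out as a plain HTTP POST body, and the structured response is straightforward to consume. The subgraph URL, entity names and fields below are hypothetical, not a real schema:

```python
import json

# Hypothetical subgraph endpoint; real endpoints follow the same pattern.
SUBGRAPH_URL = "https://api.example-indexer.invalid/subgraphs/name/example-dex"

# A GraphQL query asking the indexer for the five most recent swaps.
QUERY = """
{
  swaps(first: 5, orderBy: timestamp, orderDirection: desc) {
    id
    amountUSD
    timestamp
  }
}
"""

# Indexers expose GraphQL over plain HTTP POST with a JSON body.
request_body = json.dumps({"query": QUERY})

def swap_amounts(response: dict) -> list:
    """Pull the USD amounts out of a GraphQL response."""
    return [float(s["amountUSD"]) for s in response["data"]["swaps"]]

# A mock response shaped like what an indexer would return.
mock_response = {
    "data": {
        "swaps": [
            {"id": "0xabc-1", "amountUSD": "1234.5", "timestamp": "1700000000"},
        ]
    }
}
```

Contrast this with the RPC path: one declarative query replaces many round trips, and filtering, ordering and pagination are handled by the indexer rather than the application.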
Different types of indexers optimize data retrieval in various ways:
Full node indexers: These indexers run a full blockchain node and extract data directly from it, ensuring that the data is complete and accurate, but requiring a lot of storage and processing power.
Lightweight indexers: These indexers rely on full nodes to fetch specific data as needed, reducing storage requirements but potentially increasing query times.
Specialized indexers: These indexers specialize in certain types of data or specific blockchains, optimizing retrieval for specific use cases, such as NFT data or DeFi transactions.
Aggregate indexers: These indexers extract data from multiple blockchains and sources, including off-chain information, and provide a unified query interface, which is particularly useful for multi-chain dApps.
Currently, an Ethereum archive node running the Geth client occupies about 13.5 TB of storage, while under the Erigon client the archive requirement is about 3 TB. As the blockchain grows, archive nodes' storage requirements will keep increasing. Faced with such volumes of data, mainstream indexer protocols not only support multi-chain indexing but also offer customizable data parsing frameworks tailored to different applications' data needs. The Graph's "Subgraph" framework is a typical example.
The emergence of indexers has greatly improved the efficiency of data indexing and querying. Compared with traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. These indexers allow users to perform complex queries, filter data easily, and analyze it after extraction. In addition, some indexers also support aggregating data sources from multiple blockchains, avoiding the problem of deploying multiple APIs in multi-chain dApps. By running distributedly on multiple nodes, indexers not only provide stronger security and performance, but also reduce the risk of interruption and downtime that centralized RPC providers may bring.
Compared with raw node access, indexers expose predefined query languages that let users obtain exactly the information they need without dealing with the underlying low-level data. This mechanism significantly improves the efficiency and reliability of data retrieval and represents an important innovation in blockchain data access.
2.4 Full-chain database: Stream-first alignment
Using index nodes to query data usually means the API becomes the sole portal for consuming on-chain data. But when a project enters its scaling stage, it often needs more flexible data sources than standardized APIs can provide. As application requirements grow more complex, first-generation data indexers and their standardized index formats gradually fall short of increasingly diverse query needs such as search, cross-chain access, and off-chain data mapping.
In modern data pipeline architectures, the "stream first" approach has become a solution to the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift enables organizations to respond immediately to incoming data, thereby drawing insights and making decisions almost instantly. Similarly, the development of blockchain data service providers is also moving towards building blockchain data streams. Traditional indexer service providers have successively launched products that obtain real-time blockchain data in the form of data streams, such as The Graph's Substreams, Goldsky's Mirror, and real-time data lakes that generate data streams based on blockchains, such as Chainbase and SubSquid.
These services are designed to address the need for real-time parsing of blockchain transactions and providing more comprehensive query capabilities. Just as the "stream-first" architecture has revolutionized the way data is processed and consumed in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream service providers also hope to support the development of more applications and assist on-chain data analysis through more advanced and mature data sources.
Redefining the challenges of on-chain data through the perspective of modern data pipelines allows us to look at the full potential of on-chain data management, storage, and provision from a new perspective. When we start to think of subgraphs and indexers such as Ethereum ETL as data flows in a data pipeline rather than final outputs, we can imagine a possible world where high-performance datasets can be tailored for any business use case.
3 AI + Database? In-depth comparison of The Graph, Chainbase, Space and Time
3.1 The Graph
The Graph network provides multi-chain data indexing and query services through a decentralized network of nodes, making it easy for developers to index blockchain data and build decentralized applications. Its main product models are a data query execution market and a data index cache market, both of which ultimately serve users' query needs. In the query execution market, consumers choose a suitable index node to serve the data they need and pay for it. The index cache market is where index nodes allocate resources based on a subgraph's historical indexing popularity, the query fees it collects, and on-chain curators' demand for its outputs.
Subgraphs are the basic data structures in The Graph network. They define how to extract and transform data from the blockchain into a queryable format (such as a GraphQL schema). Anyone can create a subgraph, and multiple applications can reuse these subgraphs, which improves data reusability and efficiency.
The Graph product structure (Source: The Graph Whitepaper)
The Graph network consists of four key roles: indexers, curators, delegators and developers, who together provide data support for web3 applications. Here are their respective responsibilities:
Indexer: Indexers are node operators in The Graph network. They participate by staking GRT (The Graph's native token) and provide indexing and query processing services.
Delegator: Delegators are users who stake GRT tokens to index nodes to support their operations. Delegators earn part of the rewards through the index nodes they delegate.
Curator: Curators are responsible for signaling which subgraphs should be indexed by the network. Curators help ensure that valuable subgraphs are prioritized.
Developer: Unlike the three supply-side roles above, developers are the demand side and the main users of The Graph. They create and submit subgraphs to the network and wait for the network to serve the data they need.
At present, The Graph has fully transitioned to decentralized subgraph hosting. Economic incentives circulate among the different participants to keep the system running:
Indexer rewards: Indexers earn income from consumers' query fees and a share of GRT block rewards.
Delegator rewards: Delegators receive a portion of the rewards earned by the indexers they support.
Curator rewards: Curators who signal valuable subgraphs receive a share of the query fees.
In fact, The Graph's products are also developing rapidly in the wave of AI. As one of the core development teams of The Graph ecosystem, Semiotic Labs has been committed to using AI technology to optimize index pricing and user query experience. Currently, the AutoAgora, Allocation Optimizer and AgentC tools developed by Semiotic Labs have improved the performance of the ecosystem in multiple aspects.
AutoAgora introduces a dynamic pricing mechanism to adjust prices in real time based on query volume and resource usage, optimize pricing strategies, and ensure the competitiveness and revenue maximization of indexers.
Allocation Optimizer solves the complex problem of subgraph resource allocation, helping indexers achieve optimal resource configuration to improve revenue and performance.
AgentC is an experimental tool that allows users to access The Graph's blockchain data through natural language, thereby improving user experience.
The application of these tools has enabled The Graph to further improve the intelligence and user-friendliness of the system in combination with AI assistance.
3.2 Chainbase
Chainbase is a full-chain data network that integrates all blockchain data into one platform, making it easier for developers to build and maintain applications. Its unique features include:
Real-time data lake: Chainbase provides a real-time data lake dedicated to blockchain data streams, allowing data to be accessed instantly when it is generated.
Dual-chain architecture: Chainbase built an execution layer on EigenLayer AVS, paired with a consensus layer based on the CometBFT algorithm, forming a parallel dual-chain architecture. This design enhances the programmability and composability of cross-chain data, supports high throughput, low latency and fast finality, and strengthens network security through a dual-staking model.
Innovative data format standards: Chainbase introduced a new data format standard called "manuscripts" to optimize the structuring and utilization of data in the crypto industry.
Crypto world model: Drawing on its vast blockchain data resources, Chainbase combined them with AI model technology to create an AI model that can effectively understand, predict and interact with blockchain transactions. The basic version of the model, Theia, has been launched for public use.
These features make Chainbase stand out among blockchain indexing protocols, with a particular focus on the accessibility of real-time data, innovative data formats, and the combination of on-chain and off-chain data to create smarter models to enhance insights.
Chainbase's AI model Theia is the key highlight that distinguishes it from other data service protocols. Based on the DORA model developed by NVIDIA, Theia combines on-chain and off-chain data and spatiotemporal activities to learn and analyze cryptographic patterns and respond through causal reasoning, thereby deeply exploring the potential value and regularity of on-chain data and providing users with more intelligent data services.
AI-enabled data services make Chainbase no longer just a blockchain data service platform, but a more competitive intelligent data service provider. Through powerful data resources and AI's active analysis, Chainbase is able to provide broader data insights and optimize users' data processing.
3.3 Space and Time
Space and Time (SxT) aims to create a verifiable computing layer and expand zero-knowledge proofs on decentralized data warehouses to provide trusted data processing for smart contracts, large language models, and enterprises. Space and Time has received $20 million in the latest round of Series A funding, led by Framework Ventures, Lightspeed Faction, Arrington Capital and Hivemind Capital.
In the field of data indexing and verification, Space and Time has introduced a new technical path: Proof of SQL. This is an innovative zero-knowledge proof (ZKP) technology developed by Space and Time to ensure that SQL queries executed on its decentralized data warehouse are tamper-proof and verifiable. When a query runs, Proof of SQL generates a cryptographic proof of the integrity and accuracy of the query results. The proof is attached to the results, allowing any verifier (such as a smart contract) to independently confirm that the data was not tampered with during processing.

Traditional blockchain networks usually rely on consensus mechanisms to verify data authenticity, whereas Proof of SQL enables a more efficient form of verification: in Space and Time's system, one node acquires the data while other nodes verify its authenticity using zk technology. This avoids the redundant cost, under a consensus mechanism, of many nodes repeatedly indexing the same data before finally reaching agreement, and improves overall system performance. As the technology matures, it lays a foundation for traditional industries that depend on data reliability to build products on blockchain data.
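As a conceptual toy (not the actual cryptography), the pattern can be sketched as follows: a prover runs the query and emits a proof bound to the result, and verifiers check the proof instead of re-executing the query under consensus. Here a plain hash commitment stands in for the real zero-knowledge proof, so this toy verifier still needs the input data, which a genuine ZK proof avoids:

```python
import hashlib
import json

# Toy illustration of the Proof of SQL verification pattern.
# Real Proof of SQL uses zero-knowledge cryptography; the hash
# commitment below only mimics the prove-then-verify flow.

def run_query_with_proof(table: list, predicate) -> tuple:
    """Prover: compute the query result and commit to (inputs, result)."""
    result = [row for row in table if predicate(row)]
    commitment = hashlib.sha256(
        json.dumps([table, result], sort_keys=True).encode()
    ).hexdigest()
    return result, commitment

def verify(table: list, result: list, proof: str) -> bool:
    """Verifier: accept only if the proof matches the claimed result."""
    expected = hashlib.sha256(
        json.dumps([table, result], sort_keys=True).encode()
    ).hexdigest()
    return proof == expected

# One node runs the query (SELECT * WHERE value > 50, expressed as a
# predicate); others verify the claimed result without re-running it.
table = [{"block": 1, "value": 10}, {"block": 2, "value": 99}]
result, proof = run_query_with_proof(table, lambda r: r["value"] > 50)
```

The design point the sketch captures is the division of labor: computation happens once, and everyone else performs a cheap check, rather than every node redoing the same indexing work to reach consensus.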
At the same time, SxT has been working closely with the Microsoft AI Joint Innovation Lab to accelerate the development of generative AI tools to make it easier for users to process blockchain data through natural language. Currently in Space and Time Studio, users can experience entering natural language queries, and AI will automatically convert them into SQL and execute query statements on behalf of users to present the final results required by users.
3.4 Difference Comparison
4 Conclusion and Outlook
In summary, blockchain data indexing technology has evolved step by step, from the initial node data sources, through data parsing and indexers, to AI-empowered full-chain data services. The continuous evolution of these technologies has not only improved the efficiency and accuracy of data access, but also brought users an unprecedented level of intelligent experience.
Looking to the future, with the continuous development of new technologies such as AI technology and zero-knowledge proof, blockchain data services will become more intelligent and secure. We have reason to believe that blockchain data services will continue to play an important role as infrastructure in the future and provide strong support for the progress and innovation of the industry.