Source: PermaDAO
FirstBatch is the parent company of Dria, an open-source knowledge aggregation platform stored on Arweave. Dria aims to enable knowledge exchange between humans and machines, and FirstBatch calls it the "AI version of Wikipedia". FirstBatch recently launched a series of research reports on decentralized AI, focusing on the intersection of data aggregation and decentralization. This article introduces the first report in the series, "Data Collection: Quality, Copyright and Ownership", with a focus on how decentralization addresses data collection problems and on the risks and challenges of decentralized solutions.
How decentralization solves the problems encountered in data collection
Problems that AI teams and developers currently encounter in data collection:
1. Insufficient data
2. Lack of high-quality data
3. Storage issues
4. Privacy control
5. Copyright issues
We will look at how decentralization provides solutions to these problems one by one.
On data volume, Meta’s chief AI scientist has pointed out that, despite the great progress in LLMs, the data used to train AI models still amounts to less than the information a 4-year-old child acquires. Currently, the types and sources of data are limited largely to text and certain vertical fields. FirstBatch envisions using social or financial incentives to encourage teams or individuals to review and filter data. This would greatly speed up the introduction of new data types and add more data sources.
AI developers today struggle both to collect high-quality data and to assess the quality of the data they collect: data sources contain a lot of duplicate and outdated data, and current automatic detection methods reduce the accuracy and quality of the data. Inspired by how open data platforms such as Hugging Face, Kaggle and Wikipedia have improved data quality, FirstBatch proposes establishing a decentralized open data center where everyone can participate in screening, reviewing and evaluating data. This both relieves the processing pressure on the small teams dedicated to ensuring data set quality and prevents the data from being manipulated or tampered with by a single organization. With appropriate incentive mechanisms, such decentralized open data centers and community-based review processes can maintain data quality even when data flows in at high speed and in large quantities. Dria, a FirstBatch product, is currently building such a decentralized global knowledge center.
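To make the community review idea concrete, the following is a minimal Python sketch of how reviewer votes might be tallied. It is an illustration only, not Dria's or FirstBatch's actual mechanism: the Review record, the accepted_records function, and the min_reviews and approval_ratio parameters are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One reviewer's verdict on one record (hypothetical structure)."""
    reviewer: str
    record_id: str
    approve: bool


def accepted_records(reviews, min_reviews=3, approval_ratio=0.6):
    """Return the IDs of records that enough distinct reviewers approved.

    min_reviews and approval_ratio are assumed thresholds, not values
    taken from the FirstBatch report.
    """
    tallies = {}   # record_id -> (approvals, total votes)
    seen = set()   # (reviewer, record_id) pairs already counted
    for review in reviews:
        key = (review.reviewer, review.record_id)
        if key in seen:
            continue  # count at most one vote per reviewer per record
        seen.add(key)
        approvals, total = tallies.get(review.record_id, (0, 0))
        tallies[review.record_id] = (approvals + int(review.approve), total + 1)
    return {
        record_id
        for record_id, (approvals, total) in tallies.items()
        if total >= min_reviews and approvals / total >= approval_ratio
    }
```

Requiring a minimum number of distinct reviewers keeps a single participant from deciding a record's fate, which is the property the decentralized review process is meant to provide.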
For AI projects, the storage problems are cost and maintenance. Faced with growing data volumes and the rising subscription fees that follow, some teams consider buying larger capacity in advance to obtain discounts, but this is wasteful both economically and technically. FirstBatch chooses to store data on Arweave, which stores data permanently and thus protects against the risk of data loss. In addition, a shared data pool can be created on it where everyone stores their data in one place, which avoids the wasted space and storage cost of keeping the same data in several different places.
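The saving from a shared pool can be illustrated with a small content-addressed store, where identical data submitted by different contributors is kept only once. This is a sketch under stated assumptions: SharedDataPool is a hypothetical in-memory class, it does not call any Arweave API, and keying entries by SHA-256 hash is a design choice made for this example rather than something described in the report.

```python
import hashlib


class SharedDataPool:
    """In-memory stand-in for a shared data pool (not an Arweave client)."""

    def __init__(self):
        self._blobs = {}          # content hash -> stored bytes
        self.bytes_submitted = 0  # total size of everything contributors sent

    def put(self, data: bytes) -> str:
        """Store data under its SHA-256 hash; duplicates are kept only once."""
        key = hashlib.sha256(data).hexdigest()
        self.bytes_submitted += len(data)
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        """Fetch previously stored data by its content hash."""
        return self._blobs[key]

    @property
    def bytes_stored(self) -> int:
        """Size of the data actually held after deduplication."""
        return sum(len(blob) for blob in self._blobs.values())
```

Comparing bytes_submitted with bytes_stored shows how much space would have been wasted if every contributor had kept a separate copy.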
Some of the collected data is personally identifiable and therefore private. Exposing it on a collaborative platform for thousands of people to screen and review would violate some privacy regulations. FirstBatch proposes applying zero-knowledge proofs or DID (decentralized identifier) technology before private data enters the public screening platform, so that future online activity data can be processed in a privacy-preserving way.
Many online platforms and media outlets have questioned AI companies’ use of copyrighted material, claiming that training and using AI models infringes on the original content. Because on-chain actions are transparent and immutable, NFTs can make the ownership of creative and intellectual property materials clear and transparent. These tokens can be used to verify and identify which materials are subject to which type of procedure, making it easier to clean data and respond to litigation.
Risks and challenges of decentralized solutions
Decentralized solutions have clear benefits, but one remaining issue is the risk posed by user anonymity. For example, when it comes to regulatory issues around copyright or harmful content, anonymous violations can cause larger problems and put platforms at risk. Data uploaded to a decentralized network and stored there permanently may still contain harmful content. Even with public data review, some of that content will inevitably slip through the net.
Another current challenge is how to weight data volume against data quality when designing incentives. However the platform is structured, some contributors will upload large amounts of lower-quality data, while others will upload smaller amounts of higher-quality data.
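One way to picture this trade-off is a scoring rule that ignores items below a quality floor and rewards the rest more than proportionally to their quality. The Python sketch below is purely illustrative and is not an incentive scheme proposed by FirstBatch: the contribution_score function, the quality_floor and quality_exponent parameters, and the assumption that each item already has a quality score between 0 and 1 are all hypothetical.

```python
def contribution_score(quality_scores, quality_floor=0.5, quality_exponent=2.0):
    """Score a contributor's uploads from per-item quality scores in [0, 1].

    Items below quality_floor earn nothing, so volume alone is not rewarded.
    Raising each remaining score to quality_exponent (> 1) makes one
    high-quality item worth more than several mediocre ones.
    Both parameters are assumed tuning knobs, not values from the report.
    """
    return sum(
        score ** quality_exponent
        for score in quality_scores
        if score >= quality_floor
    )
```

With these defaults, three items scored 0.9 outrank nine items scored 0.5, and any number of items scored 0.4 earns nothing. Different values for the floor and the exponent shift the balance between quantity and quality, which is exactly the open design question.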
Summary
As decentralized AI data collection platforms develop further, there will be more opportunities to promote better coordination paradigms and achieve a smoother data collection process. We also look forward to more good news from FirstBatch’s Dria on improving the quantity and quality of data.