Ed’s Talk: Blogs from the CEO
In my vision, the Graviti Data Platform is the next-generation data platform. To define what a next-generation data platform is, we first need to understand the current generation: why it came about and how it is used. Demand for a next-generation data platform inevitably comes from changes in the macro environment. Understanding these changes helps us understand the scenarios in which the next-generation data platform will be used and the trade-offs we need to make in developing our platform.
Here is a quick walkthrough of the topics covered in this article. Feel free to jump to the part you are most interested in.
Database VS Data Platform
To understand why a next-generation data platform is needed, we must begin with the birth of the current generation of data platforms. Companies that represent the current generation include Snowflake and Databricks. Let’s start with the following two questions:
- Why were companies like Snowflake and Databricks born around 2012?
- What happened at the macro level that laid the foundation for their birth?
In the following sections, I will share my answers to these questions and the reasoning behind them.
Making decisions through data
The basis for the birth of a data platform
Imagine the following business scenario: you are the person in charge of a company. The company has numerous departments, including marketing, sales, finance and HR. The marketing department is responsible for spending money on advertising to promote the company's products and to generate attention and sales leads.
But because the advertising channels are different, the conversion rates may be different, and the number of sales leads obtained from different channels and different campaigns will also vary. The sales department of the company is responsible for converting these prospects, but the efficiency of each sales rep is different. Each customer is likely to be from a different industry, and the final purchase order may also be unique. The company’s financial department has information on all customer payments. Some customers pay slowly, some pay fast, and some owe money. The company's HR department has information about employees in each of the company’s departments.
If you are the person in charge of this company, the goal at this stage is to optimize sales efficiency and bring in more revenue. You need to find the most effective way to obtain high-quality customers who are willing to pay a lot, and excellent salespeople who can convert them into orders in the shortest time. You’re also interested in recruiting more of these sales reps, while keeping the cost of investment in mind. To do this, you must study the data from the various departments and integrate it for analysis, which will enable you to make better decisions. However, you will find that the data is actually scattered across separate departments.
In a company’s early stages, the person in charge could ask each department to manually export the data, and put it together in a large spreadsheet for data analysis. This large table is actually the earliest form of a data platform, and the reason for its existence is that an increasing number of managers no longer believe in intuition. Instead of going with their “gut” or instinct, they leverage the numbers to ascertain a far more accurate understanding of the situation, and base decisions on that.
After the dot-com bubble burst in the early 2000s, database technology became more and more popular. A wave of software followed that helped each department manage its internal data better, instead of relying on spreadsheets. For example, the marketing department adopted website traffic monitoring products, while the sales department embraced Customer Relationship Management (CRM) tools. Behind each of these products is a relational database. With so many convenient management tools, and as the business develops, the company can keep growing and more people can collaborate efficiently. However, new problems cropped up. Data accumulated so quickly that it often grew too large (several GB) for a spreadsheet, which is held in the computer's memory, and managers could no longer merge it manually. A better data platform product was needed: one that automatically pulls data from the various databases (ETL or ELT) and places it in a larger table. That table lives on disk rather than in RAM, and you can quickly filter the data through a SQL interface to get the results you want.
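The idea of pulling departmental data into one on-disk table and filtering it with SQL can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a real data platform; the table names, columns, and sample rows are invented for illustration and are not taken from any actual product.

```python
import os
import sqlite3
import tempfile

# Hypothetical exports from two departmental systems (all values are made up).
MARKETING_LEADS = [  # (lead_id, channel)
    (1, "search_ads"), (2, "search_ads"), (3, "social"), (4, "email"),
]
SALES_ORDERS = [  # (lead_id, sales_rep, amount)
    (1, "alice", 1200.0), (3, "bob", 300.0), (4, "alice", 500.0),
]


def load(db_path):
    """'Load' step: copy the extracted rows into one database file on disk."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, channel TEXT)")
    conn.execute("CREATE TABLE orders (lead_id INTEGER, sales_rep TEXT, amount REAL)")
    conn.executemany("INSERT INTO leads VALUES (?, ?)", MARKETING_LEADS)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", SALES_ORDERS)
    conn.commit()
    return conn


def revenue_by_channel(conn):
    """Join marketing and sales data: leads, converted leads, revenue per channel."""
    query = """
        SELECT l.channel,
               COUNT(l.lead_id) AS leads,
               COUNT(o.lead_id) AS converted,
               COALESCE(SUM(o.amount), 0.0) AS revenue
        FROM leads AS l
        LEFT JOIN orders AS o ON o.lead_id = l.lead_id
        GROUP BY l.channel
        ORDER BY revenue DESC
    """
    return conn.execute(query).fetchall()


def run_demo():
    # The database is a file on disk, not an in-memory spreadsheet.
    with tempfile.TemporaryDirectory() as tmp:
        conn = load(os.path.join(tmp, "platform.db"))
        try:
            return revenue_by_channel(conn)
        finally:
            conn.close()


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

A single query answers a question that spans two departments (which channel brings in the most revenue), which is exactly what the manually merged spreadsheet was used for.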
Why do you need a next-generation data platform?
As we mentioned in the introduction, the current generation of data platforms was born for data-driven decision making, and it will exist for a long time. In the future, enterprises will rise or fall based on their ability to interpret data and use insights to make smart moves. So what opportunity gives rise to the next-generation data platform?
The rapid accumulation of unstructured data
The core driving force for the birth of the next-generation data platform
The smartphones we use today generally have as many as four cameras on the back, and there may also be a 3D camera on the front. The popularity of smartphones has greatly reduced the price of sensors, not only on the consumer side but also on the enterprise side. Communication technology with wider bandwidth and greater throughput makes interaction based on unstructured data possible. Today, your main channels for obtaining information, such as TikTok, Yelp, and Instagram, are based on unstructured data such as pictures and videos. In the past five years, unstructured data has accumulated rapidly on your devices and, at the same time, in enterprises of all kinds. This data is often scattered in every corner of an organization, and even on every device. How to manage this data and extract value from it has become a problem that every organization must solve. Additionally, the mushrooming amount of information poses new and serious challenges to the system architecture that manages this data.
Characteristics of unstructured data platforms
AI is the cornerstone to harness the power of unstructured data
The biggest difference between unstructured and structured data is that it is almost impossible to use unstructured data directly. In the decision-making scenarios that structured data platforms serve, people analyze numbers with statistical models, and those numbers already exist in the structured data. (Strictly speaking, the structure is defined when the software is written, so the data a machine generates contains exactly what the developer wanted it to contain.) To a computer, however, unstructured data is just a block of binary data. To extract its meaning, you have to use AI.
The previous sections started from the use cases and derived the characteristics of the platform from the characteristics of those use cases. So what are the usage scenarios and characteristics of the unstructured data platform?
The first question we naturally ask is: how frequently are unstructured data platforms used, and what are their latency requirements?
We just said that AI is the most important way to use unstructured data. So we might as well ask ourselves: what are the high-frequency usage scenarios for AI? In fact, the high-frequency scenario for AI is large-scale predictive applications. Almost all large-scale, low-latency predictive applications (or direct applications of other algorithms) run on the edge. For example, the predictive algorithms of autonomous driving run in the car. The beauty filter in a camera app runs on the mobile phone. The algorithms of a smart camera run on the camera's own processing chip. AI applications with strict latency requirements generally do not transmit data over the network, because network latency is hard to control. If high-frequency predictions happen on the edge, the rawest data also accumulates on the edge, which means unstructured data is scattered from the moment it is collected.
The data platform we provide is located in the cloud (whether public or private cloud does not matter for now). So we have to ask ourselves: what kinds of algorithm applications run in the cloud?
In fact, almost all offline processing can happen in the cloud. Offline processing includes, but is not limited to, model training, large-model prediction, algorithm evaluation, and ad hoc analysis. These workloads are often low-frequency (initiated by a small number of specialists and used within the organization) and have low latency requirements. These features are very similar to those of data platforms for structured data. Of course, differences in data types, in the complexity of processing, and in usage scenarios will still make the structured data platform and the unstructured data platform very different.
Let us summarize the characteristics of an unstructured data platform:
- It is used mostly in offline processing scenarios, with large data throughput but low latency requirements and low frequency of use.
- It is used in AI scenarios, so there are a large number of interactions related to model prediction or training.
- The data on each edge device is scattered and limited in scale, but the data on the cloud platform is the sum of the data on all edge devices. In addition to the large volume of data, data collection in distributed edge scenarios must also be considered.
- Because the data is continuously collected by sensors, whether it is audio, video, or three-dimensional data like point clouds, its timing is an important feature. Using the data also involves a large amount of time-based searching, splitting and merging.
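The time-based searching, splitting and merging mentioned in the last point can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Frame` record, the sensor names, and the file paths are hypothetical, and every stream is assumed to be already sorted by timestamp.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class Frame:
    timestamp: float  # seconds since the start of recording
    sensor: str       # e.g. "camera" or "lidar"
    uri: str          # where the raw bytes live (hypothetical path)


def search(frames, start, end):
    """Return frames with start <= timestamp <= end; `frames` must be time-sorted."""
    times = [f.timestamp for f in frames]
    return frames[bisect_left(times, start):bisect_right(times, end)]


def split(frames, window):
    """Split a time-sorted stream into consecutive clips of `window` seconds."""
    clips = {}
    for f in frames:
        clips.setdefault(int(f.timestamp // window), []).append(f)
    return [clips[k] for k in sorted(clips)]


def merge_streams(*streams):
    """Interleave several time-sorted streams into one time-sorted stream."""
    return list(merge(*streams, key=lambda f: f.timestamp))


# Two made-up sensor streams recorded at different rates.
camera = [Frame(t, "camera", f"cam/{i}.jpg")
          for i, t in enumerate([0.0, 0.5, 1.0, 1.5, 2.0])]
lidar = [Frame(t, "lidar", f"lidar/{i}.pcd")
         for i, t in enumerate([0.25, 1.25, 2.25])]

timeline = merge_streams(camera, lidar)   # one stream ordered by time
window_hits = search(timeline, 0.5, 1.25) # frames inside a time range
clips = split(timeline, 1.0)              # one-second clips
```

Only references to the raw data (the `uri` field) are moved around here. The heavy binary payloads stay where they are, which matters when the data volume is large.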
The processing of unstructured data often involves a large number of AI predictions, and the results of these predictions are often not the final desired results. Storing intermediate results, and tracking how they relate to the original data, therefore becomes particularly important. The processing of unstructured data is also often multi-step, so we need to help users define multiple processing steps when the amount of data in each step is extremely large, and schedule enough computing power to process it. Unlike structured data, which has many standard processing operators, the processing of unstructured data is closely tied to the user’s business. This means the processing is often not standardized and differs from user to user. Users therefore want to customize the processing algorithms themselves and add their own algorithm capabilities. The platform should allow users’ models and algorithm capabilities to be integrated with our system.
This gives us a second batch of characteristics:
- Data lineage and data consistency have very special significance for the long-term storage, management, and use of data, and should be a requirement of the related system design.
- The data platform should make it easier for users to define data processing workflows and pipelines, customize the processing methods, and schedule computing power on a large scale.
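To make these two requirements more concrete, here is a minimal Python sketch of a pipeline whose steps are user-defined functions and whose every intermediate result records which input it came from. The `Pipeline` and `Artifact` names are hypothetical and are not Graviti's API; payloads are assumed to be JSON-serializable, and scheduling compute at scale is left out entirely.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Artifact:
    artifact_id: str          # content hash: identical results get identical ids
    step: str                 # step that produced it ("source" for raw data)
    parent_id: Optional[str]  # lineage link to the input artifact
    payload: Any              # assumed to be JSON-serializable


def make_artifact(step, payload, parent_id=None):
    blob = json.dumps([step, parent_id, payload], sort_keys=True).encode("utf-8")
    return Artifact(hashlib.sha256(blob).hexdigest()[:12], step, parent_id, payload)


class Pipeline:
    def __init__(self):
        self.steps = []   # user-defined processing functions, in order
        self.store = {}   # artifact_id -> Artifact, including intermediates

    def step(self, func):
        """Decorator that registers a user-defined processing step."""
        self.steps.append(func)
        return func

    def run(self, raw):
        current = make_artifact("source", raw)
        self.store[current.artifact_id] = current
        for func in self.steps:
            current = make_artifact(
                func.__name__, func(current.payload), current.artifact_id
            )
            self.store[current.artifact_id] = current
        return current

    def lineage(self, artifact):
        """Walk parent links back to the raw data; return step names in order."""
        chain = []
        node = artifact
        while node is not None:
            chain.append(node.step)
            node = self.store.get(node.parent_id) if node.parent_id else None
        return list(reversed(chain))


pipeline = Pipeline()


@pipeline.step
def detect_objects(frames):
    # Stand-in for a user-supplied model: pretend every frame contains one car.
    return [{"frame": f, "label": "car"} for f in frames]


@pipeline.step
def count_labels(detections):
    counts = {}
    for d in detections:
        counts[d["label"]] = counts.get(d["label"], 0) + 1
    return counts


result = pipeline.run(["frame_0.jpg", "frame_1.jpg"])
history = pipeline.lineage(result)
```

Because every artifact id is derived from the step name, the parent id, and the content, rerunning the same steps on the same input yields the same ids. That is one simple way to check consistency between runs.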
The last thing we need to pay attention to is the users themselves. Who are they?
Users of unstructured data platforms are different from users of databases and structured data platforms: they are typically algorithm engineers. The interaction and user experience provided by the platform should therefore be different as well. Algorithm engineers care about the model and the algorithm itself. When it comes to the underlying system, they may have little experience and little interest. This is exactly what we need to take care of: encapsulating the capabilities of the system in a simple, easy-to-use interactive experience and putting it in the hands of algorithm engineers. At the same time, because the user groups are different, the unstructured data platform does not replace the structured data platform. Instead, it complements it and is used independently alongside it.
Other features that the platform should have:
- Provide a reasonable abstraction for algorithm engineers and make the user experience as simple as possible.
Final thoughts
In this article, I systematically explained my recent thinking about Graviti as a next-generation data platform. Writing it gave me a better understanding of our data platform’s position in the entire ecosystem, and of the underlying design logic of structured data products and their trade-offs. The purpose is to help us stand on the shoulders of giants and keep innovating. We know what we can learn from existing systems, such as the separation of storage and compute in data platform architecture, which also applies to us. We should also know which areas we need to keep exploring on our own, such as data processing and retrieval at scale.
What I haven't discussed deeply in this article is the function of data version control, which is worthy of an in-depth discussion in a separate piece.
You're welcome to comment and leave a message. Thank you all for reading.