Kudu PART 1: the secret to filling the gap between fast analytics and fast data at GeoDB
Problem statement
In this post, we are going to talk about Big Data at GeoDB. We have been investigating and experimenting with several open-source tools in the Big Data processing and analytics ecosystem, mainly Apache Top-Level Projects (TLPs) of the Apache Software Foundation (ASF), so that we can adapt the technology to our needs and build Extract-Transform-Load (ETL) backend processes. These processes are meant to power a visual single pane of glass through which customers entering our data market at GeoDB can view and acquire datasets. We want to share this progress with the GeoDB community and beyond, and also highlight some of the challenges we expect to face along the way.
The challenges we face stem from the large volume of incoming data from our users at GeoDB, which will only keep growing over time. We want to use that data to build a Big Data Marketplace that makes it easy for anyone to exchange data, ultimately creating a circular economy between data providers and data buyers. We believe one of the most important tipping points for customers to buy in our Big Data Marketplace is being able to visualize the data, so that they can properly assess its properties and therefore its implicit value, as well as the explicit price they would pay if they choose to. Secondly, our location data pool needs to be presented to data buyers interactively, which means online, near-real-time maps over Big Data, and this is where things start to get more interesting.
One option to visualize these datasets on an online, near-real-time map is to let data buyers decide interactively which regions or points of interest on the map they are willing to pay for, whether to develop their business prospects or just a personal weekend data science project; in turn, that means queries against the database holding the data. We went ahead and identified a list of important technical points to consider in order to make this interaction with our Big Data Marketplace at GeoDB as interactive as possible for customers, while staying as performant as we can on our side given the volume of dataset requests we expect. To set the scene, our setup looks roughly as follows:
- The premise: we are already collecting millions of data rows daily at GeoDB, generated by IoT sensors and/or mobile devices, and that data is stored in a database on a cloud service; assuming a continuous stream of incoming data, we must be able to keep processing and storing it.
- The visualization challenge: while the above still holds, we also want to provide (i) real-time geospatial analytics with user-defined boundaries and shapes and (ii) dynamic map rendering, so that customers and users can visualize Big Data over those maps interactively, which comes at a cost for our backend (a small code sketch after Figure 1 shows how such a user-drawn boundary can be reduced to query predicates).
- The solution: what we need is a geospatial analytics system able to render IoT data at large scale with its core properties and characteristics intact, backed by a storage system that can cope with both substantial volumes of incoming data and interactive analytics and visualizations. In this vein, here is a small preview of how the visualization of a spatial area for location data in the city of Madrid looks.
Figure 1. Selecting Retiro Park area with locations in the city of Madrid
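To make point (i) above slightly more concrete, here is a minimal, illustrative sketch (the polygon_to_bbox helper and the coordinates are ours, purely for illustration, not part of any GeoDB API) of how a boundary drawn by a data buyer on the map, such as the Retiro Park area in Figure 1, could be reduced to a bounding box that the backend then turns into simple range predicates over latitude and longitude columns:

```python
from dataclasses import dataclass
from typing import List, Tuple

# A user-drawn boundary arrives from the map as a list of (lon, lat) vertices,
# e.g. the area around Retiro Park in Figure 1.
Polygon = List[Tuple[float, float]]

@dataclass
class BoundingBox:
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

def polygon_to_bbox(polygon: Polygon) -> BoundingBox:
    """Reduce an arbitrary user-drawn polygon to its bounding box, which maps
    directly onto range predicates over the lon/lat columns in storage."""
    lons = [lon for lon, _ in polygon]
    lats = [lat for _, lat in polygon]
    return BoundingBox(min(lons), min(lats), max(lons), max(lats))

# Example: a rough rectangle around Retiro Park, Madrid, as (lon, lat) pairs.
retiro = [(-3.6930, 40.4090), (-3.6795, 40.4090),
          (-3.6795, 40.4210), (-3.6930, 40.4210)]
bbox = polygon_to_bbox(retiro)
print(bbox)
```

Exact point-in-polygon filtering can then be applied either in the query layer or client-side over the much smaller bounding-box result.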
The approach
Our approach at GeoDB, considering the points above, is that we are going to be dealing with fast analytics on fast data, namely location data. This is a hard problem to tackle, according to common knowledge in the Big Data industry. Traditional approaches dealt with it by relying on batch ETL pipelines; however, the Extract-Transform-Load (ETL) paradigm is already evolving from batch ETL towards real-time streaming ETL architectures, which is great. For our particular use case, the type of data processing and analytics we require has pros and cons under either ETL architecture; if you are interested in the low-level details, we list a few of them below for each one, otherwise keep reading.
Batch ETL
- Pros: it provides offline-analytics capabilities while more data is being ingested and stored. For instance, systems such as Apache HBase provide low-latency ingestion, high throughput, and a real-time query API that is not hardcoded to any particular type of analytics and is extensible (see the sketch right after this list).
- Cons: several storage systems are necessary, and periodic transformations are required before recent data can be used in analytics. Managing that kind of infrastructure is an additional burden worth considering.
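As a hedged illustration of that kind of API, the snippet below uses the happybase Python client for HBase; the table name, column family, and row-key layout are hypothetical choices made up for the example rather than anything from our actual setup.

```python
import happybase  # third-party Python client for the HBase Thrift API

# Connect to an HBase Thrift server (host is a made-up example).
connection = happybase.Connection('hbase-thrift.example.internal')
table = connection.table('locations')

# Low-latency ingestion: one put per incoming location event. The row-key
# layout (device id + timestamp) is a hypothetical choice for the example.
table.put(b'device42#20190601120000', {
    b'loc:lat': b'40.4153',
    b'loc:lon': b'-3.6845',
})

# Real-time, non-hardcoded queries: scan the recent rows of one device.
for key, data in table.scan(row_prefix=b'device42#', limit=100):
    print(key, data[b'loc:lat'], data[b'loc:lon'])

connection.close()
```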
Real-time ETL
- Pros: the major advantage here is the use of a single storage system for all tasks, from streaming to interactive analytics. Systems such as Apache Kudu are not hardcoded to any particular type of analytics and are designed to be extensible, although they are relative newcomers to the Big Data arena and interesting use cases are only starting to emerge.
- Cons: the design is not always suited for OLTP workloads in principle, but it can be adapted so that the cost of duplicating the storage system is reduced. Systems such as Apache Kudu already bridge much of this gap; perhaps the main remaining limitation (at least for now, as the system continues to develop) is that the primary key and partitioning must be defined at table-creation time and cannot be changed afterwards, even though non-key columns can still be added later. Whether this is relevant depends on your schema definition at the end of the day; for us at GeoDB this is something we can live with, and in any case one may end up with several schemas, one per type of data source, when dealing with several sources. A concrete schema sketch follows this list.
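To illustrate the schema point, below is a sketch using the kudu-python client against a hypothetical locations table (column names, partitioning, and hosts are assumptions for the example): the primary key and hash partitioning must be fixed when the table is created, which is exactly why thinking the schema through per data source matters.

```python
import kudu
from kudu.client import Partitioning

# Connect to a Kudu master (host and port are assumptions for the example).
client = kudu.connect(host='kudu-master.example.internal', port=7051)

# The primary key columns must be declared up front and cannot be altered
# later; non-key columns can still be added to the table afterwards.
builder = kudu.schema_builder()
builder.add_column('device_id', type_=kudu.string, nullable=False)
builder.add_column('event_ts', type_=kudu.unixtime_micros, nullable=False)
builder.add_column('lat', type_=kudu.double)
builder.add_column('lon', type_=kudu.double)
builder.set_primary_keys(['device_id', 'event_ts'])
schema = builder.build()

# Hash-partition on device_id to spread the ingest load across tablets.
partitioning = Partitioning().add_hash_partitions(
    column_names=['device_id'], num_buckets=16)

client.create_table('locations', schema, partitioning)
```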
We already mentioned that fast analytics on fast data is hard. Why is that? The main limitation of batch-oriented systems such as Hadoop has always been I/O latency and the append-only treatment of file persistence. Hadoop can of course process huge amounts of data in batch, but if one requires real-time analytics on fast-changing data, it is hardly an option. However, fear not: Apache Kudu to the rescue!
The Apache Kudu project, founded by long-time contributors to the Hadoop ecosystem, had been in development since approximately 2013 and entered the Apache Incubator in late 2015; at the time I was still using Apache HBase, but that is a story from the past. The important thing here is to realize how fast the Big Data industry is moving. Since around 2016, when Kudu graduated to a Top-Level Project and shipped its 1.0 release, we have had a truly mature system that performs fast analytics on fast-changing data, something we could not even have dreamed about back when we started in the Big Data business. The nice thing about Apache Kudu is that it fits the purpose of near-real-time and real-time ETL architectures perfectly, which, together with a cartography service, becomes a powerful backend arsenal on top of which to build the kind of interactive Big Data Marketplace with geospatial analytics we envision. At the technical level, Apache Kudu allows direct data ingestion when backpressure is not an issue, but it can also be fed from streaming services in order to offload pressure before persisting data to storage, as you can see in the figure below.
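As a rough sketch of what that direct ingestion path and an interactive read could look like with the kudu-python client, reusing the hypothetical locations table defined above (an illustration under our assumptions, not our production code):

```python
from datetime import datetime

import kudu

client = kudu.connect(host='kudu-master.example.internal', port=7051)
table = client.table('locations')

# Direct ingestion: apply inserts through a session and flush in batches.
session = client.new_session()
session.apply(table.new_insert({
    'device_id': 'device42',
    'event_ts': datetime.utcnow(),
    'lat': 40.4153,
    'lon': -3.6845,
}))
session.flush()

# Interactive analytics: scan only the rows inside the user-selected
# bounding box around Retiro Park (values from the earlier sketch).
min_lat, max_lat = 40.4090, 40.4210
min_lon, max_lon = -3.6930, -3.6795

scanner = table.scanner()
scanner.add_predicate(table['lat'] >= min_lat)
scanner.add_predicate(table['lat'] <= max_lat)
scanner.add_predicate(table['lon'] >= min_lon)
scanner.add_predicate(table['lon'] <= max_lon)
rows = scanner.open().read_all_tuples()  # fine for small, interactive scans
```

In a real deployment, a streaming service (for instance Kafka or Spark Streaming) would typically sit between the producers and these writes to absorb backpressure, as described above.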
Final thoughts
To conclude, for us at GeoDB Apache Kudu comes in handy from several perspectives, and given our current timeline and focus on cartography visualizations, it may well become a fundamental piece of our backend engineering, and thus of our final architecture, in the near future.
Having said that, we trust you will keep following us and learn more about this topic in upcoming posts, where we will look into the details of how to put together a cartography service (e.g., Mapbox) and Apache Kudu at GeoDB, something that is feasible at the technical level thanks to open-source Big Data tools and the developer support offered by tools such as Mapbox.
This is particularly true in the case of GeoDB, a novel Big Data Marketplace that requires effective visualization of the aggregated data we plan to make available to data buyers. We look forward to talking more about this in further posts, as well as about more geospatial storage and analytical systems in the second part of this series.