Migrating Data to the Cloud: Back to the Future
As featured on CIO.com
I won big at a recent “casino night” event by betting all my chips and hitting blackjack on the last hand. My peers praised my courage, I collected a small prize (we weren’t playing for money), and then they asked why I had risked the bet. “There was nothing at stake,” I replied.
The same isn’t true for large businesses planning their migration to the cloud. The promise of on-demand capacity, low-cost storage, and a rich ecosystem of open-source and commercial tools is compelling. But the stakes are real, especially when it comes to migrating data. As hundreds of companies have now demonstrated, a single data breach can cause long-term economic, legal, and brand damage. Beyond data protection, simply managing data in the cloud is different, and if it’s not done right, the cost, complexity, and risk can bring down the house.
A simple “lift and shift” of a data warehouse or data lake to the cloud won’t generate enough cost savings to justify the effort. The cloud technologies that dramatically impact both TCO and scale are low-cost object storage (e.g., Amazon S3, ADLS) and elastic data processing (e.g., EMR, Spark). In fact, using these technologies to set up an elastic rather than fixed data management environment in the cloud can lower TCO by as much as 85%.
How Much Does It Cost to Manage Data in the Cloud?
These cost estimates were generated using Amazon’s Cost Calculator and assume a 32-node compute cluster and 500 TB of data under management. The Fixed Cloud option also includes the cost of purchasing enterprise support and services for one of the leading commercially available Hadoop distributions for a 32-node cluster. The Elastic Cloud option assumes a 32-node cluster processing 8 hours per day, 7 days per week.
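The compute side of this comparison comes down to node-hours. The following is a rough sketch only: the hourly rate and the 30-day month are hypothetical values chosen for illustration, not figures from the Cost Calculator, and the calculation ignores storage and support costs, so it does not attempt to reproduce the overall TCO estimate.

```python
# Compute-only comparison of a fixed vs. elastic 32-node cluster.
NODES = 32
FIXED_HOURS_PER_DAY = 24      # a fixed cluster runs around the clock
ELASTIC_HOURS_PER_DAY = 8     # from the Elastic Cloud assumption above
DAYS_PER_MONTH = 30           # assumption, for illustration only
RATE_PER_NODE_HOUR = 1.00     # hypothetical price, not a quoted rate


def monthly_compute_cost(nodes, hours_per_day, rate, days=DAYS_PER_MONTH):
    """Return the monthly cost of running `nodes` for `hours_per_day`."""
    return nodes * hours_per_day * days * rate


fixed_cost = monthly_compute_cost(NODES, FIXED_HOURS_PER_DAY, RATE_PER_NODE_HOUR)
elastic_cost = monthly_compute_cost(NODES, ELASTIC_HOURS_PER_DAY, RATE_PER_NODE_HOUR)
compute_savings = 1 - elastic_cost / fixed_cost

print(f"fixed:   {fixed_cost:,.0f} per month")
print(f"elastic: {elastic_cost:,.0f} per month")
print(f"compute savings: {compute_savings:.0%}")
```

Whatever the actual rate, an 8-hour processing window uses one third of the node-hours of an always-on cluster, so the compute bill drops by about two thirds before any storage savings are counted.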
It’s important to note that the technologies driving down data storage costs provide significantly fewer data management capabilities. Hadoop is a lot cheaper than Teradata, but it provides none of the data integrity controls, load balancing, and automation of a mature RDBMS. Similarly, S3 is cheaper than storage on Hadoop data nodes, but it’s just a file system: there are no tables, fields, or datatypes. To query or process data on S3, you need to use commercial or open-source tools (e.g., AWS Glue, EMR) or write custom programs. To manage and update data in S3, a data management tool (Redshift, Snowflake, Podium) is required. Data protection is limited to encrypting files, which is not very helpful when you want to analyze datasets that have PII in some fields. Although object storage is scalable, inexpensive, and flexible, it turns the clock back decades on data management.
As with many immature technologies, the limitations of object stores have been touted as features. They “allow” programmers to process data of any size, shape, or quality, and interpret its structure and contents. This “schema on read” approach works well for processing unstructured data or data that changes structure frequently. But it stymies the automation, standardization, and scale that are key to collaboration and reuse, because the meaning of the data is buried in the code. Sound familiar? It is. The rallying cry for relational databases was to make the structure and meaning of data declarative, not embedded in COBOL redefines (look it up).
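The difference between meaning buried in code and meaning declared as data can be shown with a small sketch. The sample records, field names, and function names below are all hypothetical, and the parsing is deliberately minimal.

```python
import csv
import io

# Hypothetical raw file contents: no header, no declared types.
RAW = "1001,Ada,1985-02-11\n1002,Grace,1990-07-30\n"


def read_customers_embedded(raw):
    """Schema on read: field positions and types live only in this code."""
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        rows.append({"id": int(rec[0]), "name": rec[1], "birth_date": rec[2]})
    return rows


# Declarative alternative: the structure is data that any program can share.
CUSTOMER_SCHEMA = [("id", int), ("name", str), ("birth_date", str)]


def read_with_schema(raw, schema):
    """Generic reader: works for any file once its schema is declared."""
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, rec)})
    return rows


embedded_rows = read_customers_embedded(RAW)
declared_rows = read_with_schema(RAW, CUSTOMER_SCHEMA)
```

Both readers produce the same rows, but only the second can be reused for another dataset without writing new parsing code, because the structure is held outside the program.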
Bridges Built from a Catalog-First Strategy
The bridge between highly structured databases and “anything goes” object stores is a data catalog. The catalog is a shared database that provides structure and meaning to data in object stores. Hadoop catalogs include Hive, Atlas, and Navigator, which define how HDFS files make up tables and fields. Through an API, programs can query the catalog to find the structure of a logical data object, its technical and business properties, access permissions, and the location of the data files. These programs can then push insights and results back into the catalog to enrich it.
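As a minimal sketch of what such a catalog API might look like, the following in-memory model covers the four things a program looks up (structure, properties, permissions, location) and the write-back that enriches the catalog. The class names, method names, roles, and bucket path are hypothetical and do not correspond to the API of Hive, Atlas, Navigator, or any other product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    location: str                       # where the data files live
    fields: dict                        # field name -> datatype
    allowed_roles: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)  # technical/business metadata


class Catalog:
    """Toy in-memory catalog; a real one would be a shared database."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def describe(self, name, role):
        """Return the entry if the caller's role is permitted to read it."""
        entry = self._entries[name]
        if role not in entry.allowed_roles:
            raise PermissionError(f"{role} may not read {name}")
        return entry

    def enrich(self, name, key, value):
        """Let a program push a result back into the catalog."""
        self._entries[name].properties[key] = value


catalog = Catalog()
catalog.register(CatalogEntry(
    name="customers",
    location="s3://example-bucket/customers/",   # hypothetical path
    fields={"id": "int", "name": "string", "birth_date": "date"},
    allowed_roles={"analyst"},
))
catalog.enrich("customers", "row_count", 2)
entry = catalog.describe("customers", role="analyst")
```

A program that starts from `describe` never needs to hard-code a file path or a field layout, and anything it learns about the data can be shared through `enrich`.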
Many cloud catalogs, however, are passive: they scan files and logs to infer the structure and usage of data after it has been processed. Data management must be active to ensure that sensitive data is not exposed, important data standards are followed, and rogue actors don’t create a house of cards. All cloud migrations should adopt a catalog-centric policy:
- All shared and sensitive data is registered in a common catalog
- All programs access data through the catalog and log their activity
This allows a company to provide basic data management that supports a wide range of rapidly evolving technologies. A data lake on S3 can support Hadoop processing, custom PySpark code, R analytics, AWS Glue, etc. while maintaining (and enriching) a shared data asset. Furthermore, a standard can be set for how data is stored, updated, and checked for quality, which allows lights-out automation of these tasks.
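The two rules of the catalog-centric policy can be sketched as a single access function: a program asks the catalog where a dataset lives, unregistered data is refused, and every attempt is logged. The dataset names, program names, and bucket path are hypothetical, and a real deployment would keep the registry and the log in shared, durable storage rather than in memory.

```python
from datetime import datetime, timezone

# Hypothetical registry: dataset name -> where its files live.
CATALOG = {
    "customers": {"location": "s3://example-bucket/customers/", "sensitive": True},
}
AUDIT_LOG = []


def resolve(dataset, program):
    """Return a dataset's location, recording every attempt in the log."""
    registered = dataset in CATALOG
    AUDIT_LOG.append({
        "dataset": dataset,
        "program": program,
        "granted": registered,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not registered:
        raise LookupError(f"{dataset!r} is not registered in the catalog")
    return CATALOG[dataset]["location"]


location = resolve("customers", program="nightly_refresh")
```

Because the log is written before the registration check fails, attempts to reach unregistered data leave a trace as well, which is what makes the catalog active rather than passive.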
The catalog also enables elasticity, which is central to cloud economics. The catalog can be available 24/7 on a single server, supporting business users shopping for data, developers designing new data products, and stewards checking quality and adding business definitions. Only data processing tasks (such as data loads, refreshes, preparation, and analytics) require parallel processing power. Relational databases and Hadoop have traditionally coupled storage, processing, and the catalog in one fixed system, so as data grows, costs rise across the board. In the new world, the catalog is again the bridge, this time between processing power and cheap storage. Vast amounts of data can be managed affordably with the catalog, and processing costs can be controlled. In fact, if the catalog has profiling statistics (e.g., cardinality, min, max), it can optimize the processing of the data.
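One way profiling statistics can reduce processing is by letting a job skip files that cannot contain matching rows. The sketch below assumes the catalog keeps a min and max value per file for a date field; the file paths, the field, and the function name are hypothetical, and this is only one of several possible optimizations.

```python
# Hypothetical per-file statistics for an order_date field, as a catalog
# might record them during profiling. ISO dates compare correctly as strings.
FILE_STATS = [
    {"path": "orders/part-0001", "min": "2019-01-01", "max": "2019-03-31"},
    {"path": "orders/part-0002", "min": "2019-04-01", "max": "2019-06-30"},
    {"path": "orders/part-0003", "min": "2019-07-01", "max": "2019-09-30"},
]


def files_to_scan(stats, low, high):
    """Keep only files whose [min, max] range overlaps [low, high]."""
    return [s["path"] for s in stats if s["max"] >= low and s["min"] <= high]


# A query for May 2019 needs to read only one of the three files.
may_orders = files_to_scan(FILE_STATS, "2019-05-01", "2019-05-31")
```

The decision is made from catalog metadata alone, so no cluster has to be started to find out which files matter.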
Another benefit of being catalog-centric is portability. Cloud vendors are eager to have you sign up for their integrated, proprietary tools. That is their strategy: once they have your data and code in their applications, they have you. A catalog gives you choice. We literally migrated a customer from one cloud vendor to another in a weekend because the environment was catalog-driven and automated.
Behind the firewall, a catalog-first strategy is best, and it prepares you to be catalog-centric in the cloud. An automated cataloging tool can give you insights into all your data assets (relational, mainframe, Hadoop, files) in a few weeks, and give you a playbook for migration that answers questions such as:
- What sources should we migrate?
- Where is the GDPR-regulated data and PII?
- What duplicated and related data should we rationalize?
- What is the profile, content, and quality of every field?
The objective is to create cloud-ready data with a verifiable audit trail that attests to its provenance, lineage, and quality. Furthermore, the catalog provides a foundation for agility and scale through secure, self-service access for a broad user community. With real insights into the readiness of your data to move to the cloud, and a cloud-native catalog ready to manage it, you can accelerate migration with both confidence and control.