Last week I joined 800 fellow IT leaders and technology experts at the MIT CIO Symposium. The event was packed with insightful speakers and research findings, many of which are directly relevant to big data and the data lake concept specifically. I thought I would share my top two data lake-related takeaways, plus a third thing that is just for fun.

In our last blog, we identified how certain types of record formatting problems can lead to corrupt, inaccurate data when files are imported into Hadoop. In this post we'll explore why Hadoop alone struggles to address these types of dirty data problems upon ingest.

Let’s look at a simple example of how embedded delimiters cause problems in Hadoop. Shown here is a file of consumer complaints that comes from the Consumer Financial Protection Bureau, which captures a record for every complaint registered with the bureau.
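To make the problem concrete, here is a minimal sketch of how an embedded delimiter breaks a record apart. The sample record below is hypothetical (it is not taken from the actual CFPB file), but it illustrates the pattern: a complaint narrative that contains a comma inside a quoted field. A naive split on the delimiter, which is effectively what a simple ingest job or a basic Hive table definition does, miscounts the fields, while a quote-aware CSV parser does not.

```python
import csv
from io import StringIO

# Hypothetical complaint record: the narrative field contains an
# embedded comma inside a quoted field, as CSV permits.
line = '12345,"Loan was denied, then fees were charged",Mortgage'

# A naive split on the delimiter breaks the record into four fields
# instead of three, splitting the narrative in two.
naive_fields = line.split(',')
print(len(naive_fields))  # 4

# A quote-aware CSV parser keeps the embedded delimiter inside its field.
proper_fields = next(csv.reader(StringIO(line)))
print(len(proper_fields))  # 3
```

Once the record lands in Hadoop with the wrong field count, every downstream column is shifted, which is exactly the kind of silent corruption described above.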

We're adding this blog to our website to create a dialogue with our visitors. In the days, weeks, months and years ahead, we'll offer our thoughts, advice and tips about analytics, big data and enterprise data management to help make the promise of Hadoop a reality.