Podium Pointer #15: Onboarding Legacy Mainframe Files in the Data Marketplace

Welcome to another installment of “Podium Pointers,” a series dedicated to tackling complex topics in as few words as possible and pointing readers to additional resources for information and assistance.

What is Old is New Again.  As organizations redefine their data architecture to meet the demands of a data-driven world, an important consideration is augmenting traditional data warehouses with the advantages of data lake infrastructure. To take full advantage of data lake functionality, all data, including data that is infrequently accessed, becomes available at your fingertips for productive use. Often this means data stored in complex legacy mainframe files such as VSAM or COBOL must be effectively onboarded to help find patterns, predict trends, identify future revenue opportunities, and uncover other key business insights.

Simply ingesting these legacy mainframe files as is won't be sufficient. At Podium, we pride ourselves on providing business-ready data upon ingest. This means that even the aforementioned legacy mainframe datasets are immediately queryable by users. No complex custom code, no Hive SerDes, no ETL tool coding, just data. Data that is converted to a common format, fully validated, fully profiled, and ready for consumers to start using. Period.

Onboarding data with Podium is a simple process involving two basic actions.  First, you connect to the data you want to onboard in order to gather technical metadata about the source. This technical metadata is crucial to ensure that the data subsequently loaded can be validated, conformed, profiled, and canonicalized into a query-ready format. Other approaches and tools only consider the movement of data into the lake without simultaneously considering its quality. Podium’s unique approach makes data quality an upfront activity rather than a downstream one, which has a greater chance of producing incorrect results for end users. Let’s review the steps for gathering the source metadata.

Start by going to the Source module, clicking New Source, and choosing the Mainframe Source option to launch the Mainframe Source wizard.

Step 1 of the wizard allows you to select which source connection to use, define whether the data is managed, registered, or addressed, and specify which Source Name and Hierarchy the data should be associated with in Podium.


Step 2 of the wizard allows you to select your mainframe data file and the COBOL copybook that defines it.  Podium will use the COBOL copybook to parse the EBCDIC file. 
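To make the copybook-driven parsing concrete, here is a minimal sketch of how a fixed-width EBCDIC record can be decoded against a copybook-style layout. This is not Podium’s actual parser; the two-field copybook, field names, and widths are illustrative assumptions, and real copybooks involve many more PIC clauses, OCCURS, and REDEFINES.

```python
# Hypothetical copybook layout (field name, byte width), e.g.:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC X(6).
#      05 CUST-NAME  PIC X(10).
LAYOUT = [("cust_id", 6), ("cust_name", 10)]

def parse_record(raw: bytes) -> dict:
    """Decode one fixed-width EBCDIC record using the layout above."""
    fields, offset = {}, 0
    for name, width in LAYOUT:
        chunk = raw[offset:offset + width]
        # cp037 is the common US EBCDIC code page in Python's codec registry
        fields[name] = chunk.decode("cp037").strip()
        offset += width
    return fields

# A sample record, EBCDIC-encoded here purely for demonstration:
record = "000042Jane Doe  ".encode("cp037")
print(parse_record(record))  # {'cust_id': '000042', 'cust_name': 'Jane Doe'}
```

The point of the wizard is that Podium derives this layout from the copybook automatically, so no such code needs to be written by hand.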


Step 3 of the wizard will interrogate the mainframe file using the COBOL copybook and display the available entities. 


Step 4 of the wizard shows you all the fields available across all the entities you’ve selected.


Step 5 of the wizard allows you to select the format in which each entity will be written to HDFS.  Options include ORC, Parquet, and tab-delimited text.
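To illustrate the simplest of those output formats, the following sketch serializes parsed records as tab-delimited text, the kind of canonical flat representation a Hive table can sit on top of. The helper and its column list are illustrative assumptions, not Podium code; ORC and Parquet writers would normally come from a library.

```python
import io

def to_tab_delimited(records, columns):
    """Serialize parsed records as tab-delimited text, one line per record."""
    buf = io.StringIO()
    buf.write("\t".join(columns) + "\n")  # header row
    for rec in records:
        buf.write("\t".join(str(rec[c]) for c in columns) + "\n")
    return buf.getvalue()

rows = [{"cust_id": "000042", "cust_name": "Jane Doe"},
        {"cust_id": "000043", "cust_name": "John Roe"}]
print(to_tab_delimited(rows, ["cust_id", "cust_name"]))
```

Columnar formats such as ORC and Parquet trade this human readability for compression and much faster analytical scans, which is why they are usually preferred for query-heavy entities.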


After establishing the technical metadata for the mainframe file, the final action is to load the data, applying Podium’s automated validation and profiling. With the new source, Test_Drive_Ingest_Mainframe, we drill down one level to access all the available entities (tables). After selecting the table(s) you want to load, click the “Load” button to begin the process.


You will then see the loads executing. Once the loads are finished, you’ll be shown the load statistics for all jobs. Podium saves all data, even records that fail to meet the expected canonical format, and segments records into three categories (good, bad, ugly) to help with remediation.
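The good/bad/ugly segmentation can be sketched as a simple partitioning pass. The three category names come from the article; the toy record format and the validation rules below are illustrative assumptions, not Podium’s actual logic.

```python
def segment(records, validators):
    """Partition records: 'good' pass all checks, 'bad' parse but fail a
    field-level check, 'ugly' cannot be parsed into fields at all."""
    buckets = {"good": [], "bad": [], "ugly": []}
    for raw in records:
        try:
            # Toy parse: "key=value" pairs separated by semicolons
            rec = dict(f.split("=", 1) for f in raw.split(";"))
        except ValueError:
            buckets["ugly"].append(raw)  # structurally unreadable
            continue
        if all(check(rec) for check in validators):
            buckets["good"].append(rec)
        else:
            buckets["bad"].append(rec)   # parsed, but failed validation
    return buckets

checks = [lambda r: r.get("id", "").isdigit()]
data = ["id=42;name=Jane", "id=abc;name=Bob", "garbage-with-no-fields"]
result = segment(data, checks)
print({k: len(v) for k, v in result.items()})  # {'good': 1, 'bad': 1, 'ugly': 1}
```

Keeping the bad and ugly records rather than dropping them is what makes remediation possible: the rejected data is still in the lake, waiting to be fixed and reloaded.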


Just like that, you have loaded your mainframe file into your data lake, created queryable Hive tables from an original EBCDIC mainframe file, and started a governance process, all within an easy-to-use, code-free environment.  What does this mean? Data consumers now have not just access to, but confidence in, the data they need, in minutes rather than weeks or months. This is the Podium Data Marketplace.

Contact us to learn more about how you can unlock your data on demand rather than lock yourself into complex, expensive custom Java, Python, or tool coding while waiting weeks or months.