40 likes | 60 Views
Thereu2019s only a fraction of a percentage of data science or BI projects that donu2019t involve some sort of ETL component. Whether thatu2019s as basic as reading in multiple data sources in a Python script, or using something like SSIS to populate a central data store, you end up doing it.
E N D
A Do-It-Yourself ETL Framework in Python There’s only a fraction of a percentage of data science or BI projects that don’t involve some sort of ETL component. Whether that’s as basic as reading in multiple data sources in a Python script, or using something like SSIS to populate a central data store, you end up doing it. I’m a big believer in not reinventing the wheel when it comes to tools. ETL tools are a sort of solved problem, and there’s an abundance of options available. Tools like that don’t come cheap though. They add enormous value for adding lots of connectors, managing of interdependent pipelines, and monitoring/error reporting. The value of an ETL tool scales with the number of connectors you need, number of pipelines you have, and how interdependent those pipelines are. At some tipping point of those 3 factors, an ETL tool becomes a positive value. The tipping point will certainly vary depending on your skillset and resources available. For a recent project, I felt the value proposition was right on that tipping point. It was probably a marginal decision, but we ended up deciding build instead of buy. At the end, we ended up being pretty happy with our decision. Here’s an outline of what we did. For more details ETL Training The Definition Database The ultimate goal of the pipeline is to move data into a warehouse. To facilitate this process, we added a processing schema for pipeline meta processes. This schema has 3 primary tables. The first is the pipeline definition table. This specifies a single pipeline definition. A single query from a particular data source would be considered a pipeline. A production database would have many pipelines to retrieve data. This table keeps a few key pieces of information, like the file location of the query and staging upsert (more on those later) where applicable.
The second is the execution table. This ties a pipeline to an specific execution. The start, end, and last updated times, as well as the execution status and the number of records updated if successful The third is the attribute table. This table is a KVP list of attributes for individual pipelines for various purposes. It helps to generalize pipelines within a given data source. Learn more skills from ETL Testing Online Course The Actual Pipeline Back End The core pipeline functionality was built in Python. The backbone of the pipeline is a class called “pipeline” with a few key shared pieces of functionality. On initialization, pipeline objects require an id. This allows the pipeline to create an execution record and retrieve relevant attributes from the processing schema immediately. For mechanical purposes, we ended up also passing through our Flask app config with connection information and addresses. After initialization, pipelines have 3 self explanatory methods. They are read, load, and stage. The pipeline object may call a variety of functions for reading, loading, or staging data from different data sources. The read process is the most complex and variable of the 3 due to the variety of data sources. Based on the data source type in the pipeline, the read process will route the pipeline to a particular function that’s more specifically built for that type of data source. The diagram below outlines the process: The same overall pipeline object is responsible for all data sources, but pulls from a bank of different functions depending on the data source in order to get the right read source. The load and stage processes are identical across all pipeline sources. The Monitoring Front End
Originally, I added a few Flask restful endpoints just to make testing the application easier. Over time, I ended up adding in a UI to start/stop pipeline runs, which became a real-ish monitoring. Of all the skills that I dabble in, but have no real expertise, front end applications are the weakest. This is my proudest UI accomplishment: The front end is built in Flask, which I, as a total front end scrub, love working in. The endpoints are also managed through the Flask application as well. The Deployment Framework With Python applications, I’ve always had a pretty significant gap between local and production. Enter containers. This was my first attempt at using containers as a deployment mechanism instead of just as a local convenience for a database or whatever else. There’s a whole post on containerized Python upcoming, but the basic outline is: 1.Dockerfiles for main python web app, rabbitmq, and nginx 2.Docker compose to organize those 3 containers 3.AWS ECS/ECR to manage deployment It worked like a dream, and I’m never going back to any other method for deploying Python code unless I absolutely have to. The Invocation Framework With everything deployed in ECS, the only thing left to do was invoke pipelines on a regular basis. This ended up being a little more complex than
expected. We ended up using a combination of Cloudwatch, Flask, and SQS. The details aren’t particularly interesting. The core concept was to use Cloudwatch to trigger posts to endpoints for pipelines at particular intervals. All in all, it works great. Wrap We ended up diving into a lot more tools and services to get this pipeline running than we expected. Despite that, it was still only a 1 month project from start to finish. Adding pipelines has been straightforward, monitoring has been effective, and performance has been excellent. We’ve been live for a few weeks now, and I’d happily make the same call on the pipeline framework again. To get in-depth knowledge, enrol for a live free demo on etl testing online training