Scalable Architecture for Tax File Processing System Ambily K K
Introduction
• A professional services organization processes the tax files of various organizations. The volume includes:
  • 30,000 customers
  • 36,000 business rules
  • 10,500 standard business and tax codes
  • 1,000 files to be processed per day
  • 250 MB average file size
• The Tax File Processing System receives tax files from different entities, each of which has its own tax code system (BSI, Vertex, Coins, custom codes, etc.). These codes need to be mapped to standard codes to complete file processing.
• Tax File Processing System:
  • The number of files is very high
  • File size varies from 25 KB to 5 GB
  • File formats vary from client to client
  • File content format is dynamic
  • The standardization process involves a large number of rules
  • Must be scalable to new client needs
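The code-mapping step described above can be pictured as a lookup keyed by source system and client code. The following is a minimal Python sketch, not the system's actual implementation: the table contents, the code values, and the name `to_standard_code` are all hypothetical.

```python
# Hypothetical mapping table: (source system, client code) -> standard code.
# Every code value here is invented purely for illustration.
STANDARD_CODE_MAP = {
    ("BSI", "A1"): "STD-001",
    ("VERTEX", "X9"): "STD-001",
    ("CUSTOM", "FED"): "STD-001",
    ("BSI", "B2"): "STD-002",
}


def to_standard_code(source_system: str, client_code: str) -> str:
    """Return the standard code for a client code, or raise if unmapped."""
    key = (source_system.upper(), client_code.strip())
    try:
        return STANDARD_CODE_MAP[key]
    except KeyError:
        # Unmapped codes are surfaced rather than silently passed through.
        raise LookupError(
            f"no standard code for {source_system}/{client_code}"
        ) from None
```

Different client codes from different systems resolve to the same standard code, which is what lets downstream rules be written once against the standard set.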
Proof of Concept Goal
Optimize and experiment with various possibilities for improving file processing performance.
• Improve current performance levels by introducing the necessary extensions and levers into the current design
• Explore possible alternatives for redesigning the implementation
• Establish facts and evidence to conclude on an approach that is scalable and extensible
• Target: 1000 files per day to be processed
Proof of Concept (Cont.)
POC Scope
• Compare the SSIS solution with the .NET solution for reading and processing a file (time consumed, memory, and CPU)
• Optimize both solutions so that optimized code is compared against optimized code
• Compare using large test files (between 1 GB and 5 GB)
• Compare with multiple files run in parallel
• Explore a different approach to reading files, chunking, to reduce time, memory, and CPU
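The three comparison metrics in the scope (time, memory, CPU) can be captured with a small harness. The sketch below uses Python's standard library rather than the SSIS or .NET tooling used in the POC; the helper name `measure` is hypothetical, and `tracemalloc` only tracks memory allocated by Python itself.

```python
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args); return (result, wall_seconds, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = func(*args)
    finally:
        # Take all readings even if func raises, then stop tracing.
        cpu_seconds = time.process_time() - cpu_start
        wall_seconds = time.perf_counter() - wall_start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, wall_seconds, cpu_seconds, peak_bytes
```

Running both candidate implementations through the same harness on the same input file keeps the comparison like for like.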
High Level Architecture
Building Blocks
Variability is a baseline. Our system is designed to manage variability once, and keep the rest of our technology consistent and stable. This supports XML, H2H, and ERP files.
[Architecture diagram. Labels: Receiving Variability; Business Rule Engine; Smart File Scheduler; Smart File Chunk; File Chunk Validation; Normalized File Filter; Cache Set; Stage File; XML; H2H]
SSIS – Design Thoughts
• Decouple the file chunk process from the existing design
• Parallelize the file chunk operations to split files into smaller, manageable sets
• File chunk criteria
  • Chunk only larger files, 500 MB or more
  • Chunk into logical sets, leveraging XSD templates
  • Chunking logic for ASCII files can be based on a horizontal header and detail break-up
• Best practices for file chunking
  • Visit head and tail nodes for each chunk
  • Use LINQ queries
  • Avoid nested loops
[Diagram labels: Smart File Chunk; Chunk Task (x4); Logical set (x2)]
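The chunking criteria above can be sketched as follows. This is a Python illustration under stated assumptions, not the SSIS package logic: it assumes an ASCII file whose first line is the header and whose remaining lines are detail records, and the names `should_chunk`, `chunk_records`, `process_chunk`, and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# "Chunk only larger files, 500 MB or more" (binary megabytes assumed here).
CHUNK_THRESHOLD_BYTES = 500 * 1024 * 1024


def should_chunk(file_size_bytes: int) -> bool:
    """Apply the size criterion for chunking."""
    return file_size_bytes >= CHUNK_THRESHOLD_BYTES


def chunk_records(lines, records_per_chunk):
    """Yield (header, detail_records) pairs from an iterable of lines.

    The header is repeated with every chunk so each chunk is a
    self-describing logical set.
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input: no chunks
    while True:
        details = list(islice(it, records_per_chunk))
        if not details:
            break
        yield header, details


def process_chunk(chunk):
    """Placeholder chunk task: returns the number of detail records."""
    _header, details = chunk
    return len(details)


def process_in_parallel(lines, records_per_chunk, workers=4):
    """Run one chunk task per logical set; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = chunk_records(lines, records_per_chunk)
        return list(pool.map(process_chunk, chunks))
```

Because each chunk carries its own header, chunk tasks do not depend on each other and can be scheduled independently.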
Database – Design Thoughts
The database is one of the critical components for building proactive intelligence into the system about the lifecycle of the file being processed.
Database Design
• Identify and generalize the core subsystem entities. Ex: Data Map, File DeComp, Rule Management entities
• Identify the non-core subsystem entities and the key attributes that help in building intelligence and analytics. Ex: File Priority, Reprocess Flag, Smart Scheduling entities for dynamic processing
Optimization Code Practices
• Eliminate redundant visits to entities
• Perform block operations
• Replace cursors with Common Table Expressions
• Avoid WHILE loops
• Increase reusability through the use of functions
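The "perform block operations" and "replace cursors with Common Table Expressions" practices can be illustrated with one set-based statement that does the work a row-by-row cursor loop would otherwise do. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server; the table and column names (`stage_file`, `code_map`, `cache_set`) are hypothetical.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE stage_file (record_id INTEGER PRIMARY KEY, client_code TEXT);
CREATE TABLE code_map (client_code TEXT PRIMARY KEY, standard_code TEXT);
CREATE TABLE cache_set (record_id INTEGER PRIMARY KEY, standard_code TEXT);
"""


def load_cache_set(conn: sqlite3.Connection) -> int:
    """Map all staged records to standard codes in a single statement.

    Returns the number of rows now in cache_set.
    """
    conn.execute(
        """
        WITH mapped AS (
            SELECT s.record_id, m.standard_code
            FROM stage_file AS s
            JOIN code_map AS m ON m.client_code = s.client_code
        )
        INSERT INTO cache_set (record_id, standard_code)
        SELECT record_id, standard_code FROM mapped
        """
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM cache_set").fetchone()[0]
```

The database engine resolves the join once for the whole set, so no per-row loop or repeated visit to the mapping table appears in application code.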
SSIS vs. .NET – Benchmarks
Both solutions were optimized as far as possible within the known limitations and constraints.
SSIS vs. .NET – Multiple Files: Parallel
The parallel run on SSIS (decoupled mode) shows a large improvement over the business layer result when multiple files are processed.
SSIS – Progressive Improvement
Baseline file size tracked: 500 MB
Business Layer – Progressive Improvement
Baseline file size used: 500 MB
Architecture and Design Decisions
SSIS Design Decisions
• To manage large files, the file chunking approach was chosen
• To improve file processing performance, file chunking was decoupled
• To gain better performance, parallelism was implemented in the SSIS packages
• A new database was designed for parallel loading of data into cache tables for large sets
• Coding best practices were followed to gain high performance in SQL Server
Business Layer Decisions
• End-to-end processing of data in memory
• Removed the temporary tables
• Processing the data and loading it into cache tables
• Implemented parallelism while loading data into cache tables
Key Observations and Advice
Key Observations – Business Layer
• The memory footprint was high in the business layer implementation
• The memory footprint was high because of ReadXML behavior
• Scalability of the business layer is not encouraging for large files, as the memory footprint grows quite large
Key Observations – SSIS
• The key architecture and design choices made in SSIS show credible performance
• SSIS currently shows good scalability, meeting the client's file processing objectives
Throughput Calculations (SSIS)
• Baseline: 1000 files / day with an average file size of 250 MB requires the ability to process 250 GB / day
• Sequential throughput: 500 MB in 115 seconds equates to 375 GB / day. This exceeds the demand of 1000 files / day by 50%
• Multiple files in parallel throughput: 2.5 GB in 325 seconds equates to 664 GB / day. This exceeds the demand of 1000 files / day by 165%
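The throughput figures above follow from extrapolating each measured run to a full day. The Python sketch below reproduces the arithmetic, assuming 1 GB = 1000 MB and continuous processing for 24 hours; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_throughput_gb(megabytes: float, seconds: float) -> float:
    """Extrapolate a measured run to GB per day (1 GB = 1000 MB)."""
    return megabytes / seconds * SECONDS_PER_DAY / 1000


def headroom_percent(throughput_gb: float, demand_gb: float) -> float:
    """Percentage by which throughput exceeds demand."""
    return (throughput_gb / demand_gb - 1) * 100


# Demand: 1000 files/day x 250 MB average = 250 GB/day.
DEMAND_GB = 1000 * 250 / 1000

sequential = daily_throughput_gb(500, 115)   # about 375.7 GB/day
parallel = daily_throughput_gb(2500, 325)    # about 664.6 GB/day

sequential_headroom = headroom_percent(sequential, DEMAND_GB)  # about 50%
parallel_headroom = headroom_percent(parallel, DEMAND_GB)      # about 166%
```

These are best-case extrapolations: they assume the measured rate holds around the clock and across the full range of file sizes.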
Final Impression and Recommendation
• Achieved a 250% performance improvement over baseline results. This allows the business to process files of varied sizes, from 250 MB to 5 GB
• Improved the file processing rate from 250 files / day to 700 files / day
• Introducing more architecture and design extensions to the system, from both the application and infrastructure angles, helps to cross the 1000 files / day limit with the full length of tax file operations
Based on the current POC results and implementation choices, and taking into account the various design and resource constraints, the SSIS design method is recommended as a reliable and scalable implementation for the Tax File Processing System.