2.09k likes | 5.03k Views
Intro to Talend Open Studio for Data Integration. Philip Yurchuk http://philip.yurchuk.com. What is Talend ?. Eclipse-based visual programming editor Generates executable Java code Jobs can run standalone or embedded (no special server) Batch or interactive (user input). What is ETL?.
E N D
Intro to Talend Open StudioforData Integration Philip Yurchuk http://philip.yurchuk.com
What is Talend? • Eclipse-based visual programming editor • Generates executable Java code • Jobs can run standalone or embedded (no special server) • Batch or interactive (user input)
What is ETL? • Extract: suck up data • Transform: mess with it • Load: blow it out • Batch, integration, migration, etc.
Extract from/load to where? • Over 600 components • Over 450 connectors • Allows multiple inputs/outputs in single job
Connectors • Flat files • Delimted (tab, CSV…) • XML • JSON • Excel • Positional • Apache HTTP logs, HL7... • Applications/Platforms • Alfresco • Microsoft Dynamics (CRM, AX) • SAP • Sage ERP X3 • Salesforce • SugarCRM
Connectors (continued) • Relational Databases • MySQL • Postgresql • MS SQL • Oracle • Many more • NoSQL/Columnar/OLAP/Other • Amazon RedShift • Greenplum • Hive • OLAP cubes • LDAP • VectorWise • Teradata • More in Big Data ed.
How do we transport data? • File system • FTP • SFTP/SCP • Web service (SOAP, REST) • HTTP • Mail, POP • XMLRPC, Sockets, JMS, RSS...
Other Components • Process data: join, filter, aggregate • Flow control: loops, job invocation • Logs, statistics • Code: Java, Groovy • On row data or standalone • Can load libraries
Nifty Components • FuzzyMatch - calculate Levenshtein distance or phonetic similarity • IntervalMatch – perform lookup/join based on values falling within an interval • Replace, ReplaceList - search and replace, substitution • UniqRow - output distinct rows based on defined key columns
More Nifty Components • XMLMap - Allows joins, column or row filtering, transformations, and multiple outputs • Normalize/Denormalize - split delimited strings into columns or join columns into a string • AggregateRow – GROUP BY; min, max, sum, other functions used to aggregate rows on a column
Tips and Tricks • CamelCase job names for embedded jobs. • Or prefix with ETL phase and order of execution • Whenever appropriate (esp. for inserting data), use the schema from the repository. • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.
Tips and Tricks • Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Tips and Tricks • Failure handling subjob: • It’s an unconnected job (no triggers point to it) • Use LogCatcher to catch, record component failures. • Record failure in DB, file, email, etc. • Add rollback component to undo DB changes if necessary. May need to do this in the job if strategic placement is needed.
Tips and Tricks • In Java expressions, use methods, not operators. E.g., concat(String) instead of the dot operator, equals(Object) instead of ==. • Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860
Tips and Tricks • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.
Tips and Tricks • Use a context for job variables. • Note you can specify type for variables. • You can read from a file or database, or pass in a context if an embedded Java job.
Tips and Tricks • For multi-host deployment: • Export the job with a “bootstrap” context that has all variables, but populates only a context config location that is the same for all machines. • The context config file has all values required for that host, e.g. test DB connection for test machine. • You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C:\Data\ • Be mindful of file permissions for sensitive context data (e.g., DB password)
Tips and Tricks • Use “Bulk” output components when possible. • For transactional behavior: • Start the job with DB connection • Check “use existing connection” in all relevant components • Check "Die on error" in all relevant components • End job with commit component
Room for Improvement • UI stability • Documentation
Books • Getting Started with Talend Open Studio for Data Integration by Bowen Jonathan • Talend Open Studio Cookbook by Rick Daniel Barton • Big Data book coming…
Talend Forge • http://www.talendforge.org/ • Forum – super helpful • Exchange – free community components! • Tutorials • Bug tracker • Source code
Talend Resources • http://www.talend.com/resources • Help Center • Knowledge Base • Webinars, screencasts • Tutorials • Docs are on download page • And by pressing F1 on a component
Questions?Compliments?Consulting gigs? • Contact me: • philip@yurchuk.com • http://philip.yurchuk.com • http://www.linkedin.com/in/philipyurchuk/