1 / 25

Intro to Talend Open Studio for Data Integration

Intro to Talend Open Studio for Data Integration. Philip Yurchuk http://philip.yurchuk.com. What is Talend ?. Eclipse-based visual programming editor Generates executable Java code Jobs can run standalone or embedded (no special server) Batch or interactive (user input). What is ETL?.

axl
Download Presentation

Intro to Talend Open Studio for Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Intro to Talend Open StudioforData Integration Philip Yurchuk http://philip.yurchuk.com

  2. What is Talend? • Eclipse-based visual programming editor • Generates executable Java code • Jobs can run standalone or embedded (no special server) • Batch or interactive (user input)

  3. What is ETL? • Extract: suck up data • Transform: mess with it • Load: blow it out • Batch, integration, migration, etc.

  4. Extract from/load to where? • Over 600 components • Over 450 connectors • Allows multiple inputs/outputs in single job

  5. Connectors • Flat files • Delimted (tab, CSV…) • XML • JSON • Excel • Positional • Apache HTTP logs, HL7... • Applications/Platforms • Alfresco • Microsoft Dynamics (CRM, AX) • SAP • Sage ERP X3 • Salesforce • SugarCRM

  6. Connectors (continued) • Relational Databases • MySQL • Postgresql • MS SQL • Oracle • Many more • NoSQL/Columnar/OLAP/Other • Amazon RedShift • Greenplum • Hive • OLAP cubes • LDAP • VectorWise • Teradata • More in Big Data ed.

  7. How do we transport data? • File system • FTP • SFTP/SCP • Web service (SOAP, REST) • HTTP • Mail, POP • XMLRPC, Sockets, JMS, RSS...

  8. Other Components • Process data: join, filter, aggregate • Flow control: loops, job invocation • Logs, statistics • Code: Java, Groovy • On row data or standalone • Can load libraries

  9. Demo

  10. Nifty Components • FuzzyMatch - calculate Levenshtein distance or phonetic similarity • IntervalMatch – perform lookup/join based on values falling within an interval • Replace, ReplaceList - search and replace, substitution • UniqRow - output distinct rows based on defined key columns

  11. More Nifty Components • XMLMap - Allows joins, column or row filtering, transformations, and multiple outputs • Normalize/Denormalize - split delimited strings into columns or join columns into a string • AggregateRow – GROUP BY; min, max, sum, other functions used to aggregate rows on a column

  12. Tips and Tricks • CamelCase job names for embedded jobs. • Or prefix with ETL phase and order of execution • Whenever appropriate (esp. for inserting data), use the schema from the repository. • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated.

  13. Tips and Tricks • Propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.

  14. Tips and Tricks • Failure handling subjob: • It’s an unconnected job (no triggers point to it) • Use LogCatcher to catch, record component failures. • Record failure in DB, file, email, etc. • Add rollback component to undo DB changes if necessary. May need to do this in the job if strategic placement is needed.

  15. Tips and Tricks • In Java expressions, use methods, not operators. E.g., concat(String) instead of the dot operator, equals(Object) instead of ==. • Technical components (like hash maps) are hidden by default. See: http://www.talendforge.org/forum/viewtopic.php?pid=110860

  16. Tips and Tricks • When connecting, propagating changes to a DB component will change it to a built-in schema, which won't get updated after repo changes. • On the other hand, remember that for lookup/join (i.e., SELECT) queries you can modify the query to only select the fields you need. Propagating the schema is useful then.

  17. Tips and Tricks • Use a context for job variables. • Note you can specify type for variables. • You can read from a file or database, or pass in a context if an embedded Java job.

  18. Tips and Tricks • For multi-host deployment: • Export the job with a “bootstrap” context that has all variables, but populates only a context config location that is the same for all machines. • The context config file has all values required for that host, e.g. test DB connection for test machine. • You can rely on the fact that Windows will interpret root as the main system drive, so “/Data/” will translate to C:\Data\ • Be mindful of file permissions for sensitive context data (e.g., DB password)

  19. Tips and Tricks • Use “Bulk” output components when possible. • For transactional behavior: • Start the job with DB connection • Check “use existing connection” in all relevant components • Check "Die on error" in all relevant components • End job with commit component

  20. Room for Improvement • UI stability • Documentation

  21. Books • Getting Started with Talend Open Studio for Data Integration by Bowen Jonathan • Talend Open Studio Cookbook by Rick Daniel Barton • Big Data book coming…

  22. Talend Forge • http://www.talendforge.org/ • Forum – super helpful • Exchange – free community components! • Tutorials • Bug tracker • Source code

  23. Talend Resources • http://www.talend.com/resources • Help Center • Knowledge Base • Webinars, screencasts • Tutorials • Docs are on download page • And by pressing F1 on a component

  24. Questions?Compliments?Consulting gigs? • Contact me: • philip@yurchuk.com • http://philip.yurchuk.com • http://www.linkedin.com/in/philipyurchuk/

  25. Thank You!

More Related