Apache Hadoop Ingestion Patterns & Apache Flume Ted Malaska
Agenda • Selecting an Ingestion Strategy • Apache Flume • High Level Components • Flume’s Guarantees • Common Architectures • Detailed Configurations • Performance Tuning • Example
Selecting an Ingestion Strategy • Timeliness • Append or Delta • Access Patterns • Original Source System • Network Concerns • Transformation, Partitioning, and Bifurcation
Timeliness • Macro Batch: 15 minutes to hours • Micro Batch: 4 minutes to 15 minutes • Mini Micro Batch: Under 4 minutes but greater than 30 seconds • Near Real Time Decision Support: Under 30 seconds but over 2 seconds • Near Real Time Event Processing: Down to about 100 to 200 milliseconds • Real Time:
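The tiers above can be read as a simple lookup from required end-to-end latency to a named category. The following is a minimal sketch of that reading, not part of the original deck; the function name `timeliness_tier` is hypothetical, the handling of exact boundary values (for example exactly 30 seconds) is an assumption, and the Real Time tier is not modeled because the slide does not give its threshold.

```python
# Upper bounds in seconds, taken from the tiers listed on the slide.
# Anything below 2 seconds is treated as Near Real Time Event Processing;
# the slide gives no threshold for Real Time, so it is not modeled here.
TIERS = [
    (2.0, "Near Real Time Event Processing"),
    (30.0, "Near Real Time Decision Support"),
    (4 * 60.0, "Mini Micro Batch"),
    (15 * 60.0, "Micro Batch"),
]


def timeliness_tier(latency_seconds: float) -> str:
    """Return the slide's tier name for a required latency in seconds."""
    if latency_seconds < 0:
        raise ValueError("latency must be non-negative")
    for upper_bound, name in TIERS:
        if latency_seconds < upper_bound:
            return name
    # 15 minutes and above
    return "Macro Batch"
```

For example, a requirement of 60 seconds falls into the Mini Micro Batch tier, while a one-hour requirement is Macro Batch.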
Append or Delta • Existing Data is Immutable • Existing Data is Mutable for a Fixed Window • Existing Data is Always Mutable
Access Patterns • Batch • MR • Hive • Pig • Crunch • Graph • Time of Thought or NRT • Impala • Search • Get, Put, Scan
Original Source System • File System • RDBMS • Stream • Log Files
Network Concerns • Security • Bandwidth and Compression
Transformation, Partitioning, and Bifurcation • Transformation: Converting XML or JSON to delimited data. • Partitioning: Incoming data is stock trade data and must be partitioned by ticker. • Bifurcation: The data needs to land in both HDFS and HBase to serve different access patterns.
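The three steps above can be sketched together in a few lines. This is an illustration only, under several assumptions: the JSON field names (`ticker`, `ts`, `price`), the `|` delimiter, and the row-key layout are hypothetical, and plain Python containers stand in for HDFS directories and an HBase table.

```python
import json
from collections import defaultdict

# Hypothetical field order for the delimited output.
FIELDS = ("ticker", "ts", "price")


def ingest(raw_events, delimiter="|"):
    """Transform JSON trades, partition them by ticker, and bifurcate.

    Returns two in-memory stand-ins:
      hdfs_partitions: ticker -> list of delimited lines (batch access)
      hbase_rows:      row key -> parsed record (get/put/scan access)
    """
    hdfs_partitions = defaultdict(list)
    hbase_rows = {}
    for raw in raw_events:
        record = json.loads(raw)
        # Transformation: JSON -> delimited text
        line = delimiter.join(str(record[name]) for name in FIELDS)
        # Partitioning: group by ticker
        hdfs_partitions[record["ticker"]].append(line)
        # Bifurcation: the same event also lands in the second store
        row_key = f"{record['ticker']}:{record['ts']}"
        hbase_rows[row_key] = record
    return dict(hdfs_partitions), hbase_rows
```

In a real Flume deployment the partitioning and bifurcation would typically be handled by channel selectors and multiple sinks rather than by one function.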
Apache Flume • History • Scribe • Flume • Flume NG
High Level Components • [Diagram: Point A (Avro Client, JMS) → Sources → Interceptors → Selectors → Channels → Sinks → Point B (HDFS, HBase)]
Sources • AvroSource • HTTPSource • NetcatSource • SpoolDirectorySource • ExecSource • JMSSource • ThriftSource • SyslogTcpSource • SyslogUDPSource
Interceptors • RegexExtractorInterceptor • TimestampInterceptor • StaticInterceptor • HostInterceptor • Custom
Selectors • MultiplexingChannelSelector • ReplicatingChannelSelector • Custom
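The difference between the two built-in selectors is easiest to see side by side: a replicating selector copies every event to all channels, while a multiplexing selector picks channels from the value of one event header. The sketch below models that behavior in plain Python; it is not Flume's Java implementation, and the event layout (a dict with `headers` and `body`) and the function names are assumptions made for illustration.

```python
def replicating_selector(event, channels):
    """Every event goes to every configured channel."""
    return list(channels)


def multiplexing_selector(event, header, mapping, default):
    """Route by the value of one header.

    mapping: header value -> list of channel names
    default: channels used when the header is missing or unmapped
    """
    value = event["headers"].get(header)
    return list(mapping.get(value, default))


# Usage: route stock trades by a hypothetical "exchange" header.
trade = {"headers": {"exchange": "NYSE"}, "body": b"ABC|1|10.5"}
routes = multiplexing_selector(
    trade,
    header="exchange",
    mapping={"NYSE": ["c_nyse"], "NASDAQ": ["c_nasdaq"]},
    default=["c_other"],
)
```

Replication is the natural fit for bifurcation (the same data in two stores), and multiplexing for partitioning or alerting flows.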
Channel • FileChannel • MemoryChannel
Sinks • HDFSEventSink • HBaseSink • AsyncHBaseSink • NullSink • RollingFileSink • AvroSink • ThriftSink • MorphlineSink • ElasticSearchSink
Flume’s Guarantees • There is no such thing as a 100% guarantee • Flume offers several levels of configurable guarantees • This is done through transactions
Flume’s Guarantees (Transactions 1 of 3) • [Diagram: the Avro Client submits a batch to the Flume Agent; the agent confirms the batch with guarantees]
Flume’s Guarantees (Transactions 2 of 3) • [Diagram: Point A (Avro Client, JMS) → Sources → Interceptors → Selectors → Channels → Sinks → Point B (HDFS, HBase)]
Flume’s Guarantees (Transactions 3 of 3) • Memory Channel: Best Effort • File Channel: JBOD • File Channel: Raid • File Channel: NAS or SAN
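The transactional hand-off behind these guarantees can be sketched as follows: a batch is either fully accepted by the channel or rejected, and a batch taken by a sink is put back if delivery fails. This is a simplified in-memory model, not Flume's actual channel API; the class and method names are hypothetical, and durability (the difference between the memory and file channel options above) is not modeled.

```python
from collections import deque


class ChannelFullError(Exception):
    """Raised when a batch cannot be accepted by the channel."""


class SketchChannel:
    def __init__(self, capacity, transaction_capacity):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def put_batch(self, events):
        """Source side: accept the whole batch or none of it."""
        events = list(events)
        if len(events) > self.transaction_capacity:
            raise ChannelFullError("batch exceeds transaction capacity")
        if len(self._queue) + len(events) > self.capacity:
            raise ChannelFullError("channel is full")
        self._queue.extend(events)  # commit

    def take_batch(self, max_events, deliver):
        """Sink side: remove a batch, roll back if delivery raises."""
        count = min(max_events, self.transaction_capacity, len(self._queue))
        batch = [self._queue.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            # Rollback: restore the events at the front, in original order.
            self._queue.extendleft(reversed(batch))
            raise
        return len(batch)
```

Because a failed delivery leaves the events in the channel, the sink can retry later; the cost is that a batch may be delivered more than once if the failure happened after the downstream system had already stored it.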
Common Architectures (Bifurcation) • [Diagram: one flow delivering to both HDFS and HDFS DR]
Common Architectures (Alerting or Partitioning) • [Diagram labels: HDFS, Partition 1, Partition 2, HBase]
Detailed Configurations: Avro Source & Client • Bind and port • Threads • Batch Size • Compression • SSL Encryption • IP Filtering
Detailed Configurations: JMS Source • Connection Factory • Provider URL • Destination Name • Destination Type (queue or topic) • Message Selector • User Name • Password File • Batch Size
Detailed Configurations: FileChannel • User home • Data Directories • Capacity • Keep alive • Transaction Capacity • Checkpointing • Directory • Use Dual Checkpoints • Backup checkpoint directory • Checkpoint Interval • Max file size • Minimum required space • useFastReplay • encryptionActiveKey & encryptionCipherProvider
Detailed Configurations: MemoryChannel • Capacity • transactionCapacity • byteCapacity • byteCapacityBufferPercentage • Keep-Alive
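Settings like the ones listed for the Avro source and the memory channel end up as keys in a Flume agent properties file, prefixed with the agent and component names. The sketch below renders such a file from a Python dict to show the key layout; the agent and component names (`a1`, `r1`, `c1`, `k1`), the port, the HDFS path, and all numeric values are illustrative assumptions, not recommendations from the deck.

```python
def render_agent_config(agent, settings):
    """Render settings as '<agent>.<key> = <value>' lines."""
    return "\n".join(f"{agent}.{key} = {value}" for key, value in settings.items())


# Hypothetical single-hop agent: Avro source -> memory channel -> HDFS sink.
settings = {
    "sources": "r1",
    "channels": "c1",
    "sinks": "k1",
    # Avro source: bind and port
    "sources.r1.type": "avro",
    "sources.r1.bind": "0.0.0.0",
    "sources.r1.port": 4141,
    "sources.r1.channels": "c1",
    # Memory channel: capacity settings (values are placeholders)
    "channels.c1.type": "memory",
    "channels.c1.capacity": 10000,
    "channels.c1.transactionCapacity": 1000,
    "channels.c1.keep-alive": 3,
    # HDFS sink reading from the same channel
    "sinks.k1.type": "hdfs",
    "sinks.k1.channel": "c1",
    "sinks.k1.hdfs.path": "/flume/events",
}

config_text = render_agent_config("a1", settings)
```

Note that a source lists its `channels` (plural, it may feed several) while a sink names a single `channel`.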
Example of Configuration: HDFSEventSink (1 of 3) • hdfs.path • hdfs.filePrefix • hdfs.inUsePrefix • hdfs.inUseSuffix • hdfs.rollInterval • hdfs.rollCount • hdfs.rollSize • hdfs.codeC • hdfs.fileType • hdfs.idleTimeout • hdfs.batchSize • hdfs.threadsPoolSize
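The three roll settings interact as independent triggers: the sink closes the current file and opens a new one when any enabled trigger fires, and in Flume a value of 0 disables the corresponding trigger. The function below is a sketch of that decision only, with hypothetical names; it is not the sink's actual code, and the example thresholds are illustrative.

```python
def should_roll(open_seconds, event_count, bytes_written,
                roll_interval, roll_count, roll_size):
    """Return True when any enabled roll trigger has been reached.

    roll_interval: seconds a file may stay open (0 disables)
    roll_count:    events per file (0 disables)
    roll_size:     bytes per file (0 disables)
    """
    by_time = roll_interval > 0 and open_seconds >= roll_interval
    by_count = roll_count > 0 and event_count >= roll_count
    by_size = roll_size > 0 and bytes_written >= roll_size
    return by_time or by_count or by_size


# Usage: roll on size only (about 128 MB), with time and count disabled.
size_only = dict(roll_interval=0, roll_count=0, roll_size=128 * 1024 * 1024)
```

Disabling the count and time triggers and rolling on size alone is a common way to avoid producing many small files in HDFS.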
Example of Configuration: HDFSEventSink (2 of 3) • Path Escaping • Using Headers to partition data
Alias | Description
%{host} | Substitute value of event header named “host”. Arbitrary header names are supported.
%t | Unix time in milliseconds
%a | locale’s short weekday name (Mon, Tue, ...)
%A | locale’s full weekday name (Monday, Tuesday, ...)
%b | locale’s short month name (Jan, Feb, ...)
%B | locale’s long month name (January, February, ...)
%c | locale’s date and time (Thu Mar 3 23:05:25 2005)
%d | day of month (01)
%D | date; same as %m/%d/%y
%H | hour (00..23)
%I | hour (01..12)
%j | day of year (001..366)
%k | hour ( 0..23)
%m | month (01..12)
%M | minute (00..59)
%p | locale’s equivalent of am or pm
%s | seconds since 1970-01-01 00:00:00 UTC
%S | second (00..59)
%y | last two digits of year (00..99)
%Y | year (2010)
%z | +hhmm numeric timezone (for example, -0400)
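To make the escaping concrete, the sketch below expands a small subset of these aliases against an event's headers. It is a Python illustration, not the sink's implementation: the function name is hypothetical, only `%{header}`, `%t`, `%s`, and a few numeric date aliases are handled, and the time is read from a `timestamp` header in milliseconds and formatted in UTC, which is an assumption made to keep the output deterministic.

```python
import re
from datetime import datetime, timezone

# Subset of date aliases that map directly onto strftime directives.
DATE_ALIASES = set("YymdHMSj")

_ESCAPE = re.compile(r"%\{([^}]+)\}|%(.)")


def escape_path(path, headers):
    """Expand %{header} and a subset of % aliases in an HDFS path."""
    ts_ms = int(headers["timestamp"])
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)

    def expand(match):
        header_name, alias = match.group(1), match.group(2)
        if header_name is not None:
            return headers[header_name]      # %{host} and friends
        if alias == "t":
            return str(ts_ms)                # Unix time in milliseconds
        if alias == "s":
            return str(ts_ms // 1000)        # seconds since the epoch
        if alias in DATE_ALIASES:
            return when.strftime("%" + alias)
        raise ValueError(f"alias %{alias} is not handled by this sketch")

    return _ESCAPE.sub(expand, path)


# Usage: partition by host and day, using the table's example date.
headers = {"host": "web01", "timestamp": "1109891125000"}
landing_dir = escape_path("/flume/%{host}/%Y/%m/%d/%H", headers)
```

Here `landing_dir` comes out as `/flume/web01/2005/03/03/23`, so events from each host land in their own hourly directory.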
Example of Configuration: HDFSEventSink (3 of 3) • File Formats and Compression • Text Files • Sequence Files • Avro Files • Can’t Use Columnar File Types • RC • Parquet
Example of Configuration: HBaseSink • Table name • Column Family • Batch size • HBase user • kerberosPrincipal & kerberosKeytab • enableWal • Serializer
Example of Configuration: AsyncHBaseSink • Table name • Column Family • Batch size • HBase user • enableWal • Serializer