In the Big Data world, irrespective of the type and domain of the application, one of the commonly required services is ETL: collecting, aggregating, and moving data from one or many sources to a centralized data store or to multiple destinations. Apache Flume is one of the prominent tools in this space. One of the important aspects of ETL is fanning data out to selected channels, which is commonly known as multiplexing, or fanning data out to all configured channels, which is data replication. This moving of data is nothing but copying a Flume event from an external source to a destination.

In the case of multiplexing, an event is routed to a selected channel when the event's attribute matches a preconfigured value. In the case of replication, events are copied to all channels.

Let's look at some common scenarios that involve multiplexing or replicating, with the help of a clickstream analytics or web server log processing application. Consider a requirement to store or analyze/process logs based on their status code: all logs with successful status codes should be stored in Cassandra, and the rest should be fed to Apache Spark to discover interesting facts about such requests. Alternatively, you might be required to move every log to ElasticSearch, Cassandra, and a Spark engine; here, each event is sent to all channels, which is replicating. Apache Flume is a very mature tool in this space and supports multiplexing and replicating superbly. Let's see how to configure Flume to perform both.

Requirements

From the source, we are getting employee data in the format below:

employee_role, employee_id, employee_name, employee_city

The source could be real-time streaming or anything else; in this example, I have used the Netcat service as a source, which listens on a given port and turns each line of text into an event.

Now, we want to address the following requirements. We have to store the data in HDFS in text format, and the data has to be stored separately based on role: a manager's data is to be stored at flume_multiplexing_data/manager and a developer's data at flume_multiplexing_data/developer. For example, records with employee_role = 1 should be stored under flume_multiplexing_data/manager:

Role,ID,Name,City

And records with employee_role = 2 should be stored under flume_multiplexing_data/developer:

2, E4,Sanjeev,delhi

Here, we have two sinks pertaining to HDFS and, correspondingly, two channels.

The requirement could also be to copy data from the source to all configured sinks. Let's consider we have three sinks: HDFS, Hive, and Avro. In this scenario, we need to move data to all configured channels. Plainly speaking, multiplexing and replicating are two sides of the same coin, with the difference of an additional filtering process.

Configuring Apache Flume for Multiplexing

It's absolutely essential to understand the basic crux of multiplexing, which is the event attribute. Extraction of the event attribute has to be done while fetching or streaming the data, and Flume provides the regex_extractor interceptor to do exactly that. First and foremost, a regular expression has to be supplied to Flume; based on it, the regex_extractor interceptor extracts the regex match groups and appends them to the event as headers. It also supports pluggable serializers for formatting the match groups before adding them as event headers.

Let's start with the Flume configuration; I will keep appending configuration as we progress. First, configure the source as Netcat and provide the port and host details:

```
# Describe/configure the source
```

Next, define details for the channels that buffer events in memory:

```
# Use a channel c1 which buffers events in memory
# Use a channel c2 which buffers events in memory
```

The next step is to define an interceptor of type regex_extractor to extract the pattern we need. Then, for the interceptor, define a serializer t; the extracted event attribute is added as a header under the field role. The regular expression ^(\d) is for single-digit matching; for a different pattern, you just have to change the regular expression.

```
# Describe regex_extractor to extract different patterns
```
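Pulling the steps above together, a complete agent definition might look like the sketch below. The regex ^(\d), the serializer name t, the role header, the in-memory channels c1/c2, and the manager/developer directories come from the walkthrough; the agent name a1, the component names r1/k1/k2, the bind host and port, and the exact HDFS paths are illustrative assumptions, not values from the original article.

```properties
# Name the components of agent a1 (agent/component names are assumptions)
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Describe/configure the source: Netcat listening on a host/port (values assumed)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Use channels c1 and c2 which buffer events in memory
a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Describe regex_extractor: capture the leading single digit into header "role"
# (backslash is doubled because the file is parsed as Java properties)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(\\d)
a1.sources.r1.interceptors.i1.serializers = t
a1.sources.r1.interceptors.i1.serializers.t.name = role

# Multiplexing channel selector: route events on the "role" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = role
a1.sources.r1.selector.mapping.1 = c1
a1.sources.r1.selector.mapping.2 = c2

# HDFS sinks writing text data to the role-specific directories (paths assumed)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume_multiplexing_data/manager
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1

a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume_multiplexing_data/developer
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.channel = c2

# Attach the source to both channels so the selector can choose between them
a1.sources.r1.channels = c1 c2
```

For the replicating scenario described earlier (HDFS, Hive, and Avro sinks), the selector line would instead be `a1.sources.r1.selector.type = replicating` (which is Flume's default), with all configured channels listed against the source so that every event is copied to each of them.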