Flume的安装和使用

一、安装Flume

1. 下载flume-1.7.0

http://mirrors.shu.edu.cn/apache/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz

2. 解压改名

tar xvf apache-flume-1.7.0-bin.tar.gz

mv apache-flume-1.7.0-bin flume

二、配置Flume

1. 配置sh文件

cp conf/flume-env.sh.template conf/flume-env.sh

2. 配置conf文件

cp conf/flume-conf.properties.template conf/flume.conf

三、将文件输出到日志

1、配置源是那里，目标是那里

# source: avro, channel: memory, sink: log

# Define a memory channel called ch1 on agent1

agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it

# to bind to 0.0.0.0:41414. Connect it to channel ch1.

agent1.sources.avro-source1.channels = ch1

agent1.sources.avro-source1.type = avro

agent1.sources.avro-source1.bind = 0.0.0.0

agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives

# and connect it to the other end of the same channel.

agent1.sinks.log-sink1.channel = ch1

agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell

# agent1 which ones we want to activate.

agent1.channels = ch1

agent1.sources = avro-source1

agent1.sinks = log-sink1

2、启动Flume agent

bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n agent1

3、启动Flume Client

echo hello, world! > /usr/local/flume-test.log

echo hello, hadoop! >> /usr/local/flume-test.log

echo hello, flume! >> /usr/local/flume-test.log

echo hello, spark! >> /usr/local/flume-test.log

bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F /usr/local/flume-test.log -Dflume.root.logger=DEBUG,console

4、client启动结果

2018-05-11 11:24:05,624 (main) [DEBUG - org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:498)] Batch size string = 5

2018-05-11 11:24:05,644 (main) [WARN - org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:634)] Using default maxIOWorkers

2018-05-11 11:24:06,100 (main) [DEBUG - org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:233)] Finished

2018-05-11 11:24:06,100 (main) [DEBUG - org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:236)] Closing reader

2018-05-11 11:24:06,101 (main) [DEBUG - org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:240)] Closing RPC client

2018-05-11 11:24:06,107 (main) [DEBUG - org.apache.flume.client.avro.AvroCLIClient.main(AvroCLIClient.java:84)] Exiting

5、Flume Server结果

2018-05-11 11:24:05,772 (New I/O server boss #3) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xce7f8eac, /127.0.0.1:52790 => /127.0.0.1:41414] OPEN

2018-05-11 11:24:05,773 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xce7f8eac, /127.0.0.1:52790 => /127.0.0.1:41414] BOUND: /127.0.0.1:41414

2018-05-11 11:24:06,076 (New I/O worker #1) [DEBUG - org.apache.flume.source.AvroSource.appendBatch(AvroSource.java:378)] Avro source avro-source1: Received avro event batch of 4 events.

2018-05-11 11:24:06,104 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xce7f8eac, /127.0.0.1:52790 :> /127.0.0.1:41414] DISCONNECTED

2018-05-11 11:24:06,104 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.channelClosed(NettyServer.java:209)] Connection to /127.0.0.1:52790 disconnected.

2018-05-11 11:24:08,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 hello, world! }

2018-05-11 11:24:08,871 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 2C 20 73 70 61 72 6B 21 hello, spark! }

可以看到server端收到了client写入的数据

四、用Flume监控目录将新文件上传到HDFS

1、配置方式

# 将/usr/local/flume/tmpData下的文本文件上传到hdfs://node10:9000/test/

# source: spooldir, channel: memory, sink: hdfs

# Define a memory channel called ch1 on agent1

agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it

# to bind to 0.0.0.0:41414. Connect it to channel ch1.

agent1.sources.spooldir-source1.channels = ch1

agent1.sources.spooldir-source1.type = spooldir

agent1.sources.spooldir-source1.spoolDir=/usr/local/flume/tmpData

# Event的headers增加文件名如：headers:{basename=a.txt}

agent1.sources.spooldir-source1.basenameHeader = true

# Define a logger sink that simply logs all events it receives

# and connect it to the other end of the same channel.

# 文件名是：a.txt.1536559992413

agent1.sinks.hdfs-sink1.channel = ch1

agent1.sinks.hdfs-sink1.type = hdfs

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://node10:9000/test

# hdfs.filePrefix = log_%Y%m%d_%H 如：log_20151016_13.1444973768543

agent1.sinks.hdfs-sink1.hdfs.filePrefix = %{basename}

agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true

agent1.sinks.hdfs-sink1.hdfs.round = true

agent1.sinks.hdfs-sink1.hdfs.roundValue = 10

agent1.sinks.hdfs-sink1.hdfs.rollSize = 0

agent1.sinks.hdfs-sink1.hdfs.rollCount = 0

# Finally, now that we've defined all of our components, tell

# agent1 which ones we want to activate.

agent1.channels = ch1

agent1.sources = spooldir-source1

agent1.sinks = hdfs-sink1

2、通过Flume启动agent

bin/flume-ng agent --conf ./conf/ -f ./conf/flume.conf --name agent1 -Dflume.root.logger=DEBUG,console

3、查看hdfs

http://node10（节点ip）:50070

3、监控目录的文件

每个文件读取完之后，修改后缀为.COMPLETED，重启flume时忽略这些文件

五、多个Agent - 多个Collector配置

1、Flume Agent配置

#flume-client.properties

##### source: log(node1), channel: memory, sink: avro(node2, node3)

#agent1 name

agent1.channels = c1

agent1.sources = r1

agent1.sinks = k1 k2

#set gruop

agent1.sinkgroups = g1

#set channel

agent1.channels.c1.type = memory

agent1.channels.c1.capacity = 1000

agent1.channels.c1.transactionCapacity = 100

agent1.sources.r1.channels = c1

agent1.sources.r1.type = exec

agent1.sources.r1.command = tail -F /usr/local/hadoop/logs/hadoop-root-namenode-node1.log

agent1.sources.r1.interceptors = i1 i2

agent1.sources.r1.interceptors.i1.type = static

agent1.sources.r1.interceptors.i1.key = Type

agent1.sources.r1.interceptors.i1.value = LOGIN

agent1.sources.r1.interceptors.i2.type = timestamp

# set sink1

agent1.sinks.k1.channel = c1

agent1.sinks.k1.type = avro

agent1.sinks.k1.hostname = node2

agent1.sinks.k1.port = 52020

# set sink2

agent1.sinks.k2.channel = c1

agent1.sinks.k2.type = avro

agent1.sinks.k2.hostname = node3

agent1.sinks.k2.port = 52020

#set sink group

agent1.sinkgroups.g1.sinks = k1 k2

#设置失效恢复

agent1.sinkgroups.g1.processor.type = failover

agent1.sinkgroups.g1.processor.priority.k1 = 10

agent1.sinkgroups.g1.processor.priority.k2 = 1

agent1.sinkgroups.g1.processor.maxpenalty = 10000

2、Flume Collector配置

# flume-server.properties

#set Agent name

a1.sources = r1

a1.channels = c1

a1.sinks = k1

#set channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# other node,change node2 to node3

a1.sources.r1.type = avro

a1.sources.r1.bind = node2

a1.sources.r1.port = 52020

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = static

a1.sources.r1.interceptors.i1.key = Collector

a1.sources.r1.interceptors.i1.value = NNA

a1.sources.r1.channels = c1

#set sink to hdfs

a1.sinks.k1.type=hdfs

a1.sinks.k1.hdfs.path=/flume/logdfs

a1.sinks.k1.hdfs.fileType=DataStream

a1.sinks.k1.hdfs.writeFormat=TEXT

a1.sinks.k1.hdfs.rollInterval=1

a1.sinks.k1.channel=c1

a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d

2、Flume启动Agent

flume-ng agent -n agent1 -c conf -f conf/flume-client.properties -Dflume.root.logger=DEBUG,console

3、Flume启动Collector

flume-ng agent -n a1 -c conf -f conf/flume-server.properties -Dflume.root.logger=DEBUG,console

4、Flume Failover测试

Collector1的权重比Collector2大

kill掉Collector1，Collector2工作

恢复Collector1，Collector1工作

5、Flume LoadBalancer

# 设置负载均衡

#agent1.sinkgroups.g1.processor.type = load_balance

#agent1.sinkgroups.g1.processor.backoff = true

#agent1.sinkgroups.g1.processor.selector = round_robin

六、Flume 整合Kafka

#设置Kafka接收器

a1.sinks.k2.type= org.apache.flume.sink.kafka.KafkaSink

#设置Kafka的broker地址和端口号

a1.sinks.k2.kafka.bootstrap.servers=node1:9092,node2:9092,node3:9092

#a1.sinks.k2.defaultPartitionId=0

a1.sinks.k2.producer.type=sync

#设置Kafka的Topic

a1.sinks.k2.kafka.topic=TestTopic

#设置序列化方式

a1.sinks.k2.serializer.class=kafka.serializer.StringEncoder

a1.sinks.k2.channel=c1

七、flume传递header到spark

#a1.sinks.k2.serializer.class=org.apache.kafka.common.serialization.BytesDeserializer

a1.sinks.k2.useFlumeEventFormat = true

val kafkaParams = Map[String, Object](

"bootstrap.servers" -> "192.168.60.15:9092",

"key.deserializer" -> classOf[StringDeserializer],

"value.deserializer" -> classOf[BytesDeserializer],

"group.id" -> "use_a_separate_group_id_for_each_stream",

"auto.offset.reset" -> "latest",

"enable.auto.commit" -> (false: java.lang.Boolean))

val topics = Array("sspclick")

val sspclick = KafkaUtils.createDirectStream[String, Bytes](

ssc,

PreferConsistent,

Subscribe[String, Bytes](topics, kafkaParams))

val reader = new SpecificDatumReader[AvroFlumeEvent](classOf[AvroFlumeEvent])

val a = sspclick.map(f => {

var body = f.value().get

var decoder = DecoderFactory.get().binaryDecoder(body, null);

var result = reader.read(null, decoder);

var hostname = result.getHeaders.get(new Utf8("hostname"))

var text = new String(result.getBody.array())

var array = text.split("\t")