Configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.88.129:8020/test/
a1.sinks.k1.hdfs.filePrefix = logs
a1.sinks.k1.hdfs.inUsePrefix = .
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 1048576
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 6000
a1.sinks.k1.hdfs.writeFormat = text
a1.sinks.k1.hdfs.fileType = DataStream
Code Analysis
For Flume's HDFS sink, let's start with the configure() method, which reads the properties defined in the configuration file, for example:
rollInterval = context.getLong("hdfs.rollInterval", defaultRollInterval);
rollSize = context.getLong("hdfs.rollSize", defaultRollSize);
rollCount = context.getLong("hdfs.rollCount", defaultRollCount);
batchSize = context.getLong("hdfs.batchSize", defaultBatchSize);

if (codecName == null) {
    codeC = null;
    compType = CompressionType.NONE;
} else {
    codeC = getCodec(codecName);
    // TODO : set proper compression type
    compType = CompressionType.BLOCK;
}
Typically we set rollInterval and rollCount to 0; otherwise the defaults of 30 seconds and 10 events will roll the file. From the official documentation:
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)
Once the configuration is in place the sink is started; look at the start() method of the sink:
public void start() {
    String timeoutName = "hdfs-" + getName() + "-call-runner-%d";
    callTimeoutPool = Executors.newFixedThreadPool(threadsPoolSize,
        new ThreadFactoryBuilder().setNameFormat(timeoutName).build());

    // thread pool for scheduled file rolling
    String rollerName = "hdfs-" + getName() + "-roll-timer-%d";
    timedRollerPool = Executors.newScheduledThreadPool(rollTimerPoolSize,
        new ThreadFactoryBuilder().setNameFormat(rollerName).build());

    this.sfWriters = new WriterLinkedHashMap(maxOpenFiles);
    sinkCounter.start();
    super.start();
}
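The sfWriters map created here with maxOpenFiles is worth a note: WriterLinkedHashMap is an access-ordered LinkedHashMap that closes and evicts the least recently used BucketWriter once more than maxOpenFiles files are cached. Below is a minimal sketch of that pattern; the class name LruWriterCache and the generic Closeable value type are illustrative, not Flume's actual code.

import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered map: once more than maxOpenFiles writers are cached,
// the least recently used one is closed and evicted.
public class LruWriterCache<W extends Closeable> extends LinkedHashMap<String, W> {
    private final int maxOpenFiles;

    public LruWriterCache(int maxOpenFiles) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxOpenFiles = maxOpenFiles;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, W> eldest) {
        if (size() > maxOpenFiles) {
            try {
                eldest.getValue().close(); // release the file handle of the evicted writer
            } catch (IOException e) {
                // log and continue; the entry is evicted either way
            }
            return true;
        }
        return false;
    }
}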
The actual data handling happens in the process() method: it takes events from the channel, treats up to batchSize events as one batch, and writes them to HDFS files. For each event it resolves the target path and file name, looks up (or creates) a BucketWriter for that file, and the actual write to HDFS goes through that BucketWriter:
for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
    Event event = channel.take();
    if (event == null) {
        break;
    }

    // reconstruct the path name by substituting place holders
    String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
        timeZone, needRounding, roundUnit, roundValue, useLocalTime);
    String realName = BucketPath.escapeString(fileName, event.getHeaders(),
        timeZone, needRounding, roundUnit, roundValue, useLocalTime);

    String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
    BucketWriter bucketWriter;
    HDFSWriter hdfsWriter = null;

    // Callback to remove the reference to the bucket writer from the
    // sfWriters map so that all buffers used by the HDFS file
    // handles are garbage collected.
    WriterCallback closeCallback = new WriterCallback() {
        @Override
        public void run(String bucketPath) {
            LOG.info("Writer callback called.");
            synchronized (sfWritersLock) {
                sfWriters.remove(bucketPath);
            }
        }
    };

    synchronized (sfWritersLock) {
        bucketWriter = sfWriters.get(lookupPath);
        // we haven't seen this file yet, so open it and cache the handle
        if (bucketWriter == null) {
            // get the concrete HDFSWriter implementation for the configured fileType
            hdfsWriter = writerFactory.getWriter(fileType);
            // build a BucketWriter for this file; the actual HDFS writes go through it
            bucketWriter = initializeBucketWriter(realPath, realName,
                lookupPath, hdfsWriter, closeCallback);
            sfWriters.put(lookupPath, bucketWriter);
        }
    }

    // track the buckets getting written in this transaction
    if (!writers.contains(bucketWriter)) {
        writers.add(bucketWriter);
    }

    // Write the data to HDFS
    try {
        bucketWriter.append(event);
    } catch (BucketClosedException ex) {
        LOG.info("Bucket was closed while trying to append, " +
            "reinitializing bucket and writing event.");
        hdfsWriter = writerFactory.getWriter(fileType);
        bucketWriter = initializeBucketWriter(realPath, realName,
            lookupPath, hdfsWriter, closeCallback);
        synchronized (sfWritersLock) {
            sfWriters.put(lookupPath, bucketWriter);
        }
        bucketWriter.append(event);
    }
}

if (txnEventCount == 0) {
    sinkCounter.incrementBatchEmptyCount();
} else if (txnEventCount == batchSize) {
    sinkCounter.incrementBatchCompleteCount();
} else {
    sinkCounter.incrementBatchUnderflowCount();
}

// flush all pending buckets before committing the transaction
for (BucketWriter bucketWriter : writers) {
    bucketWriter.flush();
}
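The loop above is only part of the picture: in the real process() method it runs inside a channel transaction, which is committed after the flush and rolled back on failure. A simplified sketch of that surrounding structure (error handling and counters trimmed, so this is not the verbatim Flume code):

public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction transaction = channel.getTransaction();
    transaction.begin();
    int txnEventCount = 0;
    try {
        // ... the batch loop shown above: channel.take(), resolve the bucket,
        //     bucketWriter.append(event) ...
        // ... flush all pending BucketWriters ...
        transaction.commit();
        return (txnEventCount < 1) ? Status.BACKOFF : Status.READY;
    } catch (Throwable th) {
        transaction.rollback();
        throw new EventDeliveryException(th);
    } finally {
        transaction.close();
    }
}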
As you can see, the actual write is delegated to the BucketWriter class. Note that when the BucketWriter is initialized, an hdfsWriter is passed in. In the example above we configured hdfs.fileType = DataStream, so hdfsWriter = writerFactory.getWriter(fileType) returns the concrete implementation HDFSDataStream:
public HDFSWriter getWriter(String fileType) throws IOException {
    if (fileType.equalsIgnoreCase(SequenceFileType)) {
        return new HDFSSequenceFile();
    } else if (fileType.equalsIgnoreCase(DataStreamType)) {
        return new HDFSDataStream();
    } else if (fileType.equalsIgnoreCase(CompStreamType)) {
        return new HDFSCompressedDataStream();
    } else {
        throw new IOException("File type " + fileType + " not supported");
    }
}
Now let's look at the BucketWriter class. When append() is called it first calls open(), which ultimately invokes the open() method of the HDFSWriter (HDFSDataStream in our case):
public void open(String filePath) throws IOException {
    Configuration conf = new Configuration();
    Path dstPath = new Path(filePath);
    FileSystem hdfs = getDfs(conf, dstPath);
    doOpen(conf, dstPath, hdfs);
}
Stepping into doOpen(), you can see that an output stream to the destination file is eventually obtained:
outStream = hdfs.create(dstPath);
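To make that last step concrete, here is a small standalone program that does the same thing as hdfs.create(dstPath) using the plain Hadoop FileSystem API; the class name and file name are made up for the example, and the path reuses the NameNode address from the configuration above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dstPath = new Path("hdfs://192.168.88.129:8020/test/demo.log");
        // resolve the FileSystem from the path's scheme, similar to getDfs() above
        FileSystem hdfs = dstPath.getFileSystem(conf);
        // the same call BucketWriter ends up making: create an output stream to the file
        FSDataOutputStream outStream = hdfs.create(dstPath);
        outStream.write("hello hdfs\n".getBytes("UTF-8"));
        outStream.close();
        hdfs.close();
    }
}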
A serializer is then constructed. Flume's default serializerType is TEXT, which means BodyTextEventSerializer is used to serialize the data:
serializer = EventSerializerFactory.getInstance(
serializerType, serializerContext, outStream);
Back in BucketWriter's append() method, we can see where the event is actually written:
callWithTimeout(new CallRunner<Void>() {
    @Override
    public Void call() throws Exception {
        writer.append(event); // could block
        return null;
    }
});
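callWithTimeout() is how BucketWriter guards every HDFS call: the CallRunner is handed to the callTimeoutPool created in start() and bounded by the hdfs.callTimeout setting, so a hung NameNode or DataNode cannot block the sink forever. A simplified sketch of that pattern (the method body below is illustrative, not the exact Flume implementation; it uses java.util.concurrent.ExecutorService, Future, TimeUnit and TimeoutException):

private <T> T callWithTimeout(ExecutorService callTimeoutPool, long callTimeoutMs,
                              Callable<T> callRunner) throws IOException {
    Future<T> future = callTimeoutPool.submit(callRunner);
    try {
        if (callTimeoutMs > 0) {
            return future.get(callTimeoutMs, TimeUnit.MILLISECONDS);
        }
        return future.get();
    } catch (TimeoutException e) {
        future.cancel(true); // interrupt the blocked HDFS call
        throw new IOException("HDFS call timed out after " + callTimeoutMs + " ms", e);
    } catch (Exception e) {
        throw new IOException("HDFS call failed", e);
    }
}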
What actually runs here is the append() method of the HDFSWriter, and HDFSWriter.append() in turn calls the serializer's write() method:
public void append(Event e) throws IOException {
    serializer.write(e);
}
As noted above, with our configuration the serializer implementation is BodyTextEventSerializer, so the following method is ultimately what writes the event to HDFS:
public void write(Event e) throws IOException {
    out.write(e.getBody());
    if (appendNewline) {
        out.write('\n');
    }
}
Notice that by default Flume appends a newline character after each event; this can be turned off in the sink configuration (BodyTextEventSerializer supports an appendNewline option, i.e. a1.sinks.k1.serializer.appendNewline = false).
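This behaviour is easy to verify outside of HDFS with a ByteArrayOutputStream, since EventSerializerFactory only needs an OutputStream. The small demo below (class name and event body are made up for the example) prints the body followed by the newline the serializer appends:

import java.io.ByteArrayOutputStream;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.serialization.EventSerializer;
import org.apache.flume.serialization.EventSerializerFactory;

public class TextSerializerDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // "TEXT" is the default serializerType, backed by BodyTextEventSerializer
        EventSerializer serializer =
            EventSerializerFactory.getInstance("TEXT", new Context(), out);
        serializer.afterCreate();
        Event event = EventBuilder.withBody("hello flume".getBytes("UTF-8"));
        serializer.write(event);   // ends up in BodyTextEventSerializer.write()
        serializer.flush();
        serializer.beforeClose();
        System.out.println(out.toString("UTF-8")); // prints "hello flume" plus '\n'
    }
}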
On the rollCount and rollSize parameters
When appending, BucketWriter calls shouldRotate() to decide whether the current file needs to be rolled:
// roll based on the number of events written
if ((rollCount > 0) && (rollCount <= eventCounter)) {
    LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
    doRotate = true;
}

// roll based on size: with a1.sinks.k1.hdfs.rollSize = 1048576, once the bytes
// written so far reach the configured rollSize, the file must be rolled
if ((rollSize > 0) && (rollSize <= processSize)) {
    LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
    doRotate = true;
}
About processSize: it is the running total of event body sizes, accumulated on every append:
processSize += event.getBody().length;
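Putting the two checks together, here is a small standalone version of the decision (field names mirror the snippet above; the method and the sample numbers are only for illustration). With our configuration of rollCount = 0 and rollSize = 1048576, only the size check can trigger a roll, so the file rotates once roughly 1 MB of event bodies has been appended.

// standalone version of the roll decision shown above (illustrative, not Flume's class)
static boolean shouldRotate(long rollCount, long eventCounter,
                            long rollSize, long processSize) {
    boolean doRotate = false;
    if ((rollCount > 0) && (rollCount <= eventCounter)) {
        doRotate = true; // enough events written
    }
    if ((rollSize > 0) && (rollSize <= processSize)) {
        doRotate = true; // enough bytes written
    }
    return doRotate;
}

// with our config (rollCount = 0, rollSize = 1048576):
// shouldRotate(0, 5000, 1048576,  900000) -> false  (under 1 MB; event count is ignored)
// shouldRotate(0, 5000, 1048576, 1050000) -> true   (size threshold reached)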
Summary:
HDFSSink is one of the most commonly used sinks. Here we read through its code alongside a simple configuration so that, when problems come up later, we know where in the source to start looking.