
Flume HDFS Sink Code Walkthrough

Tags: Java, Big Data

Configuration

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.88.129:8020/test/
a1.sinks.k1.hdfs.filePrefix = logs
a1.sinks.k1.hdfs.inUsePrefix = .

a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 1048576
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 6000

a1.sinks.k1.hdfs.writeFormat = text
a1.sinks.k1.hdfs.fileType = DataStream
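
The lines above only configure the sink itself. A runnable agent also needs a source and a channel wired to k1; a minimal sketch (the netcat source and the channel sizes are illustrative additions, not part of the original setup; note that a memory channel's transactionCapacity should be at least as large as the sink's batchSize):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 6000

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1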

Code Analysis

For Flume's HDFS sink, start with the configure() method. It reads the properties defined in the configuration file, for example:
rollInterval = context.getLong("hdfs.rollInterval", defaultRollInterval);
rollSize = context.getLong("hdfs.rollSize", defaultRollSize);
rollCount = context.getLong("hdfs.rollCount", defaultRollCount);
batchSize = context.getLong("hdfs.batchSize", defaultBatchSize);

if (codecName == null) {
  codeC = null;
  compType = CompressionType.NONE;
} else {
  codeC = getCodec(codecName);
  // TODO : set proper compression type
  compType = CompressionType.BLOCK;
}
  
Generally we set rollInterval and rollCount to 0; otherwise the defaults of 30 seconds and 10 events will cause the file to be rolled. The official documentation explains:
hdfs.rollInterval  (default 30)  Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollCount     (default 10)  Number of events written to file before it rolled (0 = never roll based on number of events)
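
The default* constants referenced in configure() are plain fields on HDFSEventSink. Their values (as of Flume 1.x; worth re-checking against the version you actually run) line up with the documentation excerpt above:

// Defaults used by HDFSEventSink.configure() when a property is not set
// (Flume 1.x values; verify against your version)
private static final long defaultRollInterval = 30;   // seconds
private static final long defaultRollCount = 10;      // events
private static final long defaultRollSize = 1024;     // bytes
private static final long defaultBatchSize = 100;     // events per transaction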

Once configuration is done the sink is started; look at the start() method of the sink:
public void start() {
  String timeoutName = "hdfs-" + getName() + "-call-runner-%d";
  callTimeoutPool = Executors.newFixedThreadPool(threadsPoolSize,
          new ThreadFactoryBuilder().setNameFormat(timeoutName).build());

  // thread pool used for scheduled (time-based) file rolling
  String rollerName = "hdfs-" + getName() + "-roll-timer-%d";
  timedRollerPool = Executors.newScheduledThreadPool(rollTimerPoolSize,
          new ThreadFactoryBuilder().setNameFormat(rollerName).build());

  this.sfWriters = new WriterLinkedHashMap(maxOpenFiles);
  sinkCounter.start();
  super.start();
}
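
One detail worth noting here is sfWriters: it is a WriterLinkedHashMap bounded by hdfs.maxOpenFiles, an access-ordered map that closes and evicts the least recently used BucketWriter once too many files are open. A self-contained sketch of that eviction idea (using java.io.Closeable in place of BucketWriter so it compiles on its own; the real class also logs and handles InterruptedException):

import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the same LRU-eviction idea as HDFSEventSink.WriterLinkedHashMap.
class LruWriterCache extends LinkedHashMap<String, Closeable> {
  private final int maxOpenFiles;

  LruWriterCache(int maxOpenFiles) {
    super(16, 0.75f, true);              // accessOrder = true => least recently used first
    this.maxOpenFiles = maxOpenFiles;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Closeable> eldest) {
    if (size() > maxOpenFiles) {
      try {
        eldest.getValue().close();       // Flume closes the eldest BucketWriter here
      } catch (IOException e) {
        // the real code logs the failure and carries on
      }
      return true;                       // evict the eldest entry from the cache
    }
    return false;
  }
}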

The data is actually consumed in the process() method: events are taken from the channel, grouped into batches of up to batchSize, and written to an HDFS file.

Based on the resolved path and file name, a BucketWriter is looked up (or created); the actual writing to HDFS goes through this BucketWriter:

for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
        Event event = channel.take();
        if (event == null) {
          break;
        }

        // reconstruct the path name by substituting place holders
        String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
            timeZone, needRounding, roundUnit, roundValue, useLocalTime);
        String realName = BucketPath.escapeString(fileName, event.getHeaders(),
            timeZone, needRounding, roundUnit, roundValue, useLocalTime);

        String lookupPath = realPath + DIRECTORY_DELIMITER + realName;
        BucketWriter bucketWriter;
        HDFSWriter hdfsWriter = null;
        // Callback to remove the reference to the bucket writer from the
        // sfWriters map so that all buffers used by the HDFS file
        // handles are garbage collected.
        WriterCallback closeCallback = new WriterCallback() {
          @Override
          public void run(String bucketPath) {
            LOG.info("Writer callback called.");
            synchronized (sfWritersLock) {
              sfWriters.remove(bucketPath);
            }
          }
        };
        synchronized (sfWritersLock) {
          bucketWriter = sfWriters.get(lookupPath);
          // we haven't seen this file yet, so open it and cache the handle
          if (bucketWriter == null) {
            // get the concrete HDFSWriter implementation for the configured fileType
            hdfsWriter = writerFactory.getWriter(fileType);

            // build a BucketWriter for this path; the actual HDFS writes go through it
            bucketWriter = initializeBucketWriter(realPath, realName,
                lookupPath, hdfsWriter, closeCallback);
            sfWriters.put(lookupPath, bucketWriter);
          }
        }

        // track the buckets getting written in this transaction
        if (!writers.contains(bucketWriter)) {
          writers.add(bucketWriter);
        }

        // Write the data to HDFS
        try {
          bucketWriter.append(event);
        } catch (BucketClosedException ex) {
          LOG.info("Bucket was closed while trying to append, " +
                   "reinitializing bucket and writing event.");
          hdfsWriter = writerFactory.getWriter(fileType);
          bucketWriter = initializeBucketWriter(realPath, realName,
            lookupPath, hdfsWriter, closeCallback);
          synchronized (sfWritersLock) {
            sfWriters.put(lookupPath, bucketWriter);
          }
          bucketWriter.append(event);
        }
      }


      if (txnEventCount == 0) {
        sinkCounter.incrementBatchEmptyCount();
      } else if (txnEventCount == batchSize) {
        sinkCounter.incrementBatchCompleteCount();
      } else {
        sinkCounter.incrementBatchUnderflowCount();
      }

      // flush all pending buckets before committing the transaction
      for (BucketWriter bucketWriter : writers) {
        bucketWriter.flush();
      }
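
The BucketPath.escapeString() calls at the top of this loop are what resolve %-style escapes in hdfs.path and hdfs.filePrefix into concrete directory and file names, based on the event's timestamp header. A minimal sketch, assuming a hypothetical path that contains a date escape (the example configuration above uses a fixed path) and a timestamp header on the event:

import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.formatter.output.BucketPath;

public class EscapeDemo {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<>();
    // normally set by a timestamp interceptor, or derived locally when hdfs.useLocalTimeStamp = true
    headers.put("timestamp", String.valueOf(System.currentTimeMillis()));

    // same argument shape as the call in HDFSEventSink.process(); rounding is disabled here
    String realPath = BucketPath.escapeString(
        "hdfs://192.168.88.129:8020/test/%Y-%m-%d",  // hypothetical path with escapes
        headers,
        null,                 // timeZone: use the default
        false,                // needRounding
        Calendar.SECOND, 1,   // ignored because needRounding is false
        false);               // useLocalTime

    System.out.println(realPath);  // e.g. hdfs://192.168.88.129:8020/test/2024-05-01
  }
}

If the pattern contains a time escape but the timestamp header is missing (and useLocalTime is false), escaping fails, which is why such paths are usually paired with a timestamp interceptor or hdfs.useLocalTimeStamp = true.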

As you can see, the actual writing is delegated to the BucketWriter class. Note that when the BucketWriter is initialized it is handed an hdfsWriter. In the example above we configured hdfs.fileType = DataStream, so hdfsWriter = writerFactory.getWriter(fileType) returns an HDFSDataStream instance:
public HDFSWriter getWriter(String fileType) throws IOException {
    if (fileType.equalsIgnoreCase(SequenceFileType)) {
      return new HDFSSequenceFile();
    } else if (fileType.equalsIgnoreCase(DataStreamType)) {
      return new HDFSDataStream();
    } else if (fileType.equalsIgnoreCase(CompStreamType)) {
      return new HDFSCompressedDataStream();
    } else {
      throw new IOException("File type " + fileType + " not supported");
    }
  }
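
As an aside, the CompressedStream branch only makes sense together with a codec, which is what feeds the codecName/compType logic in configure() shown earlier; a hedged config example (gzip chosen purely for illustration):

a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip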

Now let's look at the BucketWriter class.

When append() is called, open() is invoked first, and that ultimately calls the open() method of the HDFSWriter:
public void open(String filePath) throws IOException {
    Configuration conf = new Configuration();
    Path dstPath = new Path(filePath);
    FileSystem hdfs = getDfs(conf, dstPath);
    doOpen(conf, dstPath, hdfs);
  }
Drilling into doOpen(), you will see that an output stream is eventually obtained:
outStream = hdfs.create(dstPath);

A serializer is then constructed. Flume's default serializer setting is TEXT, i.e. BodyTextEventSerializer is used to serialize the data:
    serializer = EventSerializerFactory.getInstance(
        serializerType, serializerContext, outStream);
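
The serializer type is configurable too. Besides the default text serializer, Flume ships header_and_text and avro_event (names per the built-in EventSerializerType values; a fully qualified builder class name can also be given). For example:

a1.sinks.k1.serializer = avro_event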

		
Back in BucketWriter's append() method, this is where the event is actually written:
     callWithTimeout(new CallRunner<Void>() {
        @Override
        public Void call() throws Exception {
          writer.append(event); // could block
          return null;
        }
      });
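
A side note on callWithTimeout(): it submits the CallRunner to the callTimeoutPool created in start() and waits at most hdfs.callTimeout milliseconds, so a hung HDFS call cannot block the sink forever. A simplified, self-contained sketch of that pattern (using a plain Callable; the real method also updates sink counters and cancels the future):

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: the submit-and-wait-with-timeout pattern used by BucketWriter.
class TimeoutCaller {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final long callTimeoutMs;

  TimeoutCaller(long callTimeoutMs) {
    this.callTimeoutMs = callTimeoutMs;
  }

  <T> T callWithTimeout(Callable<T> task) throws Exception {
    Future<T> future = pool.submit(task);
    try {
      return future.get(callTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);  // interrupt the stuck HDFS call
      throw new IOException("HDFS call timed out after " + callTimeoutMs + " ms", e);
    }
  }
}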
		
The writer.append(event) inside the CallRunner is the append() method of the HDFSWriter (here HDFSDataStream), and that in turn simply calls the serializer's write() method:
public void append(Event e) throws IOException {
  serializer.write(e);
}
From the earlier configuration we know the serializer implementation is BodyTextEventSerializer, so the event is ultimately written to HDFS by the following method:
public void write(Event e) throws IOException {
    out.write(e.getBody());
    if (appendNewline) {
      out.write('\n');
    }
  }		
As you can see, by default Flume appends a newline character after every event it writes; this can be turned off through configuration (see the example below).
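
A minimal example of that configuration, assuming the default text serializer (appendNewline is read by BodyTextEventSerializer's builder and defaults to true):

a1.sinks.k1.serializer = text
a1.sinks.k1.serializer.appendNewline = false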

On the rollCount and rollSize parameters:
Each time BucketWriter appends an event, it calls shouldRotate() to decide whether the file needs to be rolled.
    
    // check the number of events written to the current file
    if ((rollCount > 0) && (rollCount <= eventCounter)) {
      LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
      doRotate = true;
    }

    // compare against the configured size (a1.sinks.k1.hdfs.rollSize = 1048576);
    // if the bytes already written reach rollSize, the file must be rolled
    if ((rollSize > 0) && (rollSize <= processSize)) {
      LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
      doRotate = true;
    }
	
About processSize: it accumulates the body size of each event written to the file:
processSize += event.getBody().length;
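
Putting the two checks together, shouldRotate() is roughly the following (condensed from the snippets above; the time-based roll driven by hdfs.rollInterval is not checked here but scheduled on timedRollerPool when the file is opened, and newer versions also roll when the file's block is under-replicated):

private boolean shouldRotate() {
  boolean doRotate = false;

  if ((rollCount > 0) && (rollCount <= eventCounter)) {
    doRotate = true;   // enough events written to this file
  }

  if ((rollSize > 0) && (rollSize <= processSize)) {
    doRotate = true;   // enough bytes written to this file
  }

  return doRotate;
}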

Summary:

The HDFS sink is one of the most commonly used sinks. Walking through its code alongside a simple configuration like the one above means that when problems show up later, you know where to start digging.