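Memstore Flush Process
To reduce the impact of a flush on reads and writes, HBase uses an approach similar to two-phase commit and splits the whole flush into three phases:
prepare phase: Iterate over all Memstores in the current Region, take a snapshot of each Memstore's current data set (kvset), and then create a new kvset. All subsequent writes go to the new kvset. For the duration of the flush, reads first scan both kvset and snapshot, and go to the HFiles only if the data is not found there. The prepare phase takes updateLock to block write requests and releases it when the phase ends. Because this phase does nothing time-consuming, the lock is held only briefly.
flush phase: Iterate over all Memstores and persist the snapshot created in the prepare phase as a temporary file. All temporary files are placed under the .tmp directory. This phase involves disk I/O, so it is comparatively slow.
commit phase: Iterate over all Memstores, move the temporary files created in the flush phase into the corresponding ColumnFamily directory, create a storefile and Reader for each HFile, add the storefile to the HStore's storefiles list, and finally clear the snapshot created in the prepare phase.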
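The three phases can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not HBase code: SketchMemStore and all of its methods are hypothetical names, a "store file" is modeled as an in-memory map instead of an HFile, and a ReentrantReadWriteLock stands in for updateLock (writers share the read lock, prepare takes the write lock).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for a MemStore; not the HBase class.
class SketchMemStore {
    private final ReentrantReadWriteLock updateLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    private volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    // Each "store file" is an in-memory map here; in HBase it would be an HFile on HDFS.
    private final List<ConcurrentSkipListMap<String, String>> storeFiles = new ArrayList<>();

    void put(String key, String value) {
        updateLock.readLock().lock(); // writers run concurrently; prepare() excludes them
        try {
            kvset.put(key, value);
        } finally {
            updateLock.readLock().unlock();
        }
    }

    // Read path: kvset first, then snapshot, then the store files (newest first).
    String get(String key) {
        String v = kvset.get(key);
        if (v == null) v = snapshot.get(key);
        if (v == null) {
            synchronized (storeFiles) {
                for (int i = storeFiles.size() - 1; i >= 0 && v == null; i--) {
                    v = storeFiles.get(i).get(key);
                }
            }
        }
        return v;
    }

    // prepare: swap kvset for a fresh map while writes are blocked. No I/O, so the lock is short-lived.
    void prepare() {
        updateLock.writeLock().lock();
        try {
            snapshot = kvset;
            kvset = new ConcurrentSkipListMap<>();
        } finally {
            updateLock.writeLock().unlock();
        }
    }

    // flush: copy the snapshot into a "temporary file" (the slow, I/O-bound step in HBase).
    ConcurrentSkipListMap<String, String> flushToTmp() {
        return new ConcurrentSkipListMap<>(snapshot);
    }

    // commit: publish the file first, then drop the snapshot, so readers never miss data.
    void commit(ConcurrentSkipListMap<String, String> tmpFile) {
        synchronized (storeFiles) {
            storeFiles.add(tmpFile);
        }
        snapshot = new ConcurrentSkipListMap<>();
    }

    int kvsetSize() { return kvset.size(); }
    int snapshotSize() { return snapshot.size(); }
    int storeFileCount() { synchronized (storeFiles) { return storeFiles.size(); } }
}
```

Writes that arrive after prepare() land in the new kvset, so the slow flushToTmp() step never has to hold the lock.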
Log Analysis
/******* MemStoreFlusher initialization ********/
2018-07-06 09:39:24,880 INFO [regionserver/host/ip:16020] regionserver.MemStoreFlusher: globalMemStoreLimit=1.5 G, globalMemStoreLimitLowMark=1.5 G, maxHeap=3.8 G
/******* prepare phase ********/
2018-07-06 18:33:31,329 INFO [MemStoreFlusher.1] regionserver.HRegion: Started memstore flush for [table],,1528539945017.80ab9764ae70fa97b75057c376726653., current region memstore size 21.73 MB, and 1/1 column families' memstores are being flushed.
/******* flush phase ********/
2018-07-06 18:33:31,696 INFO [MemStoreFlusher.1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=40056, memsize=21.7 M, hasBloomFilter=true, into tmp file hdfs://ns/hbase/data/default/[table]/80ab9764ae70fa97b75057c376726653/.tmp/f71e7e8c15774da683bdecaf7cf6cb99
/******* commit phase ********/
2018-07-06 18:33:31,718 INFO [MemStoreFlusher.1] regionserver.HStore: Added hdfs://ns/hbase/data/default/[table]/80ab9764ae70fa97b75057c376726653/d/f71e7e8c15774da683bdecaf7cf6cb99, entries=119995, sequenceid=40056, filesize=7.3 M
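The timestamps in these log lines show where the time goes. The small helper below (FlushLogTiming is a hypothetical name, not part of HBase) parses the log timestamp format and computes the gaps: 367 ms between "Started memstore flush" and "Flushed ... into tmp file", and 22 ms between that line and "Added ...". This matches the description above, in which the flush phase is the expensive one.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the log excerpt; not part of HBase.
class FlushLogTiming {
    // Timestamp layout used by the log lines, e.g. 2018-07-06 18:33:31,329
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static long millisBetween(String from, String to) {
        LocalDateTime start = LocalDateTime.parse(from, FMT);
        LocalDateTime end = LocalDateTime.parse(to, FMT);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        String prepare = "2018-07-06 18:33:31,329"; // Started memstore flush
        String flushed = "2018-07-06 18:33:31,696"; // Flushed ... into tmp file
        String added = "2018-07-06 18:33:31,718";   // Added ... (commit)
        System.out.println("prepare + flush ms: " + millisBetween(prepare, flushed));
        System.out.println("commit ms: " + millisBetween(flushed, added));
    }
}
```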
Source Code Analysis
MemStoreFlusher Initialization
public MemStoreFlusher(final Configuration conf, final HRegionServer server) {
  ...
  // hbase.server.thread.wakefrequency: how often the flusher thread wakes up to check the MemStores
  this.threadWakeFrequency = conf.getLong(HConstants.THREAD_WAKE_FREQUENCY, 10 * 1000);
  // Get the maximum heap size of the JVM
  long max = -1L;
  final MemoryUsage usage = HeapMemorySizeUtil.safeGetHeapMemoryUsage();
  if (usage != null) {
    max = usage.getMax();
  }
  float globalMemStorePercent = HeapMemorySizeUtil.getGlobalMemStorePercent(conf, true);
  // Upper limit and lower limit (low water mark) for the total heap occupied by all MemStores
  this.globalMemStoreLimit = (long) (max * globalMemStorePercent);
  this.globalMemStoreLimitLowMarkPercent =
      HeapMemorySizeUtil.getGlobalMemStoreLowerMark(conf, globalMemStorePercent);
  this.globalMemStoreLimitLowMark =
      (long) (this.globalMemStoreLimit * this.globalMemStoreLimitLowMarkPercent);
  // Maximum time a flush may be delayed. Lowering it speeds up flushes, but compaction has to
  // keep up, otherwise the number of files keeps growing
  this.blockingWaitTime = conf.getInt("hbase.hstore.blockingWaitTime", 90000);
  // Number of flush threads; more threads put more load on HDFS
  int handlerCount = conf.getInt("hbase.hstore.flusher.count", 2);
  this.flushHandlers = new FlushHandler[handlerCount];
  LOG.info("globalMemStoreLimit="
      + TraditionalBinaryPrefix.long2String(this.globalMemStoreLimit, "", 1)
      + ", globalMemStoreLimitLowMark="
      + TraditionalBinaryPrefix.long2String(this.globalMemStoreLimitLowMark, "", 1)
      + ", maxHeap=" + TraditionalBinaryPrefix.long2String(max, "", 1));
}
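The two limits are plain arithmetic on the maximum heap size, which the following sketch reproduces. MemStoreLimits is a hypothetical class, and the fractions 0.4 and 0.95 in the usage example are assumed values for illustration only; the real ones come from the configuration. For reference, the log above reports maxHeap=3.8 G and globalMemStoreLimit=1.5 G, which would be consistent with a fraction of about 0.4.

```java
// Hypothetical helper mirroring the arithmetic in the constructor; not HBase code.
class MemStoreLimits {
    final long globalLimit; // upper limit for the sum of all MemStores, in bytes
    final long lowMark;     // low water mark, in bytes

    MemStoreLimits(long maxHeapBytes, float globalPercent, float lowerMarkPercent) {
        // Same casts as in the constructor: long * float is evaluated as float, then truncated.
        this.globalLimit = (long) (maxHeapBytes * globalPercent);
        // The low mark is a fraction of the upper limit, not of the heap.
        this.lowMark = (long) (this.globalLimit * lowerMarkPercent);
    }

    public static void main(String[] args) {
        long fourGiB = 4L * 1024 * 1024 * 1024;
        MemStoreLimits limits = new MemStoreLimits(fourGiB, 0.4f, 0.95f); // assumed fractions
        System.out.println("globalLimit=" + limits.globalLimit + " lowMark=" + limits.lowMark);
    }
}
```

Because the multiplication is done in float, the results are approximate for large heaps, which is harmless for a threshold of this kind.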
Starting the Flusher
public class HRegionServer {
  private void startServiceThreads() {
    ...
    this.cacheFlusher.start(uncaughtExceptionHandler);
  }
}

public class MemStoreFlusher {
  synchronized void start(UncaughtExceptionHandler eh) {
    ThreadFactory flusherThreadFactory = Threads.newDaemonThreadFactory(
        server.getServerName().toShortString() + "-MemStoreFlusher", eh);
    for (int i = 0; i < flushHandlers.length; i++) {
      flushHandlers[i] = new FlushHandler("MemStoreFlusher." + i);
      flusherThreadFactory.newThread(flushHandlers[i]);
      flushHandlers[i].start();
    }
  }
}
FlushHandler: Flushing with Multiple Threads
private final BlockingQueue<FlushQueueEntry> flushQueue =
    new DelayQueue<FlushQueueEntry>(); // unbounded BlockingQueue

private class FlushHandler extends HasThread {
  @Override
  public void run() {
    while (!server.isStopped()) {
      FlushQueueEntry fqe = null;
      try {
        wakeupPending.set(false); // allow someone to wake us up again
        // Take one flush request from the queue; if the queue is empty, block for up to
        // threadWakeFrequency milliseconds
        fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
        // Either no flush request arrived, or the request is a global (wake-up) flush request
        if (fqe == null || fqe instanceof WakeupFlushThread) {
          // Check whether the total size of all memstores exceeds the low water mark
          // (globalMemStoreLimitLowMark from the constructor; with the legacy setting
          // hbase.regionserver.global.memstore.lowerLimit this was max_heap * 0.35 by default)
          if (isAboveLowWaterMark()) {
            // Above the low water mark: flush the region with the largest memstore
            if (!flushOneForGlobalPressure()) {
              // No region could be flushed although we are above the lower limit.
              // This is unlikely, except perhaps during server shutdown, when another
              // thread is already flushing regions.
              // Just sleep briefly, then wake up any blocked threads so that they check again.
              // The update methods of HRegionServer make the updating thread wait for a flush
              // when the total memstore size exceeds the configured maximum
              Thread.sleep(1000);
              wakeUpIfBlocking();
            }
            // Enqueue another wake-up (global flush) request, i.e. a WakeupFlushThread entry
            wakeupFlushThread();
          }
          continue;
        }
        // A regular flush request:
        // the memstore of a single region exceeded hbase.hregion.memstore.flush.size
        // (default 128 MB), so flush that region
        FlushRegionEntry fre = (FlushRegionEntry) fqe;
        if (!flushRegion(fre)) {
          break;
        }
      } catch (Exception ex) {
        ...
      }
    }
    // The MemStoreFlusher thread is exiting, usually because the regionserver is stopping
    synchronized (regionsInQueue) {
      regionsInQueue.clear();
      flushQueue.clear();
    }
    // Notify the other threads
    wakeUpIfBlocking();
  }
}
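The loop relies on two properties of java.util.concurrent.DelayQueue: poll(timeout, unit) returns null when no entry becomes ready within the timeout, and an entry with a remaining delay stays invisible until the delay expires. The sketch below shows both with a hypothetical SketchEntry class; it only stands in for FlushQueueEntry and does not reproduce it.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical queue entry; it stands in for FlushQueueEntry and is not HBase code.
class SketchEntry implements Delayed {
    final String name;
    private final long readyAtNanos;

    SketchEntry(String name, long delayMillis) {
        this.name = name;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // A value <= 0 means the entry is ready to be taken.
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

class DelayQueueDemo {
    // Returns the name of the next ready entry, or null if nothing became ready in time.
    static String pollName(DelayQueue<SketchEntry> queue, long timeoutMillis) {
        try {
            SketchEntry e = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return e == null ? null : e.name;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        DelayQueue<SketchEntry> queue = new DelayQueue<>();
        queue.add(new SketchEntry("requeued", 500));
        queue.add(new SketchEntry("immediate", 0));
        System.out.println(pollName(queue, 50));   // the entry without delay
        System.out.println(pollName(queue, 10));   // the delayed entry is not ready yet
        System.out.println(pollName(queue, 5000)); // waits until the delay has expired
    }
}
```

A null result corresponds to the `fqe == null` branch in run(): the handler woke up without a request and uses the opportunity to check global memory pressure.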
Pick the Region with the Largest MemStore among All Regions and Flush It
private boolean flushOneForGlobalPressure() {
  // Get all online regions, sorted by memstore size
  SortedMap<Long, Region> regionsBySize = server.getCopyOfOnlineRegionsSortedBySize();
  Set<Region> excludedRegions = new HashSet<Region>();
  // Added for Replica in version 2.0
  // If the memstore of the largest replica region exceeds 4 times the memstore of the largest
  // primary region, proactively trigger a StoreFile Refresher run to update the file list
  // This reads hbase.region.replica.storefile.refresh.memstore.multiplier
  double secondaryMultiplier =
      ServerRegionReplicaUtil.getRegionReplicaStoreFileRefreshMultiplier(conf);
  boolean flushedOne = false;
  while (!flushedOne) {
    // regionsBySize is ordered by memstore size, largest first. Pick the region with the
    // largest memstore that satisfies the conditions below; if none does, the result is null
    // bestFlushableRegion:
    // 1. the region's writestate.flushing == false && writestate.writesEnabled == true
    // 2. every store in the region has fewer storefiles than hbase.hstore.blockingStoreFiles
    //    (default 7)
    Region bestFlushableRegion = getBiggestMemstoreRegion(regionsBySize, excludedRegions, true);
    // bestAnyRegion:
    // 1. the region's writestate.flushing == false && writestate.writesEnabled == true
    // The storefile count of the stores is not checked here.
    Region bestAnyRegion = getBiggestMemstoreRegion(regionsBySize, excludedRegions, false);
    // bestRegionReplica:
    // 1. the region's replicaId != 0
    Region bestRegionReplica = getBiggestMemstoreOfRegionReplica(regionsBySize, excludedRegions);
    // If neither bestAnyRegion nor bestRegionReplica was found, there is no region to flush
    if (bestAnyRegion == null && bestRegionReplica == null) {
      return false;
    }
    Region regionToFlush;
    if (bestFlushableRegion != null
        && bestAnyRegion.getMemstoreSize() > 2 * bestFlushableRegion.getMemstoreSize()) {
      // Choose the region that most needs a flush.
      // If the memstore of bestAnyRegion (the region with the largest memstore overall) is more
      // than twice the memstore of bestFlushableRegion (the largest one whose storefile count
      // is below the limit), flush bestAnyRegion first. This avoids making many small flushes
      // under memory pressure, which would in turn cause many compactions
      // Original code comment:
      // Even if it's not supposed to be flushed, pick a region if it's more than twice as big
      // as the best flushable one - otherwise when we're under pressure we make lots of little
      // flushes and cause lots of compactions, etc, which just makes life worse!
      regionToFlush = bestAnyRegion;
    } else {
      if (bestFlushableRegion == null) {
        // No candidate region has a storefile count below the configured limit, i.e. every
        // region has a store with too many storefiles: flush the region with the largest memstore
        regionToFlush = bestAnyRegion;
      } else {
        /**
         * Some region still has stores below the configured maximum storefile count: flush it first.
         * This avoids the case where a few hot regions are flushed and compacted too often
         * while regions with colder write traffic are never flushed
         */
        regionToFlush = bestFlushableRegion;
      }
    }
    ...
    if (regionToFlush == null
        || (bestRegionReplica != null
            && ServerRegionReplicaUtil.isRegionReplicaStoreFileRefreshEnabled(conf)
            && (bestRegionReplica.getMemstoreSize()
                > secondaryMultiplier * regionToFlush.getMemstoreSize()))) {
      /**
       * Logic for enabled region replicas:
       * a region replica exists and its size is more than n times the memstore of the
       * largest primary region, so trigger a StoreFile Refresher run to update the file list
       *
       * See the discussion of write blocking caused by oversized replica memstores
       */
      flushedOne = refreshStoreFilesAndReclaimMemory(bestRegionReplica);
      if (!flushedOne) { // always false
        excludedRegions.add(bestRegionReplica);
      }
    } else {
      /**
       * Run the flush with the global-flush flag set to true.
       * If the flush fails, add the region to excludedRegions so that this round skips it
       * and moves on to the region with the next-largest memstore
       */
      flushedOne = flushRegion(regionToFlush, true, true);
      if (!flushedOne) {
        excludedRegions.add(regionToFlush);
      }
    }
  }
  return true;
}
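The choice between bestAnyRegion and bestFlushableRegion reduces to one comparison. The sketch below isolates it; Candidate and FlushChooser are hypothetical names, sizes are plain byte counts, and the replica branch is left out on purpose.

```java
// Hypothetical types for illustration; they only model the selection rule.
class Candidate {
    final String name;
    final long memstoreSize;

    Candidate(String name, long memstoreSize) {
        this.name = name;
        this.memstoreSize = memstoreSize;
    }
}

class FlushChooser {
    // bestFlushable: largest region whose stores are all below the storefile limit (may be null).
    // bestAny: largest region regardless of storefile count (may be null).
    static Candidate choose(Candidate bestFlushable, Candidate bestAny) {
        if (bestAny == null) {
            return null; // nothing to flush among primary regions
        }
        if (bestFlushable == null) {
            return bestAny; // every region has too many storefiles: take the largest anyway
        }
        if (bestAny.memstoreSize > 2 * bestFlushable.memstoreSize) {
            return bestAny; // more than twice as big: worth flushing despite its storefile count
        }
        return bestFlushable; // otherwise prefer the region that will not add compaction pressure
    }

    public static void main(String[] args) {
        Candidate hot = new Candidate("hot-region", 100L << 20);
        Candidate calm = new Candidate("calm-region", 40L << 20);
        System.out.println(choose(calm, hot).name);
    }
}
```

With a 100 MB region that has too many storefiles and a 40 MB flushable region, the larger one wins because 100 > 2 * 40. If the flushable region held 60 MB instead, it would be chosen.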
flush region
Persisting Region data to disk
private boolean flushRegion(final FlushRegionEntry fqe) {
  Region region = fqe.region;
  if (!region.getRegionInfo().isMetaRegion() && isTooManyStoreFiles(region)) {
    if (fqe.isMaximumWait(this.blockingWaitTime)) {
      LOG.info("Waited " + (EnvironmentEdgeManager.currentTime() - fqe.createTime)
          + "ms on a compaction to clean up 'too many store files'; waited "
          + "long enough... proceeding with flush of "
          + region.getRegionInfo().getRegionNameAsString());
    } else {
      // If this is first time we've been put off, then emit a log message.
      if (fqe.getRequeueCount() <= 0) {
        // Note: We don't impose blockingStoreFiles constraint on meta regions
        LOG.warn("Region " + region.getRegionInfo().getRegionNameAsString()
            + " has too many " + "store files; delaying flush up to "
            + this.blockingWaitTime + "ms");
        if (!this.server.compactSplitThread.requestSplit(region)) {
          try {
            this.server.compactSplitThread.requestSystemCompaction(
                region, Thread.currentThread().getName());
          } catch (IOException e) {
            LOG.error("Cache flush failed for region "
                + Bytes.toStringBinary(region.getRegionInfo().getRegionName()),
                RemoteExceptionHandler.checkIOException(e));
          }
        }
      }
      // Put back on the queue. Have it come back out of the queue
      // after a delay of this.blockingWaitTime / 100 ms.
      this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
      // Tell a lie, it's not flushed but it's ok
      return true;
    }
  }
  return flushRegion(region, false, fqe.isForceFlushAllStores());
}
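The delay logic in flushRegion can be condensed into a single decision: flush now, or requeue with a delay of blockingWaitTime / 100. The sketch below is a simplified model with hypothetical names (FlushDelayPolicy, requeueDelayMillis). It assumes that isMaximumWait means "the entry has waited longer than blockingWaitTime since it was created", which the log message in the code suggests but the excerpt does not show.

```java
// Hypothetical model of the requeue decision; not the HBase implementation.
class FlushDelayPolicy {
    // Returns 0 if the flush should run now, otherwise the delay before the entry is retried.
    static long requeueDelayMillis(boolean tooManyStoreFiles, long waitedMillis,
                                   long blockingWaitTime) {
        if (!tooManyStoreFiles) {
            return 0L; // nothing holds the flush back
        }
        if (waitedMillis > blockingWaitTime) {
            return 0L; // assumed meaning of isMaximumWait: waited long enough, flush anyway
        }
        return blockingWaitTime / 100; // try again shortly; a compaction may have caught up
    }

    public static void main(String[] args) {
        long blockingWaitTime = 90000L; // default of hbase.hstore.blockingWaitTime in the constructor
        System.out.println(requeueDelayMillis(true, 0L, blockingWaitTime));
        System.out.println(requeueDelayMillis(true, 90001L, blockingWaitTime));
    }
}
```

With the default of 90000 ms, a region with too many storefiles is retried every 900 ms and is flushed regardless once it has waited for more than 90 seconds.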
refreshStoreFilesAndReclaimMemory
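Logic used when region replicas are enabled
If a RegionReplica exists and its size is more than n times the memstore of the largest primary region, a StoreFile Refresher run is triggered to update the file list.
See the discussion of write blocking caused by oversized replica memstores.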
private boolean refreshStoreFilesAndReclaimMemory(Region region) {
  try {
    return region.refreshStoreFiles();
  } catch (IOException e) {
    LOG.warn("Refreshing store files failed with exception", e);
  }
  return false;
}
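Write Blocking Caused by Oversized Replica Memstores
The memstore of a replica region is never flushed on the replica's own initiative; the replica flushes only when it receives a flush operation from the primary region.
A RegionServer may host region replicas alongside primary regions of other regions.
These replicas can keep holding memory without releasing it, either because of replication lag (the flush marker has not arrived) or because the primary region has not flushed at all.
As a result, total memory usage can exceed the water mark and block regular writes.
To prevent this, HBase provides the parameter hbase.region.replica.storefile.refresh.memstore.multiplier, whose default value is 4.
It means that if the memstore of the largest replica region exceeds 4 times the memstore of the largest primary region, a StoreFile Refresher run is triggered proactively to update the file list. If a flush did in fact happen on the primary, the data held in the replica's memory can then be released.
However, this only solves the case where the missing flush is caused by replication lag. If the replica's primary region really has not flushed, the memory still cannot be released and writes remain blocked.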
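The multiplier check can be written as a single predicate. The sketch below uses hypothetical names (ReplicaRefreshCheck, shouldRefreshReplica) and leaves out the case, present in flushOneForGlobalPressure, where no primary region is available to flush at all.

```java
// Hypothetical predicate modeling the multiplier rule; not HBase code.
class ReplicaRefreshCheck {
    static boolean shouldRefreshReplica(long replicaMemstoreSize, long primaryMemstoreSize,
                                        double multiplier, boolean refreshEnabled) {
        // Refresh the replica's store files only when it clearly dominates memory usage.
        return refreshEnabled && replicaMemstoreSize > multiplier * primaryMemstoreSize;
    }

    public static void main(String[] args) {
        double multiplier = 4.0; // default of hbase.region.replica.storefile.refresh.memstore.multiplier
        System.out.println(shouldRefreshReplica(500L << 20, 100L << 20, multiplier, true));
        System.out.println(shouldRefreshReplica(300L << 20, 100L << 20, multiplier, true));
    }
}
```

A 500 MB replica memstore next to a 100 MB primary memstore triggers the refresh; a 300 MB one does not, and the primary region is flushed instead.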
Author: Alex90
Link: https://www.jianshu.com/p/c1fee434caa3