前面我们分析了Spark中具体的Task的提交和运行过程,从本文开始我们开始进入Shuffle的世界,Shuffle对于分布式计算来说是至关重要的部分,它直接影响了分布式系统的性能,所以我将尽可能进行详细的分析。
我们首先来看Shuffle中的Write部分:
override def runTask(context: TaskContext): MapStatus = { // Deserialize the RDD using the broadcast variable.
val deserializeStartTime = System.currentTimeMillis() val ser = SparkEnv.get.closureSerializer.newInstance() val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
_executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
metrics = Some(context.taskMetrics) var writer: ShuffleWriter[Any, Any] = null
try { val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
writer.stop(success = true).get
} catch { case e: Exception => try { if (writer != null) {
writer.stop(success = false)
}
} catch { case e: Exception =>
log.debug("Could not stop writer", e)
} throw e
}
}首先根据SparkEnv获得ShuffleManager,ShuffleManager是为Spark shuffle系统而抽象的可插拔的接口,它被创建在Driver和Executor上,具体是在SparkEnv实例化的时候进行配置的,源码如下:
val shortShuffleMgrNames = Map( "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager", "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager", "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)可以看到是由"spark.shuffle.manager"配置项来决定具体使用哪种实现方式,默认情况下使用的是sort的方式(本文参考的是Spark 1.6.3版本的源码)。Driver通过ShuffleManager来注册shuffles,Executors可以通过它来读写数据。
获得到ShuffleManager后,就根据它来获得ShuffleWriter(根据具体ShuffleManager的getWriter方法获得),顾名思义就是用来写数据,而接下来的工作就是调用具体的ShuffleWriter的write方法来进行写数据的工作。
先来看一下getWriter方法,这里的第一个参数dep.shuffleHandle是ShuffleDependency的一个成员变量:
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle( shuffleId, _rdd.partitions.size, this)
这里的registerShuffle方法用来向ShuffleManager注册一个shuffle并且获得一个用来传递任务的句柄,会根据不同ShuffleManager有不同的实现,HashShuffleManager返回的是BaseShuffleHandle,而SortShuffleManager又会根据不同的情况返回BypassMergeSortShuffleHandle、SerializedShuffleHandle或者BaseShuffleHandle。
HashShuffleManager:
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = { new BaseShuffleHandle(shuffleId, numMaps, dependency)
}SortShuffleManager:
override def registerShuffle[K, V, C](
shuffleId: Int,
numMaps: Int,
dependency: ShuffleDependency[K, V, C]): ShuffleHandle = { if (SortShuffleWriter.shouldBypassMergeSort(SparkEnv.get.conf, dependency)) { // 这里注释说的很清楚,根据spark.shuffle.sort.bypassMergeThreshold的值(默认是200)判断是否需要进行Map端的聚合操作
// 如果partitions的个数小于200就不进行该操作
// If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
// need map-side aggregation, then write numPartitions files directly and just concatenate
// them at the end. This avoids doing serialization and deserialization twice to merge
// together the spilled files, which would happen with the normal code path. The downside is
// having multiple files open at a time and thus more memory allocated to buffers.
new BypassMergeSortShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]]) // 这里是判断是否使用tungsten的方式
} else if (SortShuffleManager.canUseSerializedShuffle(dependency)) { // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
new SerializedShuffleHandle[K, V](
shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
} else { // 如果不是上述两种方式,就使用默认的方式
// Otherwise, buffer map outputs in a deserialized form:
new BaseShuffleHandle(shuffleId, numMaps, dependency)
}
}使用一张图来总结一下上面的过程:
然后我们来看ShuffleManager的getWriter方法:
HashShuffleManager:
/** Get a writer for a given partition. Called on executors by map tasks. */override def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext)
: ShuffleWriter[K, V] = { new HashShuffleWriter(
shuffleBlockResolver, handle.asInstanceOf[BaseShuffleHandle[K, V, _]], mapId, context)
}SortShuffleManager:
/** Get a writer for a given partition. Called on executors by map tasks. */override def getWriter[K, V](
handle: ShuffleHandle,
mapId: Int,
context: TaskContext): ShuffleWriter[K, V] = {
numMapsForShuffle.putIfAbsent(
handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps) val env = SparkEnv.get
handle match { case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] => new UnsafeShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
context.taskMemoryManager(),
unsafeShuffleHandle,
mapId,
context,
env.conf) case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] => new BypassMergeSortShuffleWriter(
env.blockManager,
shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
bypassMergeSortHandle,
mapId,
context,
env.conf) case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] => new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
}
}从上述源码可以看出,getWriter方法内部实际上是根据传进来的ShuffleHandle的具体类型来判断使用哪种ShuffleWriter的,然后最终执行ShuffleWriter的write方法,下面我们就分为HashShuffleManager和SortShuffleManager两种类型来进行分析。
1、HashShuffleManager
从上面的源码中可以看到HashShuffleManager最终实例化的是HashShuffleWriter,实例化的时候有一行比较重要的代码:
private val shuffle = shuffleBlockResolver.forMapTask(dep.shuffleId, mapId, numOutputSplits, ser, writeMetrics)
我们来看forMapTask这个方法:
def forMapTask(shuffleId: Int, mapId: Int, numReducers: Int, serializer: Serializer,
writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = { // 为每个ShuffleMapTask实例化了一个ShuffleWriterGroup
new ShuffleWriterGroup { // 实例化ShuffleState并保存shuffleId和ShuffleState的对应关系
shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numReducers)) // 根据shuffleId获得对应的ShuffleState
private val shuffleState = shuffleStates(shuffleId) val openStartTime = System.nanoTime val serializerInstance = serializer.newInstance() // 获得该ShuffleWriterGroup的writers
val writers: Array[DiskBlockObjectWriter] = { Array.tabulate[DiskBlockObjectWriter](numReducers) { bucketId => // 生成ShuffleBlockId,是一个case class,我们可以通过name方法看到其具体的组成:
// override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
val blockId = ShuffleBlockId(shuffleId, mapId, bucketId) // 通过DiskBlockManager的getFile方法获得File
val blockFile = blockManager.diskBlockManager.getFile(blockId) // 临时目录
val tmp = Utils.tempFileWith(blockFile) // 使用BlockManager的getDiskWriter方法获得DiskBlockObjectWriter
// 注意这里的bufferSize默认情况下是32kb,可以通过spark.shuffle.file.buffer进行配置
// private val bufferSize = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
}
} // Creating the file to write to and creating a disk writer both involve interacting with
// the disk, so should be included in the shuffle write time.
writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime) override def releaseWriters(success: Boolean) {
shuffleState.completedMapTasks.add(mapId)
}
}
}为每个ShuffleMapTask实例化一个ShuffleWriterGroup,其中包含了一组writers,每个writer对应一个reducer。
然后我们进入到HashShuffleWriter的write方法:
/** Write a bunch of records to this task's output */override def write(records: Iterator[Product2[K, V]]): Unit = { val iter = if (dep.aggregator.isDefined) { // 判断是否进行map端的combine操作
if (dep.mapSideCombine) {
dep.aggregator.get.combineValuesByKey(records, context)
} else {
records
}
} else {
require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
records
} for (elem <- iter) { val bucketId = dep.partitioner.getPartition(elem._1)
shuffle.writers(bucketId).write(elem._1, elem._2)
}
}如果没有进行Map端的combine操作,根据key获得bucketId,实际上是进行取模运算:
def nonNegativeMod(x: Int, mod: Int): Int = { val rawMod = x % mod
rawMod + (if (rawMod < 0) mod else 0)
}下面就是根据bucketId获得ShuffleWriterGroup中对应的writer,然后执行其write方法,将key和value写入到对应的block file中:
/**
* Writes a key-value pair.
*/def write(key: Any, value: Any) { if (!initialized) {
open()
}
objOut.writeKey(key)
objOut.writeValue(value)
recordWritten()
}写入完成后,我们回到ShuffleMapTask的runTask方法中,接下来执行的是:
writer.stop(success = true).get
即HashShuffleWriter的stop方法,该方法返回的是Option[MapStatus],最主要的一句代码为:
Some(commitWritesAndBuildStatus())
进入到commitWritesAndBuildStatus方法:
private def commitWritesAndBuildStatus(): MapStatus = { // Commit the writes. Get the size of each bucket block (total block size).
val sizes: Array[Long] = shuffle.writers.map { writer: DiskBlockObjectWriter => // 调用DiskBlockObjectWriter的commitAndClose方法
writer.commitAndClose() // 获得每个bucket block的大小
writer.fileSegment().length
} // 重命名所有的shuffle文件,每个executor只有一个ShuffleBlockResolver,所以使用了synchronized关键字
// rename all shuffle files to final paths
// Note: there is only one ShuffleBlockResolver in executor
shuffleBlockResolver.synchronized {
shuffle.writers.zipWithIndex.foreach { case (writer, i) => val output = blockManager.diskBlockManager.getFile(writer.blockId) if (sizes(i) > 0) { if (output.exists()) { // Use length of existing file and delete our own temporary one
sizes(i) = output.length()
writer.file.delete()
} else { // Commit by renaming our temporary file to something the fetcher expects
if (!writer.file.renameTo(output)) { throw new IOException(s"fail to rename ${writer.file} to $output")
}
}
} else { if (output.exists()) {
output.delete()
}
}
}
} MapStatus(blockManager.shuffleServerId, sizes)
}这里我们需要注意最终返回的是封装的MapStatus,它记录了产生的磁盘文件的位置,然后Executor中的MapOutputTrackerWorker将MapStatus信息发送给Driver中的MapOutputTrackerMaster,后面Shuffle Read的之后就会从Driver的MapOutputTrackerMaster获取MapStatus的信息,也就是获取对应的上一个ShuffleMapTask的计算结果的输出的文件位置信息。
Map端combine的情况
再来补充一下Map端combine的情况:
if (dep.mapSideCombine) {
dep.aggregator.get.combineValuesByKey(records, context)
} else {
records
}进入到Aggregator的combineValuesByKey方法:
def combineValuesByKey(
iter: Iterator[_ <: Product2[K, V]],
context: TaskContext): Iterator[(K, C)] = { // 首先实例化ExternalAppendOnlyMap
val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners) // 执行ExternalAppendOnlyMap的insertAll方法
combiners.insertAll(iter)
updateMetrics(context, combiners)
combiners.iterator
}首先实例化了ExternalAppendOnlyMap,然后执行ExternalAppendOnlyMap的insertAll方法:
def insertAll(entries: Iterator[Product2[K, V]]): Unit = { if (currentMap == null) { throw new IllegalStateException( "Cannot insert new elements into a map after calling iterator")
} // An update function for the map that we reuse across entries to avoid allocating
// a new closure each time
var curEntry: Product2[K, V] = null
val update: (Boolean, C) => C = (hadVal, oldVal) => { if (hadVal) mergeValue(oldVal, curEntry._2) else createCombiner(curEntry._2)
} while (entries.hasNext) {
curEntry = entries.next() val estimatedSize = currentMap.estimateSize() if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
} if (maybeSpill(currentMap, estimatedSize)) {
currentMap = new SizeTrackingAppendOnlyMap[K, C]
}
currentMap.changeValue(curEntry._1, update)
addElementsRead()
}
}具体的实现方式就不再解释了,简单的说就是将key相同的value进行合并,如果某个key有对应的值就执行merge(也可以理解为更新)操作,如果没有对应的值就新建一个combiner,需要注意的是如果内存不够的话就会将数据spill到磁盘。
HashShuffle方式的Shuffle Write部分至此结束,使用一张图概括一下:
接下来看一下SortShuffle方式的具体流程。
2、SortShuffleManager
为了解决Hash Shuffle产生小文件过多的问题,产生了Sort Shuffle,解下来我们就一起看一下Sort Shuffle的Write部分。
上文中我们已经提到,SortShuffleManager中的getWriter会根据不同的ShuffleHandle产生相应的ShuffleWriter:
SerializedShuffleHandle 对应 UnsafeShuffleWriter
BypassMergeSortShuffleHandle 对应 BypassMergeSortShuffleWriter
BaseShuffleHandle 对应 SortShuffleWriter
下面我们分别进行分析:
2.1、BaseShuffleHandle & SortShuffleWriter
首先来看一下SortShuffleWriter,直接来看它的write方法:
/** Write a bunch of records to this task's output */override def write(records: Iterator[Product2[K, V]]): Unit = {
sorter = if (dep.mapSideCombine) {
require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!") new ExternalSorter[K, V, C](
context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
} else { // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
// care whether the keys get sorted in each partition; that will be done on the reduce side
// if the operation being run is sortByKey.
new ExternalSorter[K, V, V](
context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
}
sorter.insertAll(records) // Don't bother including the time to open the merged output file in the shuffle write time,
// because it just opens a single file, so is typically too fast to measure accurately
// (see SPARK-3570).
val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId) val tmp = Utils.tempFileWith(output) try { val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID) val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally { if (tmp.exists() && !tmp.delete()) {
logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
}
}
}我们看到,内部有一个非常重要的部分,即ExternalSorter,而关于ExternalSorter的使用,源码中的注释说的很清楚了,这里就不做翻译了:
/** * Users interact with this class in the following way: * * 1. Instantiate an ExternalSorter. * * 2. Call insertAll() with a set of records. * * 3. Request an iterator() back to traverse sorted/aggregated records. * - or - * Invoke writePartitionedFile() to create a file containing sorted/aggregated outputs * that can be used in Spark's sort shuffle. */
我们就根据这三个步骤进行说明:
2.1.1 第一步
首先就是实例化ExternalSorter,这里有一个判断,如果要进行map端的combine操作的话就需要指定Aggregator和Ordering,否则这两个参数为None。我们熟悉的reduceByKey就进行了Map端的combine操作,如下图所示:
2.1.2 第二步(这一步非常重要)
通过判断是否进行Map端combine操作而实例化出不同的ExternalSorter后,就会调用insertAll方法,将输入的记录写入到内存中,如果内存不足就spill到磁盘中,具体的实现我们来看insertAll方法:
def insertAll(records: Iterator[Product2[K, V]]): Unit = { // TODO: stop combining if we find that the reduction factor isn't high
// 首先判断是否需要进行Map端的combine操作
val shouldCombine = aggregator.isDefined if (shouldCombine) { // 如果需要进行map端的combine操作,使用PartitionedAppendOnlyMap作为缓存
// 将record根据key对value按照获得的聚合函数进行聚合操作(combine)
// Combine values in-memory first using our AppendOnlyMap
// 获得聚合函数,例如我们使用reduceByKey时编写的函数
val mergeValue = aggregator.get.mergeValue // 获取createCombiner函数
val createCombiner = aggregator.get.createCombiner var kv: Product2[K, V] = null
// 定义update函数,主要的逻辑是:如果某个key已经存在记录(record)就使用上面获取
// 的聚合函数进行聚合操作,如果还不存在记录就使用createCombiner方法进行初始化操作
val update = (hadValue: Boolean, oldValue: C) => { if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
} // 循环遍历所有的records(记录)
while (records.hasNext) { // 记录spill的频率,每当read一条record的时候都会记录一次
addElementsRead() // 使用kv储存当前读的record
kv = records.next() // 这里的map和下面else中的buffer都是用来缓存的数据结构
// 如果进行Map端的聚合操作,使用的就是PartitionedAppendOnlyMap[K, C]
// 如果不进行Map端的聚合操作,使用的是PartitionedPairBuffer[K, C]
// 调用上面定义的update函数将记录插入到map中
map.changeValue((getPartition(kv._1), kv._1), update) // 判断是否要进行spill操作
maybeSpillCollection(usingMap = true)
}
} else { // 如果不需要进行Map端的聚合操作,就直接将记录放到buffer(PartitionedPairBuffer)中
// Stick values into our buffer
while (records.hasNext) {
addElementsRead() val kv = records.next()
buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
maybeSpillCollection(usingMap = false)
}
}
}具体的流程用注释的方式写在了上面的源码中,这里我们先来看一下PartitionedAppendOnlyMap和PartitionedPairBuffer分别是如何工作的:
PartitionedAppendOnlyMap:
首先来看PartitionedAppendOnlyMap的changeValue实现,实际上,PartitionedAppendOnlyMap是继承自SizeTrackingAppendOnlyMap,而SizeTrackingAppendOnlyMap又继承自AppendOnlyMap,这里调用的changeValue方法实际上是SizeTrackingAppendOnlyMap的changeValue方法:
override def changeValue(key: K, updateFunc: (Boolean, V) => V): V = { // 首先调用父类的changeValue方法
val newValue = super.changeValue(key, updateFunc) // 然后调用SizeTracker接口的afterUpdate方法
super.afterUpdate() // 返回newValue
newValue
}父类(AppendOnlyMap)的changeValue方法:
def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
assert(!destroyed, destructionMessage) val k = key.asInstanceOf[AnyRef] // key为空时候的处理,增加长度
if (k.eq(null)) { if (!haveNullValue) {
incrementSize()
}
nullValue = updateFunc(haveNullValue, nullValue)
haveNullValue = true
return nullValue
} var pos = rehash(k.hashCode) & mask var i = 1
while (true) { // 这里的data是一个数组,用来同时存储key和value:key0, value0, key1, value1, key2, value2, etc.
// 即2 * pos上存储的是key的值,2 * pos + 1上存储的是value的值
val curKey = data(2 * pos) // 如果key已经存在,就调用updateFunc方法更新value
if (k.eq(curKey) || k.equals(curKey)) { val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
data(2 * pos + 1) = newValue.asInstanceOf[AnyRef] return newValue
} else if (curKey.eq(null)) { // 如果key不存在就将该key和对应的value添加到data这个数组中
val newValue = updateFunc(false, null.asInstanceOf[V])
data(2 * pos) = k
data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
incrementSize() return newValue
} else { // 否则继续计算位置(pos)
val delta = i
pos = (pos + delta) & mask
i += 1
}
} null.asInstanceOf[V] // Never reached but needed to keep compiler happy}然后是SizeTracker接口的afterUpdate方法
protected def afterUpdate(): Unit = {
numUpdates += 1
if (nextSampleNum == numUpdates) {
takeSample()
}
}更新数据的更新次数,如果更新的次数达到nextSampleNum,就执行采样操作,主要用来评估内存的使用情况。
PartitionedPairBuffer:
再来看PartitionedPairBuffer的insert方法,也就是不进行Map端combine操作的情况:
def insert(partition: Int, key: K, value: V): Unit = { // 如果当前的大小达到了capacity的值就需要扩大该数组
if (curSize == capacity) {
growArray()
} // 存储key,这里存储的是(partition Id, key)的格式
data(2 * curSize) = (partition, key.asInstanceOf[AnyRef]) // 存储value
data(2 * curSize + 1) = value.asInstanceOf[AnyRef]
curSize += 1
// 参考上面PartitionedAppendOnlyMap的部分
afterUpdate()
}直接将数据存储到buffer中。
执行完上面的更新数据操作后,就要判断是否要将数据spill到磁盘,即maybeSpillCollection方法:
private def maybeSpillCollection(usingMap: Boolean): Unit = { var estimatedSize = 0L // 这里需要判断使用的是map(PartitionedAppendOnlyMap)还是buffer(PartitionedPairBuffer)
// 如果true就是map,false就是buffer
if (usingMap) { // 估计当前map的内存占用大小
estimatedSize = map.estimateSize() // 如果超过内存的限制,就将缓存中的数据spill到磁盘
if (maybeSpill(map, estimatedSize)) { // spill到磁盘后,重置缓存
map = new PartitionedAppendOnlyMap[K, C]
}
} else { // 不进行Map端聚合操作的情况
estimatedSize = buffer.estimateSize() if (maybeSpill(buffer, estimatedSize)) {
buffer = new PartitionedPairBuffer[K, C]
}
} if (estimatedSize > _peakMemoryUsedBytes) {
_peakMemoryUsedBytes = estimatedSize
}
}上面代码的主要作用就是估计当前缓存(map或者buffer)使用内存的大小,如果超过了内存使用的限制,就要将缓存中的数据spill到磁盘中,同时重置当前的缓存。
下面就来看一下maybeSpill方法:
// 如果成功spill到磁盘就返回true,否则返回falseprotected def maybeSpill(collection: C, currentMemory: Long): Boolean = { var shouldSpill = false
// 在进行真正的spill操作之前向TaskMemoryManager申请再多分配一些内存
if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) { // Claim up to double our current memory from the shuffle memory pool
val amountToRequest = 2 * currentMemory - myMemoryThreshold val granted =
taskMemoryManager.acquireExecutionMemory(amountToRequest, MemoryMode.ON_HEAP, null)
myMemoryThreshold += granted // If we were granted too little memory to grow further (either tryToAcquire returned 0,
// or we already had more memory than myMemoryThreshold), spill the current collection
// 如果内存仍然不够用,就认定为需要spill到磁盘
shouldSpill = currentMemory >= myMemoryThreshold
} // 如果内存中元素的个数超过了强制spill的上限也会认定为需要进行spill操作
shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold // Actually spill
// 接下来就真正将数据spill到磁盘
if (shouldSpill) {
_spillCount += 1
logSpillage(currentMemory) // spill操作
spill(collection)
_elementsRead = 0
_memoryBytesSpilled += currentMemory
releaseMemory()
}
shouldSpill
}在进行真正的spill操作之前会向TaskMemoryManager申请再多分配一些内存,如果还不能够满足,或者不能分配更多的内存,或者内存中元素的个数超过了强制spill的上限,最终就会执行spill操作,接下来进入spill方法:
// 这里的collection就是指的map或者bufferoverride protected[this] def spill(collection: WritablePartitionedPairCollection[K, C]): Unit = { // Because these files may be read during shuffle, their compression must be controlled by
// spark.shuffle.compress instead of spark.shuffle.spill.compress, so we need to use
// createTempShuffleBlock here; see SPARK-3426 for more context.
// 获取临时的BlockId(TempShuffleBlockId)及对应的File
val (blockId, file) = diskBlockManager.createTempShuffleBlock() // These variables are reset after each flush
var objectsWritten: Long = 0
var spillMetrics: ShuffleWriteMetrics = null
var writer: DiskBlockObjectWriter = null
def openWriter(): Unit = {
assert (writer == null && spillMetrics == null)
spillMetrics = new ShuffleWriteMetrics
writer = blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, spillMetrics)
} // 获得DiskWriter(DiskBlockObjectWriter)
openWriter() // List of batch sizes (bytes) in the order they are written to disk
// 用来储存每个batch对应的size
val batchSizes = new ArrayBuffer[Long] // How many elements we have in each partition
// 用来储存每个partition有多少元素
val elementsPerPartition = new Array[Long](numPartitions) // Flush the disk writer's contents to disk, and update relevant variables.
// The writer is closed at the end of this process, and cannot be reused.
def flush(): Unit = { val w = writer
writer = null
w.commitAndClose()
_diskBytesSpilled += spillMetrics.shuffleBytesWritten
batchSizes.append(spillMetrics.shuffleBytesWritten)
spillMetrics = null
objectsWritten = 0
} var success = false
try { // 排序部分的操作,返回迭代器
val it = collection.destructiveSortedWritablePartitionedIterator(comparator) // 循环的到的迭代器,执行write操作
while (it.hasNext) { val partitionId = it.nextPartition()
require(partitionId >= 0 && partitionId < numPartitions, s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
it.writeNext(writer)
elementsPerPartition(partitionId) += 1
objectsWritten += 1
// 如果写的对象达到serializerBatchSize的个数时就进行flush操作
if (objectsWritten == serializerBatchSize) {
flush()
openWriter()
}
} if (objectsWritten > 0) {
flush()
} else if (writer != null) { val w = writer
writer = null
w.revertPartialWritesAndClose()
}
success = true
} finally { if (!success) { // This code path only happens if an exception was thrown above before we set success;
// close our stuff and let the exception be thrown further
if (writer != null) {
writer.revertPartialWritesAndClose()
} if (file.exists()) { if (!file.delete()) {
logWarning(s"Error deleting ${file}")
}
}
}
} // 实例化SpilledFile,并保存在数据结构ArrayBuffer[SpilledFile]中
spills.append(SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition))
}先来简单的看一下排序部分的逻辑:
def destructiveSortedWritablePartitionedIterator(keyComparator: Option[Comparator[K]])
: WritablePartitionedIterator = { // 这里的partitionedDestructiveSortedIterator会根据是map或者buffer有不同的实现
val it = partitionedDestructiveSortedIterator(keyComparator) // 最后返回的是WritablePartitionedIterator,上面进行写操作的时候就是调用该迭代器中的writeNext方法
new WritablePartitionedIterator { private[this] var cur = if (it.hasNext) it.next() else null
def writeNext(writer: DiskBlockObjectWriter): Unit = {
writer.write(cur._1._2, cur._2)
cur = if (it.hasNext) it.next() else null
} def hasNext(): Boolean = cur != null
def nextPartition(): Int = cur._1._1
}
}如果collection是map,具体的实现为(PartitionedAppendOnlyMap):
def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
: Iterator[((Int, K), V)] = { val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
destructiveSortedIterator(comparator)
}如果collection是buffer,具体的实现为(PartitionedPairBuffer):
override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
: Iterator[((Int, K), V)] = { val comparator = keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator) new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator)
iterator
}这里需要注意,首先都是获取比较器,比较的数据格式为((partition Id, key), value),而两者的区别在于前者才是真正的Destructive级别的,具体的实现在destructiveSortedIterator方法中,而不管采用哪种方式,其底层都是通过timSort算法实现的,具体的排序逻辑就不详细说明了,有兴趣的朋友可以深入研究下去。
接下来就进入到第三步。
2.1.3 第三步
先贴出该步骤的代码:
...// 这里注释说的很清楚了,只打开了一个文件// Don't bother including the time to open the merged output file in the shuffle write time,// because it just opens a single file, so is typically too fast to measure accurately// (see SPARK-3570).val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)val tmp = Utils.tempFileWith(output)try { // 构造blockId
val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID) // 写数据
val partitionLengths = sorter.writePartitionedFile(blockId, tmp) // 写index文件
shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp) // 进行Shuffle Read的时候需要参考该信息
mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
} finally { if (tmp.exists() && !tmp.delete()) {
logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
}
}writePartitionedFile:
首先来看writePartitionedFile方法:
def writePartitionedFile(
blockId: BlockId,
outputFile: File): Array[Long] = { // Track location of each range in the output file
val lengths = new Array[Long](numPartitions) // 首先判断spills中是否有数据,即判断是否有数据被spill到了磁盘中
if (spills.isEmpty) { // 数据只在内存中的情况
// Case where we only have in-memory data
val collection = if (aggregator.isDefined) map else buffer // 获得迭代器
val it = collection.destructiveSortedWritablePartitionedIterator(comparator) // 进行迭代并将数据写到磁盘
while (it.hasNext) { val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
context.taskMetrics.shuffleWriteMetrics.get) val partitionId = it.nextPartition() while (it.hasNext && it.nextPartition() == partitionId) {
it.writeNext(writer)
}
writer.commitAndClose() val segment = writer.fileSegment() // 最后返回的是每个partition写入的数据的长度
lengths(partitionId) = segment.length
}
} else { // 如果有数据被spill到了磁盘中,我们就需要进行merge-sort操作
// We must perform merge-sort; get an iterator by partition and write everything directly.
for ((id, elements) <- this.partitionedIterator) { if (elements.hasNext) { val writer = blockManager.getDiskWriter(blockId, outputFile, serInstance, fileBufferSize,
context.taskMetrics.shuffleWriteMetrics.get) for (elem <- elements) {
writer.write(elem._1, elem._2)
}
writer.commitAndClose() val segment = writer.fileSegment()
lengths(id) = segment.length
}
}
}
context.taskMetrics().incMemoryBytesSpilled(memoryBytesSpilled)
context.taskMetrics().incDiskBytesSpilled(diskBytesSpilled)
context.internalMetricsToAccumulators( InternalAccumulator.PEAK_EXECUTION_MEMORY).add(peakMemoryUsedBytes)
lengths
}下面我们就看一下this.partitionedIterator即内存和磁盘中的数据是如何合到一起的:
def partitionedIterator: Iterator[(Int, Iterator[Product2[K, C]])] = { val usingMap = aggregator.isDefined val collection: WritablePartitionedPairCollection[K, C] = if (usingMap) map else buffer if (spills.isEmpty) { // Special case: if we have only in-memory data, we don't need to merge streams, and perhaps
// we don't even need to sort by anything other than partition ID
if (!ordering.isDefined) { // The user hasn't requested sorted keys, so only sort by partition ID, not key
groupByPartition(collection.partitionedDestructiveSortedIterator(None))
} else { // We do need to sort by both partition ID and key
groupByPartition(collection.partitionedDestructiveSortedIterator(Some(keyComparator)))
}
} else { // Merge spilled and in-memory data
merge(spills, collection.partitionedDestructiveSortedIterator(comparator))
}
}我们只考虑spills不为空的情况,即执行merge方法:
private def merge(spills: Seq[SpilledFile], inMemory: Iterator[((Int, K), C)])
: Iterator[(Int, Iterator[Product2[K, C]])] = { // 根据每个SpilledFile实例化一个SpillReader,这些SpillReader组成一个Seq
val readers = spills.map(new SpillReader(_)) // 获得内存BufferedIterator
val inMemBuffered = inMemory.buffered // 根据partition的个数进行迭代
(0 until numPartitions).iterator.map { p => // 实例化IteratorForPartition,即当前partition下的Iterator
val inMemIterator = new IteratorForPartition(p, inMemBuffered) // 这里就是合并操作
val iterators = readers.map(_.readNextPartition()) ++ Seq(inMemIterator) if (aggregator.isDefined) { // Perform partial aggregation across partitions
// 如果需要map端的combine操作则需要根据key进行聚合操作
(p, mergeWithAggregation(
iterators, aggregator.get.mergeCombiners, keyComparator, ordering.isDefined))
} else if (ordering.isDefined) { // No aggregator given, but we have an ordering (e.g. used by reduce tasks in sortByKey);
// sort the elements without trying to merge them
// 排序合并,例如sortByKey
(p, mergeSort(iterators, ordering.get))
} else {
(p, iterators.iterator.flatten)
}
}
}具体的mergeWithAggregation和mergeSort就不一一说明了,下面再来看一下writeIndexFileAndCommit
writeIndexFileAndCommit:
再来看writeIndexFileAndCommit方法:
def writeIndexFileAndCommit(
shuffleId: Int,
mapId: Int,
lengths: Array[Long],
dataTmp: File): Unit = { val indexFile = getIndexFile(shuffleId, mapId) val indexTmp = Utils.tempFileWith(indexFile) try { val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(indexTmp))) Utils.tryWithSafeFinally { // We take in lengths of each block, need to convert it to offsets.
var offset = 0L
out.writeLong(offset) for (length <- lengths) {
offset += length
out.writeLong(offset)
}
} {
out.close()
} val dataFile = getDataFile(shuffleId, mapId) // There is only one IndexShuffleBlockResolver per executor, this synchronization make sure
// the following check and rename are atomic.
synchronized { val existingLengths = checkIndexAndDataFile(indexFile, dataFile, lengths.length) if (existingLengths != null) { // Another attempt for the same task has already written our map outputs successfully,
// so just use the existing partition lengths and delete our temporary map outputs.
System.arraycopy(existingLengths, 0, lengths, 0, lengths.length) if (dataTmp != null && dataTmp.exists()) {
dataTmp.delete()
}
indexTmp.delete()
} else { // This is the first successful attempt in writing the map outputs for this task,
// so override any existing index and data files with the ones we wrote.
if (indexFile.exists()) {
indexFile.delete()
} if (dataFile.exists()) {
dataFile.delete()
} if (!indexTmp.renameTo(indexFile)) { throw new IOException("fail to rename file " + indexTmp + " to " + indexFile)
} if (dataTmp != null && dataTmp.exists() && !dataTmp.renameTo(dataFile)) { throw new IOException("fail to rename file " + dataTmp + " to " + dataFile)
}
}
}
} finally { if (indexTmp.exists() && !indexTmp.delete()) {
logError(s"Failed to delete temporary index file at ${indexTmp.getAbsolutePath}")
}
}
}具体的实现就不详细说明了,主要就是根据上一步的到的partition长度的数组将偏移量写入到index文件中。
最后就是实例化MapStatus,shuffle read的时候根据MapStatus获取数据。至此BaseShuffleHandle & SortShuffleWriter的部分就结束了。
BypassMergeSortShuffleHandle & BypassMergeSortShuffleWriter
再来看一下BypassMergeSortShuffleWriter的write方法:
public void write(Iterator<Product2<K, V>> records) throws IOException {
assert (partitionWriters == null); if (!records.hasNext()) {
partitionLengths = new long[numPartitions];
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths); return;
} final SerializerInstance serInstance = serializer.newInstance(); final long openStartTime = System.nanoTime();
partitionWriters = new DiskBlockObjectWriter[numPartitions]; // 针对每一个reducer建立一个临时文件
for (int i = 0; i < numPartitions; i++) { final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
blockManager.diskBlockManager().createTempShuffleBlock(); final File file = tempShuffleBlockIdPlusFile._2(); final BlockId blockId = tempShuffleBlockIdPlusFile._1();
partitionWriters[i] =
blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics).open();
} // Creating the file to write to and creating a disk writer both involve interacting with
// the disk, and can take a long time in aggregate when we open many files, so should be
// included in the shuffle write time.
writeMetrics.incShuffleWriteTime(System.nanoTime() - openStartTime); // 根据partition将记录写入到不同的临时文件中
while (records.hasNext()) { final Product2<K, V> record = records.next(); final K key = record._1();
partitionWriters[partitioner.getPartition(key)].write(key, record._2());
} for (DiskBlockObjectWriter writer : partitionWriters) {
writer.commitAndClose();
} // 将所有的临时文件内容按照partition Id合并到一个文件
File output = shuffleBlockResolver.getDataFile(shuffleId, mapId); File tmp = Utils.tempFileWith(output); try { // 将记录和partition长度信息分别写入到data文件和index文件中
partitionLengths = writePartitionedFile(tmp);
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally { if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
}
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}该writer在进行写记录之前会根据reducer的个数(例如R个)生成R个临时文件,然后将记录写入对应的临时文件中,最后将这些文件进行合并操作并写入到一个文件中,由于直接将记录写入了临时文件,并没有缓存在内存中,所以如果reducer的个数过多的话,就会为每个reducer打开一个临时文件,如果reducer的数量过多的话就会影响性能,所以使用该种方式需要满足一下条件(下面是源码中的注释):
no Ordering is specified.
no Aggregator is specified.
the number of partitions is less than spark.shuffle.sort.bypassMergeThreshold.
其中spark.shuffle.sort.bypassMergeThreshold的个数默认为200个。
SerializedShuffleHandle & UnsafeShuffleWriter
最后再来看一下UnsafeShuffleWriter,也就是通常所说的Tungsten。
UnsafeShuffleWriter的write方法:
public void write(scala.collection.Iterator<Product2<K, V>> records) throws IOException { // Keep track of success so we know if we encountered an exception
// We do this rather than a standard try/catch/re-throw to handle
// generic throwables.
boolean success = false; try { while (records.hasNext()) { // 循环便利所有记录,对其作用insertRecordIntoSorter方法
insertRecordIntoSorter(records.next());
} // 将数据输出到磁盘上
closeAndWriteOutput();
success = true;
} finally { if (sorter != null) { try {
sorter.cleanupResources();
} catch (Exception e) { // Only throw this error if we won't be masking another
// error.
if (success) { throw e;
} else {
logger.error("In addition to a failure during writing, we failed during " + "cleanup.", e);
}
}
}
}
}首先来看insertRecordIntoSorter:
void insertRecordIntoSorter(Product2<K, V> record) throws IOException { assert(sorter != null); final K key = record._1(); final int partitionId = partitioner.getPartition(key);
serBuffer.reset();
serOutputStream.writeKey(key, OBJECT_CLASS_TAG);
serOutputStream.writeValue(record._2(), OBJECT_CLASS_TAG);
serOutputStream.flush(); final int serializedRecordSize = serBuffer.size(); assert (serializedRecordSize > 0);
sorter.insertRecord(
serBuffer.getBuf(), Platform.BYTE_ARRAY_OFFSET, serializedRecordSize, partitionId);
}可以看出实际上调用的是ShuffleExternalSorter的insertRecord方法,限于篇幅,具体的底层实现暂不说明,以后有时间会单独分析一下Tungsten的部分,接下来看一下closeAndWriteOutput方法:
void closeAndWriteOutput() throws IOException { assert(sorter != null);
updatePeakMemoryUsed();
serBuffer = null;
serOutputStream = null; // 获得spilled的文件
final SpillInfo[] spills = sorter.closeAndGetSpills();
sorter = null; final long[] partitionLengths; // 最终的输出文件
final File output = shuffleBlockResolver.getDataFile(shuffleId, mapId); // 临时文件
final File tmp = Utils.tempFileWith(output); try { try { // 将spilled的文件合并并写入到临时文件
partitionLengths = mergeSpills(spills, tmp);
} finally { for (SpillInfo spill : spills) { if (spill.file.exists() && ! spill.file.delete()) {
logger.error("Error while deleting spill file {}", spill.file.getPath());
}
}
} // 将partition的长度信息写入index文件
shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
} finally { if (tmp.exists() && !tmp.delete()) {
logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
}
} // mapStatus
mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
}用图来总结一下上面描述的三种方式如下所示:
BaseShuffleHandle & SortShuffleWriter
SerializedShuffleHandle & UnsafeShuffleWriter
BypassMergeSortShuffleHandle & BypassMergeSortShuffleWriter
最后需要说明的是如果采用的是SortShuffleManager,最后每个task产生的文件的个数为2 * M(M代表Mapper端ShuffleMapTask的个数),相对于Hash的方式来说文件的个数明显减少。
至此Shuffle Write的部分就分析完了,下一遍文章会继续分析Shuffle Read的部分。
本文为原创,欢迎转载,转载请注明出处、作者,谢谢!
作者:sun4lower
链接:https://www.jianshu.com/p/766290f06329
共同学习,写下你的评论
评论加载中...
作者其他优质文章






