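The previous post walked through how Spark is deployed in Standalone mode. It mentioned that once a Worker has registered, the Master runs a schedule operation to allocate resources. This post looks at exactly how that method allocates them.
Note: all posts in this series use the Spark-1.6.3 source code as the reference. Where Spark-2.1.0 brings a significant change, it is pointed out as well.
When is schedule called?
The schedule method is called to reallocate resources whenever a new application joins or the available resources change. So how does it allocate them? Let's go through it at the source level.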
schedule
Here is the source of schedule:
// Resources can only be allocated while the Master is in the ALIVE state
if (state != RecoveryState.ALIVE) { return }
// Drivers take strict precedence over executors
// In other words: waiting drivers are launched first, then executors
// 1. Shuffle the ALIVE workers so that drivers do not always land on the same worker
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
// 2. Loop over the waiting drivers and launch them
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  var launched = false
  var numWorkersVisited = 0
  // 2.1 Continue while some alive workers are still unvisited and the driver is not yet launched
  while (numWorkersVisited < numWorkersAlive && !launched) {
    // 2.2 Pick a worker: the first index is 0, later ones follow curPos = (curPos + 1) % numWorkersAlive
    val worker = shuffledAliveWorkers(curPos)
    // 2.3 Count this worker as visited
    numWorkersVisited += 1
    // 2.4 Check whether the worker's free memory and cores satisfy the driver's requirements
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      // 2.5 If they do, launch the driver on this worker
      launchDriver(worker, driver)
      // 2.6 Remove the driver from the waiting queue and mark it as launched
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % numWorkersAlive
  }
}
// 3. Launch executors
startExecutorsOnWorkers()
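To make the round-robin placement easier to follow, here is a minimal, self-contained sketch of the same loop. The `Worker` and `Driver` case classes and the `place` function are hypothetical stand-ins invented for this illustration, not Spark's `WorkerInfo`/`DriverInfo`, and the shuffle is left out so that the result is deterministic.

```scala
object DriverPlacementSketch {
  // Hypothetical, simplified stand-ins for Spark's WorkerInfo and DriverInfo.
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  /** Returns driverId -> workerId for every driver that could be placed. */
  def place(workers: Seq[Worker], waiting: Seq[Driver]): Map[String, String] = {
    val numWorkers = workers.size
    var curPos = 0 // shared across drivers, so the next driver continues where the last one stopped
    var placed = Map.empty[String, String]
    for (driver <- waiting) {
      var launched = false
      var visited = 0
      // Visit each worker at most once per driver.
      while (visited < numWorkers && !launched) {
        val w = workers(curPos)
        visited += 1
        if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
          // Reserve the resources, as worker.addDriver does in the real code.
          w.memoryFree -= driver.mem
          w.coresFree -= driver.cores
          placed += (driver.id -> w.id)
          launched = true
        }
        curPos = (curPos + 1) % numWorkers
      }
    }
    placed
  }
}
```

A driver that fits on no worker is simply skipped here; in the Master it stays in waitingDrivers and is retried on the next call to schedule.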
Launching the Driver
The allocation flow is annotated in detail in the source above. Now let's look at the launchDriver method:
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  // 1. Log it
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // 2. Record the driver on the worker, which also increases the worker's used memory and cores
  worker.addDriver(driver)
  // 3. Give the driver a reference to this worker
  driver.worker = Some(worker)
  // 4. Send a LaunchDriver message to the Worker, telling it to start the driver
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // 5. Set the driver's state to RUNNING
  driver.state = DriverState.RUNNING
}
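The reason worker.addDriver matters is that the Master tracks each worker's free resources itself, and schedule relies on those numbers the next time it runs. Below is a rough sketch of that bookkeeping. The `Worker` class and its method signatures are assumptions made for this illustration and do not reproduce Spark's `WorkerInfo` API.

```scala
object WorkerBookkeepingSketch {
  // Hypothetical model of the Master-side view of one worker.
  final class Worker(val cores: Int, val memory: Int) {
    private var coresUsed = 0
    private var memoryUsed = 0

    // Free resources are derived from the totals and what has been handed out.
    def coresFree: Int = cores - coresUsed
    def memoryFree: Int = memory - memoryUsed

    // Called when a driver is placed on this worker.
    def addDriver(driverCores: Int, driverMem: Int): Unit = {
      coresUsed += driverCores
      memoryUsed += driverMem
    }

    // Called when the driver terminates and its resources are released.
    def removeDriver(driverCores: Int, driverMem: Int): Unit = {
      coresUsed -= driverCores
      memoryUsed -= driverMem
    }
  }
}
```

Because the resources are reserved before the LaunchDriver message is sent, a second driver examined in the same schedule pass already sees the reduced free values.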
Next, let's see how the Worker handles the LaunchDriver message:
case LaunchDriver(driverId, driverDesc) => {
  // 1. Log it
  logInfo(s"Asked to launch driver $driverId")
  // 2. Instantiate a DriverRunner
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // 3. Record the new driver in drivers
  drivers(driverId) = driver
  // 4. Start the driver
  driver.start()
  // 5. Record the change in resource usage
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
}
Following the call into driver.start():
// As the English comment says: start a thread to run and manage the driver
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      try {
        // Create the driver's working directory
        val driverDir = createWorkingDirectory()
        // Download the user's jar into the working directory and return its path
        val localJarFilename = downloadUserJar(driverDir)

        def substituteVariables(argument: String): String = argument match {
          case "{{WORKER_URL}}" => workerUrl
          case "{{USER_JAR}}" => localJarFilename
          case other => other
        }

        // TODO: If we add ability to submit multiple jars they should also be added here
        val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
          driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
        // The actual launch of the driver process; not analyzed further here
        launchDriver(builder, driverDir, driverDesc.supervise)
      } catch {
        case e: Exception => finalException = Some(e)
      }

      val state =
        if (killed) {
          DriverState.KILLED
        } else if (finalException.isDefined) {
          DriverState.ERROR
        } else {
          finalExitCode match {
            case Some(0) => DriverState.FINISHED
            case _ => DriverState.FAILED
          }
        }

      finalState = Some(state)

      worker.send(DriverStateChanged(driverId, state, finalException))
    }
  }.start()
}
As the code shows, once the driver has ended (finished, failed, been killed, or hit an error), the thread sends a DriverStateChanged message to the worker. On receiving it, the Worker calls handleDriverStateChanged to do some cleanup, which we won't go into here. The important part is that it forwards a DriverStateChanged message to the Master, and the Master then removes the driver's information:
case DriverStateChanged(driverId, state, exception) => {
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}
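That completes the process of allocating resources to the Driver and launching it. Next, let's look at how executors are launched, i.e. what startExecutorsOnWorkers() does.
Launching Executors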
startExecutorsOnWorkers():
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Applications are served first in, first out
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    // Keep the ALIVE workers with enough resources, sorted by free cores in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide how many cores to assign on each worker
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Then carry out the allocation
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
Let's first look at how the number of cores per worker is decided. Here we only quote the English comment of scheduleExecutorsOnWorkers and explain it; the implementation itself can be read in the source:
/**
 * Schedule executors to be launched on the workers.
 * Returns an array containing number of cores assigned to each worker.
 *
 * There are two modes of launching executors. The first attempts to spread out an application's
 * executors on as many workers as possible, while the second does the opposite (i.e. launch them
 * on as few workers as possible). The former is usually better for data locality purposes and is
 * the default.
 *
 * The number of cores assigned to each executor is configurable. When this is explicitly set,
 * multiple executors from the same application may be launched on the same worker if the worker
 * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
 * worker by default, in which case only one executor may be launched on each worker.
 *
 * It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
 * at a time). Consider the following example: cluster has 4 workers with 16 cores each.
 * User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
 * allocated at a time, 12 cores from each worker would be assigned to each executor.
 * Since 12 < 16, no executors would launch [SPARK-8881].
 */
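In short, there are two allocation modes. The first spreads an application's executors over as many workers as possible; the second does the opposite and packs them onto as few workers as possible. The first is the default and is usually better for data locality. The number of cores per executor is configurable (via spark-submit or spark-env.sh). If it is set, several executors of the same application may be placed on one worker, provided that worker has enough cores and memory. If it is not set, each executor takes all the cores available to it on the worker, so the application gets at most one executor per worker. When cores are actually handed out, the scheduler satisfies the request as far as it can: if the application asks for more cores than the workers have free, it assigns whatever is free.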
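The two modes, and the reason cores are handed out coresPerExecutor at a time, can be illustrated with a simplified sketch. It is an assumption-laden reduction of scheduleExecutorsOnWorkers: the `assignCores` function and its parameters are invented for this example, and the memory check and the per-application executor limit of the real method are deliberately left out.

```scala
object CoreAssignmentSketch {
  /**
   * Simplified model: decide how many cores to take from each worker.
   * coresFree(i) is the number of free cores on worker i.
   */
  def assignCores(
      coresFree: Array[Int],
      coresLeft: Int,
      coresPerExecutor: Option[Int],
      spreadOut: Boolean): Array[Int] = {
    // Cores are handed out in units of one executor (or 1 core if unset).
    val step = coresPerExecutor.getOrElse(1)
    val assigned = new Array[Int](coresFree.length)
    var toAssign = math.min(coresLeft, coresFree.sum)

    // A worker can take another unit if the app still needs it and the worker has room.
    def canTake(pos: Int): Boolean =
      toAssign >= step && coresFree(pos) - assigned(pos) >= step

    var free = coresFree.indices.filter(canTake)
    while (free.nonEmpty) {
      free.foreach { pos =>
        var keepGoing = true
        while (keepGoing && canTake(pos)) {
          toAssign -= step
          assigned(pos) += step
          // Spread out: one unit per worker per round, then move on.
          // Otherwise: keep filling this worker until it is exhausted.
          if (spreadOut) keepGoing = false
        }
      }
      free = free.filter(canTake)
    }
    assigned
  }
}
```

For the example in the comment (four workers with 16 free cores each, 48 cores requested, 16 cores per executor), this sketch assigns 16, 16, 16 and 0 cores, so three full executors fit. With two 8-core workers, 8 cores requested and no per-executor setting, spreading out gives 4 and 4, while packing gives 8 and 0.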
Next comes the step that actually allocates compute resources to executors and launches them:
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Record the executor in the application
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the executor
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
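The two values computed at the top of this method are plain integer arithmetic, which a small sketch makes concrete. The `split` helper below is hypothetical and exists only for this illustration; it mirrors the numExecutors and coresToAssign expressions shown above.

```scala
object ExecutorSplitSketch {
  /** Returns (numExecutors, coresPerLaunchedExecutor) for one worker. */
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    // With an explicit setting: integer division, any remainder is not used.
    // Without it: a single executor.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    // With an explicit setting: each executor gets exactly that many cores.
    // Without it: the single executor gets everything assigned on this worker.
    val coresEach = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresEach)
  }
}
```

For instance, 16 assigned cores with 4 cores per executor yields 4 executors of 4 cores, and 16 assigned cores with no setting yields 1 executor of 16 cores. With 12 assigned cores and 16 cores per executor the division gives 0 executors, which is the situation SPARK-8881 describes.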
Launching executors:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  // Tell the worker to launch the executor
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Then send the executor's information to the driver
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
When the worker receives the LaunchExecutor message, it performs the actual launch and reports back to the Master.
The Master also sends the executor's resource information to the driver, and once the driver has received it the application can run. That concludes the overall flow of allocating and launching executors.
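Finally, a diagram summarizing the simplified flow of launching the Driver and the Executors:
This post only gives an overview of how the Master allocates resources to the Driver and the Executors and launches them when it runs schedule. A later post will cover how a whole application runs, and we will go into more detail there.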
This article is original work. Reposting is welcome, but please credit the source and the author. Thanks!
Author: sun4lower
Link: https://www.jianshu.com/p/153ec6adf83c