I. Context
In the post 《Kafka-broker处理producer请求-leader篇》 we analyzed how the leader handles producer requests, but we never saw it write data to the followers. That is because Kafka uses a follower-pull model: each broker is the leader for some partitions and a follower for others. From the moment a broker starts, it keeps handling producer requests while continuously fetching data from the corresponding leaders, which is what advances the HW. Let's analyze how a follower synchronizes data from the leader.
II. The follower issues the fetch request
In the post 《Kafka-确定broker中的分区是leader还是follower》 we saw how a broker becomes a follower and adds the partitions it needs to fetch to the replica fetcher manager. Next, let's look at how it synchronizes the leader's data.
1. AbstractFetcherManager
A fetcher is assigned to each partition to pull data from its leader; partitions whose leaders sit on the same broker (and map to the same fetcher id) reuse a single thread.
def addFetcherForPartitions(partitionAndOffsets: Map[TopicPartition, InitialFetchState]): Unit = {
  // Group the partitions into fetchers, one per (leader broker, fetcher id)
  val partitionsPerFetcher = partitionAndOffsets.groupBy { case (topicPartition, brokerAndInitialFetchOffset) =>
    BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))
  }

  def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId,
                               brokerIdAndFetcherId: BrokerIdAndFetcherId): T = {
    // Create a thread for this fetcher and start it
    val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
    fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
    fetcherThread.start()
    fetcherThread
  }

  for ((brokerAndFetcherId, initialFetchOffsets) <- partitionsPerFetcher) {
    val brokerIdAndFetcherId = BrokerIdAndFetcherId(brokerAndFetcherId.broker.id, brokerAndFetcherId.fetcherId)
    val fetcherThread = fetcherThreadMap.get(brokerIdAndFetcherId) match {
      case Some(currentFetcherThread) if currentFetcherThread.leader.brokerEndPoint() == brokerAndFetcherId.broker =>
        // If several partitions have their leader on the same broker, reuse the fetcher thread to reduce the number of connections
        currentFetcherThread
      case Some(f) =>
        f.shutdown()
        addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
      case None =>
        addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
    }
    // When partitions are added to the thread, failed partitions are removed
    addPartitionsToFetcherThread(fetcherThread, initialFetchOffsets)
  }
}
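To make the grouping above easier to picture, here is a small self-contained Scala sketch (not Kafka code; the topics, leader assignment and fetcher count are made up for illustration). It buckets partitions by (leader broker, fetcher id), with the fetcher id derived from a hash of the topic and partition modulo an assumed number of fetchers per broker:

object FetcherGroupingSketch {
  case class TopicPartition(topic: String, partition: Int)
  case class BrokerAndFetcherId(brokerId: Int, fetcherId: Int)

  // Assumed setting, similar in spirit to num.replica.fetchers
  val numFetchersPerBroker = 2

  def getFetcherId(tp: TopicPartition): Int =
    math.abs(31 * tp.topic.hashCode + tp.partition) % numFetchersPerBroker

  def main(args: Array[String]): Unit = {
    // Hypothetical partition -> leader broker assignment
    val leaderOf = Map(
      TopicPartition("orders", 0) -> 1,
      TopicPartition("orders", 1) -> 1,
      TopicPartition("clicks", 0) -> 2)

    // Each key corresponds to one fetcher thread; its value is the set of partitions that thread serves
    val partitionsPerFetcher = leaderOf.keys.groupBy { tp =>
      BrokerAndFetcherId(leaderOf(tp), getFetcherId(tp))
    }
    partitionsPerFetcher.foreach { case (key, tps) => println(s"$key -> ${tps.mkString(", ")}") }
  }
}

Partitions that land in the same bucket share one connection to that leader broker, which is exactly why the real code reuses the existing fetcher thread instead of creating a new one.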
2. ReplicaFetcherManager
Next, let's look at the fetcher thread that gets created for each leader broker.
override def createFetcherThread(fetcherId: Int, sourceBroker: BrokerEndPoint): ReplicaFetcherThread = {
  //.......
  new ReplicaFetcherThread(threadName, leader, brokerConfig, failedPartitions, replicaManager,
    quotaManager, logContext.logPrefix, metadataVersionSupplier)
}
3. ReplicaFetcherThread
Now let's see what this fetcher thread does.
First, the inheritance hierarchy:
ReplicaFetcherThread extends AbstractFetcherThread extends ShutdownableThread
1. ShutdownableThread
The fetcher thread keeps pulling data from the leader in an endless loop.
public void run() {
    //......
    while (isRunning())
        doWork();
    //......
}
2. AbstractFetcherThread
override def doWork(): Unit = {
  maybeTruncate()
  maybeFetch()
}

private def maybeFetch(): Unit = {
  val fetchRequestOpt = inLock(partitionMapLock) {
    //....
  }
  // Send a fetch request to each leader broker in turn
  fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
    processFetchRequest(sessionPartitions, fetchRequest)
  }
}

private def processFetchRequest(...): Unit = {
  val partitionsWithError = mutable.Set[TopicPartition]()
  val divergingEndOffsets = mutable.Map.empty[TopicPartition, EpochEndOffset]
  var responseData: Map[TopicPartition, FetchData] = Map.empty

  try {
    // Send the fetch request
    trace(s"Sending fetch request $fetchRequest")
    responseData = leader.fetch(fetchRequest)
  } catch {
    //....
  }

  if (responseData.nonEmpty) {
    // Process the fetched data.
    // The response contains data for the different partitions whose leaders live on the same broker.
    responseData.forKeyValue { (topicPartition, partitionData) =>
      Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
        // While a fetch request is in flight, the partition may have been removed, re-added or truncated.
        // In that case we only want to process the fetch response if the partition state is ready for
        // fetching and the current offset matches the requested offset.
        val fetchPartitionData = sessionPartitions.get(topicPartition)
        if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
          Errors.forCode(partitionData.errorCode) match {
            // Data was fetched normally, without errors
            case Errors.NONE =>
              try {
                if (leader.isTruncationOnFetchSupported && FetchResponse.isDivergingEpoch(partitionData)) {
                  // If a diverging epoch is present, we truncate the replica's log but do not process the
                  // partition data, so that the low/high watermarks are not updated before the truncation
                  // actually completes. They will be updated on the next fetch.
                  divergingEndOffsets += topicPartition -> new EpochEndOffset()
                    .setPartition(topicPartition.partition)
                    .setErrorCode(Errors.NONE.code)
                    .setLeaderEpoch(partitionData.divergingEpoch.epoch)
                    .setEndOffset(partitionData.divergingEpoch.endOffset)
                } else {
                  // Once we hand the partition data to the subclass, we must no longer touch it in this thread
                  val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset, partitionData)
                  logAppendInfoOpt.foreach { logAppendInfo =>
                    val validBytes = logAppendInfo.validBytes
                    // Adjust the offset for the next fetch request
                    val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                    // How far behind the HW this replica is;
                    // if it is not 0, the thread must keep fetching at full speed
                    val lag = Math.max(0L, partitionData.highWatermark - nextOffset)
                    fetcherLagStats.getAndMaybePut(topicPartition).lag = lag

                    // ReplicaDirAlterThread may have removed topicPartition from the partition states after processing the partition data
                    if ((validBytes > 0 || currentFetchState.lag.isEmpty) && partitionStates.contains(topicPartition)) {
                      val lastFetchedEpoch =
                        if (logAppendInfo.lastLeaderEpoch.isPresent) logAppendInfo.lastLeaderEpoch.asScala else currentFetchState.lastFetchedEpoch
                      // Update partitionState only if there was no exception during processPartitionData
                      val newFetchState = PartitionFetchState(currentFetchState.topicId, nextOffset, Some(lag),
                        currentFetchState.currentLeaderEpoch, state = Fetching, lastFetchedEpoch)
                      // Update each partition's state based on the fetched data, so the next fetch uses the latest parameters
                      partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                      if (validBytes > 0) fetcherStats.byteRate.mark(validBytes)
                    }
                  }
                }
              } catch {
                //...
              }
            case Errors.OFFSET_OUT_OF_RANGE => ...
            //.... other error cases ...
          }
        }
      }
    }
  }

  if (divergingEndOffsets.nonEmpty)
    truncateOnFetchResponse(divergingEndOffsets)
  if (partitionsWithError.nonEmpty) {
    handlePartitionsWithErrors(partitionsWithError, "processFetchRequest")
  }
}
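As a side note, the nextOffset / lag bookkeeping above can be illustrated with a tiny standalone sketch (not Kafka code; the names and numbers are invented for illustration):

object FetchStateSketch {
  final case class FetchState(fetchOffset: Long, lag: Long)

  def nextState(current: FetchState, lastAppendedOffset: Long, validBytes: Int, highWatermark: Long): FetchState = {
    // The next fetch starts right after the last offset we appended, unless nothing was appended
    val nextOffset = if (validBytes > 0) lastAppendedOffset + 1 else current.fetchOffset
    // Lag of 0 means the follower has caught up to the leader's high watermark
    val lag = math.max(0L, highWatermark - nextOffset)
    FetchState(nextOffset, lag)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical numbers: we fetched offsets 100..149 while the leader's HW is 180
    println(nextState(FetchState(fetchOffset = 100, lag = 80), lastAppendedOffset = 149, validBytes = 4096, highWatermark = 180))
    // FetchState(150,30): the next fetch starts at 150 and the replica is still 30 records behind the HW
  }
}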
III. The leader handles the follower's fetch request
In the post 《Kafka-broker粗粒度启动流程》 we saw that KafkaApis maps every API key to its handler; the FETCH request is how the leader handles data pull requests from followers. Let's follow the source code to see the details.
class KafkaApis(
  //......
  request.header.apiKey match {
    case ApiKeys.PRODUCE => handleProduceRequest(request, requestLocal)
    case ApiKeys.FETCH => handleFetchRequest(request)
    //......
  }
}
1. handleFetchRequest
def handleFetchRequest(request: RequestChannel.Request): Unit = {
  // Get the request body
  val fetchRequest = request.body[FetchRequest]
  // Which topics to fetch data for.
  // The return type is LinkedHashMap<TopicIdPartition, PartitionData>
  val fetchData = fetchRequest.fetchData(topicNames)
  // Topics whose data should not be fetched in this request
  val forgottenTopics = fetchRequest.forgottenTopics(topicNames)
  // The fetch context
  val fetchContext = fetchManager.newContext(
    fetchRequest.version,
    fetchRequest.metadata,
    fetchRequest.isFromFollower,
    fetchData,
    forgottenTopics,
    topicNames)

  // Partitions with errors
  val erroneous = mutable.ArrayBuffer[(TopicIdPartition, FetchResponseData.PartitionData)]()
  // Partitions we are interested in
  val interesting = mutable.ArrayBuffer[(TopicIdPartition, FetchRequest.PartitionData)]()

  // Determine where the fetch comes from (follower or consumer) and filter at the topic level
  if (fetchRequest.isFromFollower) {
    // The request comes from a follower
    if (authHelper.authorize(request.context, CLUSTER_ACTION, CLUSTER, CLUSTER_NAME)) {
      fetchContext.foreachPartition { (topicIdPartition, data) =>
        if (topicIdPartition.topic == null)
          erroneous += topicIdPartition -> FetchResponse.partitionResponse(topicIdPartition, Errors.UNKNOWN_TOPIC_ID)
        else if (!metadataCache.contains(topicIdPartition.topicPartition))
          erroneous += topicIdPartition -> FetchResponse.partitionResponse(topicIdPartition, Errors.UNKNOWN_TOPIC_OR_PARTITION)
        else
          // Only topics that are authorized, non-null and present in the metadata cache are processed further
          interesting += topicIdPartition -> data
      }
    } else {
      fetchContext.foreachPartition { (topicIdPartition, _) =>
        erroneous += topicIdPartition -> FetchResponse.partitionResponse(topicIdPartition, Errors.TOPIC_AUTHORIZATION_FAILED)
      }
    }
  } else {
    // Consumer path, not analyzed here
  }

  def maybeDownConvertStorageError(error: Errors): Errors = {...}
  def maybeConvertFetchedData(...) = {...}
  def processResponseCallback(...) = {...}

  if (interesting.isEmpty) {
    // None of the requested topics are valid, so respond with an empty result right away
    processResponseCallback(Seq.empty)
  } else {
    // Maximum number of bytes allowed in the quota window
    val maxQuotaWindowBytes = if (fetchRequest.isFromFollower)
      Int.MaxValue
    else
      quotas.fetch.getMaxValueInQuotaWindow(request.session, clientId).toInt

    val fetchMaxBytes = Math.min(Math.min(fetchRequest.maxBytes, config.fetchMaxBytes), maxQuotaWindowBytes)
    val fetchMinBytes = Math.min(fetchRequest.minBytes, fetchMaxBytes)

    val clientMetadata: Optional[ClientMetadata] = if (versionId >= 11) {...} else {...}
    val params = new FetchParams(...)

    // All data reads go through the replicaManager.
    // processResponseCallback is the callback run after the read; we will come back to it.
    replicaManager.fetchMessages(
      params = params,
      fetchInfos = interesting,
      quota = replicationQuota(fetchRequest),
      responseCallback = processResponseCallback,
    )
  }
}
2. replicaManager.fetchMessages
Fetch messages from a replica and wait until enough data has been gathered before returning; the callback (processResponseCallback) is triggered once either the timeout or the required amount of fetch data is satisfied.
Consumers may fetch from any replica, while a follower can only fetch from the leader.
def fetchMessages(params: FetchParams, ...): Unit = {
  // logReadResults has type Seq[(TopicIdPartition, LogReadResult)]
  val logReadResults = readFromLog(params, fetchInfos, quota, readFromPurgatory = false)

  // The first topic partition that has to be read from remote storage
  var remoteFetchInfo: Optional[RemoteStorageFetchInfo] = Optional.empty()
  // The results just read from the local log
  val logReadResultMap = new mutable.HashMap[TopicIdPartition, LogReadResult]

  // Walk over each topicIdPartition to check whether the local read succeeded;
  // for some errors the follower may have to read from another replica instead
  logReadResults.foreach { case (topicIdPartition, logReadResult) =>
    //......
  }

  // Three branches:
  // 1. Tell the client to fetch again from the broker hosting the preferred replica
  // 2. The read from the leader succeeded (fully or partially); return only the successful data
  // 3. Enter the purgatory and wait for new data to arrive
  if (!remoteFetchInfo.isPresent && (params.maxWaitMs <= 0 || fetchInfos.isEmpty || bytesReadable >= params.minBytes || errorReadingData ||
      hasDivergingEpoch || hasPreferredReadReplica)) {
    val fetchPartitionData = logReadResults.map { case (tp, result) =>
      val isReassignmentFetch = params.isFromFollower && isAddingReplica(tp.topicPartition, params.replicaId)
      // Fetch again in a new batch
      tp -> result.toFetchPartitionData(isReassignmentFetch)
    }
    responseCallback(fetchPartitionData)
  } else {
    // Construct the fetch status from the read results
    val fetchPartitionStatus = new mutable.ArrayBuffer[(TopicIdPartition, FetchPartitionStatus)]
    fetchInfos.foreach { case (topicIdPartition, partitionData) =>
      logReadResultMap.get(topicIdPartition).foreach(logReadResult => {
        val logOffsetMetadata = logReadResult.info.fetchOffsetMetadata
        fetchPartitionStatus += (topicIdPartition -> FetchPartitionStatus(logOffsetMetadata, partitionData))
      })
    }

    if (remoteFetchInfo.isPresent) {
      val maybeLogReadResultWithError = processRemoteFetch(remoteFetchInfo.get(), params, responseCallback, logReadResults, fetchPartitionStatus)
      if (maybeLogReadResultWithError.isDefined) {
        // If there was an error scheduling the remote fetch task, return what we currently have
        // (the data of the other topic partitions read from local log segments), plus an error for
        // the topic partition we could not read from remote storage
        val partitionToFetchPartitionData = buildPartitionToFetchPartitionData(logReadResults, remoteFetchInfo.get().topicPartition, maybeLogReadResultWithError.get)
        responseCallback(partitionToFetchPartitionData)
      }
    } else {
      // Enter the purgatory.
      // If there is not enough data to respond and no remote data either, let the fetch request wait for new data.
      //.....
      delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)
    }
  }
}
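The purgatory branch is only referenced above (delayedFetchPurgatory.tryCompleteElseWatch). As a rough illustration of the idea, not of Kafka's actual DelayedFetch implementation, here is a minimal Scala sketch: the parked request completes either when enough bytes have accumulated or when the maximum wait time expires, whichever comes first:

object DelayedFetchSketch {
  final class DelayedFetch(minBytes: Int, maxWaitMs: Long, respond: Long => Unit) {
    private val createdAt = System.currentTimeMillis()
    private var accumulatedBytes = 0L
    private var completed = false

    // Called whenever new data is appended to one of the watched partitions
    def onNewData(bytes: Long): Unit = synchronized {
      accumulatedBytes += bytes
      if (!completed && accumulatedBytes >= minBytes) complete()
    }

    // Called by a timer to enforce the maximum wait time
    def maybeExpire(): Unit = synchronized {
      if (!completed && System.currentTimeMillis() - createdAt >= maxWaitMs) complete()
    }

    private def complete(): Unit = { completed = true; respond(accumulatedBytes) }
  }

  def main(args: Array[String]): Unit = {
    val delayed = new DelayedFetch(minBytes = 1024, maxWaitMs = 500, bytes => println(s"responding with $bytes bytes"))
    delayed.onNewData(600)   // not enough yet, keep waiting
    delayed.onNewData(600)   // now >= minBytes, the response is sent
  }
}

This is also why the summary below says that a fully caught-up follower does not get an immediate response: its request simply sits in the purgatory until new data arrives or the wait time runs out.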
3. Reading data locally
At this point nothing is actually read from storage yet: the code only locates the target data and wraps that location into an object; the real read happens when NIO sends the data out. Since the fetched data is usually very fresh (produced a few tens or hundreds of milliseconds earlier), the leader's data is most likely still in the page cache, so follower synchronization can often complete without any disk I/O.
The call chain is:
readFromLog() --> the nested read() --> Partition.fetchRecords() --> Partition.readRecords() --> UnifiedLog.read() --> LocalLog.read() --> LogSegment.read()
LogSegment.read() finally wraps the result into a FetchDataInfo object and returns it.
1. LogSegment
// A Records implementation backed by a file. An optional start and end position can be applied to this
// instance to allow slicing a range of the log records.
private final FileRecords log;

public FetchDataInfo read(...) throws IOException {
    //......
    return new FetchDataInfo(offsetMetadata, log.slice(startPosition, fetchSize),
        adjustedMaxSize < startOffsetAndSize.size, Optional.empty());
}
2. FileRecords
private final FileChannel channel;
private volatile File file;

public FileRecords slice(int position, int size) throws IOException {
    int availableBytes = availableBytes(position, size);
    int startPosition = this.start + position;
    // Nothing is read here. From the way the leader handles producer data we know it wrote the data into
    // the page cache, and very little time has passed since it was produced.
    // The exception is a broker that has just started and is catching up with the leader; that data may only exist on disk.
    // If the data is still in the page cache, the in-memory slice can be sent straight to the NIC:
    // 1. Traditional I/O:
    //    disk -> kernel buffer -> user buffer -> kernel socket buffer -> protocol engine
    // 2. sendfile:
    //    disk -> kernel buffer -> kernel socket buffer -> protocol engine
    // 3. sendfile (DMA gather copy):
    //    disk -> kernel buffer -> protocol engine
    return new FileRecords(file, channel, startPosition, startPosition + availableBytes, true);
}
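To make the sendfile path concrete, here is a minimal Scala sketch (not Kafka code; the file path and destination address are made up for illustration) that pushes a slice of a file to a socket with FileChannel.transferTo, which the JVM maps to sendfile on Linux:

import java.net.InetSocketAddress
import java.nio.channels.{FileChannel, SocketChannel}
import java.nio.file.{Paths, StandardOpenOption}

object ZeroCopySendSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical segment file and destination, purely for illustration
    val fileChannel = FileChannel.open(Paths.get("/tmp/demo-segment.log"), StandardOpenOption.READ)
    val socketChannel = SocketChannel.open(new InetSocketAddress("localhost", 9999))
    try {
      val position = 0L                  // start of the slice inside the file
      val count = fileChannel.size()     // number of bytes to send
      var transferred = 0L
      // transferTo lets the kernel move bytes from the page cache to the socket
      // without copying them into user space (sendfile under the hood on Linux)
      while (transferred < count) {
        transferred += fileChannel.transferTo(position + transferred, count - transferred, socketChannel)
      }
    } finally {
      fileChannel.close()
      socketChannel.close()
    }
  }
}

If the slice is still in the page cache, no disk read happens at all; the kernel hands the cached pages directly to the network stack.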
4. Executing the callback
The callback also distinguishes whether the request came from a follower or a consumer.
def processResponseCallback(responsePartitionData: Seq[(TopicIdPartition, FetchPartitionData)]): Unit = {
  //......
  if (fetchRequest.isFromFollower) {
    //.....
    // The request parameter is the very request the follower sent; the NIO channel behind it is both readable and writable
    requestHelper.sendResponseExemptThrottle(request, createResponse(0), Some(updateConversionStats))
  } else {
    //....
    requestChannel.sendResponse(request, createResponse(maxThrottleTimeMs), Some(updateConversionStats))
  }
}
Finally the response is put into a queue, and the RPC layer uses NIO to drain that queue and send the response to the follower or consumer. We will devote a separate post to Kafka's network layer.
IV. Summary
1. When a broker starts, or a topic is created, each broker ends up as the leader for some partitions and as a follower for others.
2. The producer sends data to the leader, which writes it into the page cache of the broker it lives on.
3. For the partitions where it acts as a follower, each broker runs an endless loop that keeps reading data from the leader.
4. When the leader receives a follower's fetch request, it reads the data locally (at this point most likely from the page cache, without triggering a disk read).
5. The leader returns a response to the follower (the data, a redirect to another broker's replica, or a retry).
6. If the follower has already caught up with the leader, the leader does not respond immediately; it turns the request into a delayed fetch, which reduces bandwidth consumption.