
MapTRv2 Code Reading


1 Data Processing (to be added later)

2 Model Structure

2.1 Backbone+Neck

  • The input is six single-frame images with no temporal stacking, with shape B×6×3×480×800 (B is the batch size). The images first go through grid_mask data augmentation (see https://blog.csdn.net/u013685264/article/details/122667456), then a plain ResNet-50 backbone, producing the final 32×-downsampled feature map of shape B×6×2048×15×25. This then passes through the neck (mainly two Conv2d layers for channel reduction), giving an output of shape B×6×256×15×25. A shape sketch follows below.
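
To make the shape flow concrete, here is a minimal PyTorch sketch of this path. It is not the actual MapTRv2 code: the torchvision ResNet-50 and the single 1×1-conv neck are stand-ins chosen only to reproduce the shapes described above.

```python
import torch
import torch.nn as nn
import torchvision

B = 2                                    # batch size
imgs = torch.randn(B, 6, 3, 480, 800)    # 6 surround-view cameras per sample

backbone = torchvision.models.resnet50()
# keep everything up to the stride-32 stage (drop avgpool / fc)
stem = nn.Sequential(*list(backbone.children())[:-2])

# fold the camera dim into the batch dim before the 2D backbone
x = imgs.flatten(0, 1)                   # (B*6, 3, 480, 800)
feat = stem(x)                           # (B*6, 2048, 15, 25) -- 32x downsampling

# neck: channel reduction 2048 -> 256 (the blog describes two Conv2d layers;
# a single 1x1 conv is used here purely to illustrate the shape change)
neck = nn.Conv2d(2048, 256, kernel_size=1)
out = neck(feat).reshape(B, 6, 256, 15, 25)
print(out.shape)                         # torch.Size([2, 6, 256, 15, 25])
```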

2.2 BEV Features

  • The two mainstream ways of generating BEV features are BEVFormer and LSS; details on both to be added later. The result is a BEV feature of shape B×20000×256 (20000 corresponds to the 200×100 BEV grid, h×w). LSS additionally produces a depth feature of shape B×6×68×15×25 used for depth supervision later. See the shape sketch below.
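
A small shape-only sketch of how the 200×100 BEV grid becomes 20000 tokens. Variable names (bev_h, bev_w, embed_dims) follow common MapTR configs and are assumptions here:

```python
import torch

B, bev_h, bev_w, embed_dims = 2, 200, 100, 256
bev_feat = torch.randn(B, embed_dims, bev_h, bev_w)   # (B, 256, 200, 100)
bev_embed = bev_feat.flatten(2).permute(0, 2, 1)      # (B, 20000, 256), tokens in h*w order

depth = torch.randn(B, 6, 68, 15, 25)                 # per-camera depth logits for LSS supervision
```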

2.3 Decoder Module

  • The input queries use the instance_pts form: instances (350 in total, i.e. 50 + 300, where 50 are one2one and 300 are the one2many set, a 6× expansion) and the 20 points per instance are initialized separately. The result is object_query_embeds of shape 7000×512 (7000 = 350×20; 512 is the query and query_pos concatenated together, i.e. the query and query_pos features are each 350×20×256).
  • A self_attn_mask of shape 350×350 is also set up: only the top-left 50×50 block and the bottom-right 300×300 block are False. This separates the one2one and one2many queries so they do not interfere with each other. A sketch of both constructions follows below.
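
Below is a hedged reconstruction of the two constructions above: the decoupled instance/point query embedding and the block-diagonal self_attn_mask. Module names (instance_embed, pts_embed) are illustrative, not the repository's exact identifiers.

```python
import torch
import torch.nn as nn

num_one2one, num_one2many, num_pts, dim = 50, 300, 20, 256
num_ins = num_one2one + num_one2many                  # 350

instance_embed = nn.Embedding(num_ins, dim * 2)       # query + query_pos, concatenated
pts_embed = nn.Embedding(num_pts, dim * 2)

# broadcast-add instance and point embeddings, then flatten to 7000 x 512
object_query_embeds = (instance_embed.weight[:, None, :]
                       + pts_embed.weight[None, :, :]).flatten(0, 1)   # (7000, 512)

# self_attn_mask: True = blocked. Only the 50x50 (one2one) and 300x300 (one2many)
# diagonal blocks stay False, so the two query groups never attend to each other.
self_attn_mask = torch.ones(num_ins, num_ins, dtype=torch.bool)
self_attn_mask[:num_one2one, :num_one2one] = False
self_attn_mask[num_one2one:, num_one2one:] = False
```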

2.3.1 Decoder process (largely follows deformable attention)

```
MapTRDecoder(
  (layers): ModuleList(
    (0): DecoupledDetrTransformerDecoderLayer(
      (attentions): ModuleList(
        (0): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (1): MultiheadAttention(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (proj_drop): Dropout(p=0.0, inplace=False)
          (dropout_layer): Dropout(p=0.1, inplace=False)
        )
        (2): CustomMSDeformableAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (sampling_offsets): Linear(in_features=256, out_features=64, bias=True)
          (attention_weights): Linear(in_features=256, out_features=32, bias=True)
          (value_proj): Linear(in_features=256, out_features=256, bias=True)
          (output_proj): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (ffns): ModuleList(
        (0): FFN(
          (activate): ReLU(inplace=True)
          (layers): Sequential(
            (0): Sequential(
              (0): Linear(in_features=256, out_features=512, bias=True)
              (1): ReLU(inplace=True)
              (2): Dropout(p=0.1, inplace=False)
            )
            (1): Linear(in_features=512, out_features=256, bias=True)
            (2): Dropout(p=0.1, inplace=False)
          )
          (dropout_layer): Identity()
        )
      )
      (norms): ModuleList(
        (0): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (1)-(5): DecoupledDetrTransformerDecoderLayer(...)   # five more layers, identical to (0)
  )
)
```
  • query_pos is passed through a linear layer to get reference_points of shape B×7000×2, followed by a sigmoid to get the initial init_reference_out, also B×7000×2 (see the first sketch after this list).
  • The img_neck features, with cams_embeds and level_embeds added, are flattened into feat_flatten of shape 6×375×B×256 (375 = 15×25).
  • The decoder loop then proceeds as follows.
  • There are 6 decoder layers; each layer runs self-attention, layer_norm, self-attention, layer_norm, cross-attention, layer_norm, FFN, layer_norm.
  • The first self-attention is an nn.MultiheadAttention taking query and query_pos as input. This is where the self_attn_mask from above is used; inside the nn.MultiheadAttention module the mask is applied as mask = 1 - attn_mask, matching the setup above.
  • The second self-attention is the same, except attn_mask is set to None.
  • The cross-attention uses CustomMSDeformableAttention, with the query as input, key=None, and the bev_embed as value. The value goes through a Linear to become the final value input. The query goes through a Linear to produce the multi-head sampling_offsets of shape B×7000×8×1×4×2 (7000 is the 350×20 instance points, 8 is the number of heads, 1 means a single level, 4 is the number of sampled points, 2 is the xy offset), and through another Linear to produce the multi-head attention_weights of shape B×7000×8×1×4 (same meaning per dimension), followed by a softmax. The final sampling_locations are reference_points + sampling_offsets / spatial_shape. In short: starting from reference_points, add the 4 offsets to get 4 sampling positions, bilinearly interpolate the value features at those positions, multiply by attention_weights and sum, giving output of shape B×7000×256; after a Linear and a residual connection with the input query, the final cross-attention output is 7000×B×256 (a condensed sketch of this computation follows after this list).
  • The FFN is Linear(256→512) + ReLU + Dropout, then Linear(512→256) + Dropout, with a residual connection (dropout_layer is Identity here); see the FFN entry in the module dump above.
  • The final output B×7000×256 becomes the query input of the next layer. output also goes through reg_branches (Linear+ReLU+Linear+ReLU+Linear) to predict offsets for new reference points; these are added to the inverse-sigmoid of the incoming reference_points to get new_reference_points, which after a sigmoid become the reference_points input of the next layer (see the refinement sketch after this list).
  • After all 6 layers, the output and reference_points of every layer are kept for the later loss computation.
  • Each layer's output B×7000×256 is reshaped to B×350×20×256 and averaged over the point dimension to get B×350×256, which goes through cls_branches (Linear+LayerNorm+ReLU+Linear+LayerNorm+ReLU+Linear) to get the classification result B×350×3 (there are only 3 classes); this too appears in the refinement sketch below. The code regenerates reference_points here, identical to the ones produced above; this is redundant and could be removed. The final point coordinates are B×7000×2 (i.e. B×350×20×2: 350 instances, 20 point coordinates per instance), from which the enclosing bounding boxes and the corresponding 20 point coordinates are produced.
  • Two auxiliary segmentation heads are used. The first takes bev_embed through seg_head (Conv2d+ReLU+Conv2d) to get a BEV semantic segmentation result of shape B×1×200×100. The second takes feat_flatten B×6×256×15×25 through pv_seg_head (Conv2d+ReLU+Conv2d) to get semantic segmentation results on the original 6 PV images, of shape B×6×1×15×25.
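
Referenced from the first two bullets above, a minimal sketch of the reference-point initialization and the feat_flatten assembly, assuming the shapes stated in the text (names like reference_points_fc are illustrative):

```python
import torch
import torch.nn as nn

B, num_queries, dim = 2, 7000, 256

# reference points: a Linear on query_pos, then sigmoid into [0, 1]
query_pos = torch.randn(B, num_queries, dim)
reference_points_fc = nn.Linear(dim, 2)
reference_points = reference_points_fc(query_pos).sigmoid()   # (B, 7000, 2)
init_reference_out = reference_points

# feat_flatten: per-camera features plus camera / level embeddings, flattened to tokens
feats = torch.randn(B, 6, dim, 15, 25)                        # img_neck output
cams_embeds = torch.randn(6, dim)
level_embeds = torch.randn(1, dim)
feat_flatten = feats.flatten(3).permute(1, 3, 0, 2)           # (6, 375, B, 256)
feat_flatten = feat_flatten + cams_embeds[:, None, None, :] + level_embeds[None, None, :, :]
```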
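
For the cross-attention bullet, a condensed single-level sketch of what CustomMSDeformableAttention computes: offsets and weights come from the query, sampling locations are reference_points plus normalized offsets, and features are gathered from the BEV value map by bilinear interpolation. This is a simplified reconstruction; the real implementation is the batched multi-scale kernel from Deformable DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, Nq, dim, heads, points = 2, 7000, 256, 8, 4
H, W = 200, 100                                     # BEV spatial shape
value = torch.randn(B, H * W, dim)                  # bev_embed
query = torch.randn(B, Nq, dim)
reference_points = torch.rand(B, Nq, 2)             # normalized (x, y)

value_proj = nn.Linear(dim, dim)
sampling_offsets = nn.Linear(dim, heads * points * 2)   # -> B x 7000 x 8 x 1 x 4 x 2 in the real code
attention_weights = nn.Linear(dim, heads * points)      # -> B x 7000 x 8 x 1 x 4 in the real code
output_proj = nn.Linear(dim, dim)

v = value_proj(value).view(B, H, W, heads, dim // heads)
offs = sampling_offsets(query).view(B, Nq, heads, points, 2)
w = attention_weights(query).view(B, Nq, heads, points).softmax(-1)

# sampling_locations = reference_points + offsets / spatial_shape
norm = torch.tensor([W, H], dtype=torch.float32)
loc = reference_points[:, :, None, None, :] + offs / norm      # (B, Nq, 8, 4, 2)

out = torch.zeros(B, Nq, heads, dim // heads)
for h in range(heads):
    vmap = v[..., h, :].permute(0, 3, 1, 2)                    # (B, C/h, H, W)
    grid = loc[:, :, h] * 2 - 1                                # grid_sample expects [-1, 1]
    sampled = F.grid_sample(vmap, grid, align_corners=False)   # (B, C/h, Nq, 4) bilinear
    out[:, :, h] = (sampled * w[:, :, h][:, None]).sum(-1).permute(0, 2, 1)

output = output_proj(out.flatten(2)) + query                   # residual: (B, 7000, 256)
```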
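
And for the refinement and classification bullets, a sketch of the inverse-sigmoid reference-point update and the point-averaged classification head, again a reconstruction under the shapes given above:

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

B, num_ins, num_pts, dim, num_cls = 2, 350, 20, 256, 3
output = torch.randn(B, num_ins * num_pts, dim)         # decoder layer output (B, 7000, 256)
reference_points = torch.rand(B, num_ins * num_pts, 2)

# reg_branches: predict an offset in inverse-sigmoid space, then squash back
reg_branch = nn.Sequential(
    nn.Linear(dim, dim), nn.ReLU(),
    nn.Linear(dim, dim), nn.ReLU(),
    nn.Linear(dim, 2),
)
offset = reg_branch(output)                             # (B, 7000, 2)
new_reference_points = (offset + inverse_sigmoid(reference_points)).sigmoid()

# cls_branches: classify per instance on the mean of its 20 point embeddings
cls_branch = nn.Sequential(
    nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU(),
    nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU(),
    nn.Linear(dim, num_cls),
)
ins_feat = output.view(B, num_ins, num_pts, dim).mean(2)   # (B, 350, 256)
cls_scores = cls_branch(ins_feat)                          # (B, 350, 3)
```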

2.4 Loss Computation

  • depth loss: computed following LSS; details to be added later.
  • Of the 350 output instances, 50 are one2one and 300 are one2many; the gt labels for the one2many group are correspondingly copied 6 times (see the sketch below).
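
A minimal sketch of the one2many target construction implied above (the repeat factor 6 matches 300 = 50×6; tensor names are illustrative):

```python
import torch

gt_labels = torch.tensor([0, 2, 1])          # N gt instances
gt_pts = torch.rand(3, 20, 2)                # N x 20 x 2 point sets

k = 6                                        # one2many repeat factor: 300 = 50 * 6
one2many_labels = gt_labels.repeat(k)        # (6N,)
one2many_pts = gt_pts.repeat(k, 1, 1)        # (6N, 20, 2)
```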

2.4.1 The maptr_assigner step

  • Taking the one2one branch as an example: the gt currently has three classes, lane divider, boundary, and pedestrian crossing. For the first two, both the forward and reversed point orders are added; pedestrian crossings are closed loops, so 19 instances are generated by cyclic shifting. For the first two classes, entries are padded with -1 when there are fewer than 19. The result is gt_shifts_pts_list of shape N×19×20×2 (N is the number of gt in one input).
  • The losses include cls_loss (focal loss), box_reg_loss (L1 loss), pts_loss (chamfer distance), and iou_loss (GIoU loss).
  • The special part is pts_loss: it is computed between the 50 instances and all 19 augmented gt variants, and the minimum over the 19 is taken as the final value.
  • The overall flow: compute all the matching costs, run Hungarian matching to pick 1-to-1 gt/pred pairs, then compute the final losses on the matched pairs (a simplified sketch follows below).
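
Putting the assigner together, a simplified sketch of the permutation-aware point cost and the Hungarian assignment. This is a reconstruction: the real code also adds the focal classification and bbox/GIoU cost terms, and the -1 padding handling is omitted here.

```python
import torch
from scipy.optimize import linear_sum_assignment

num_pred, num_gt, num_orders, num_pts = 50, 5, 19, 20
pred_pts = torch.rand(num_pred, num_pts, 2)
gt_shifts_pts = torch.rand(num_gt, num_orders, num_pts, 2)   # one gt_shifts_pts_list entry

# pairwise L1 point distance against every gt ordering: (num_pred, num_gt, 19)
dist = (pred_pts[:, None, None] - gt_shifts_pts[None]).abs().sum((-2, -1))
pts_cost, best_order = dist.min(-1)                          # keep the minimum over the 19 variants

row, col = linear_sum_assignment(pts_cost.numpy())           # 1-to-1 pred <-> gt assignment
# losses are then computed only on matched pairs, using gt ordering best_order[row, col]
```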
