单细胞转录组 —— STARsolo 原始数据处理

单细胞转录组 —— STARsolo 原始数据处理实战

前言

前面我们已经介绍了几种原始数据处理工具，最后再介绍一种多平台兼容的快速定量工具 —— STARsolo。

主要使用的还是 STAR 比对软件，只是增加了更多对单细胞数据的处理，不同平台数据的差异，也只是在参数设置上。

实战

软件可以直接用 conda 进行安装

conda create -n STAR -c bioconda -c conda-forge star seqtk

下载人类参考基因组

wget http://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz

或者小鼠参考基因组

wget ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-98/gtf/mus_musculus/Mus_musculus.GRCm38.98.gtf.gz

解压文件

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip gencode.v32.primary_assembly.annotation.gtf.gz
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.98.gtf.gz

构建索引

构建人类参考基因组索引

STAR \--runThreadN 20 \--runMode genomeGenerate \# 索引保存路径--genomeDir human \# fa 文件路径--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \# gtf文件路径--sjdbGTFfile gencode.v32.primary_assembly.annotation.gtf

类似地，可以构建小鼠参考基因组索引

STAR \--runThreadN 20 \--runMode genomeGenerate \# 索引保存路径--genomeDir mmu \# fa 文件路径--genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa \# gtf文件路径--sjdbGTFfile Mus_musculus.GRCm38.98.gtf

scRNA-seq

从 SRA 数据库中下载 PRJNA666791 项目的 10x scRNA-seq 数据进行分析

prefetch --option-file SRR_Acc_List.txt --output-directory sra

解压数据

ls sra/*/*.sra | xargs fastq-dump --split-files --gzip -O raw

10x barcode 文件可以在下载的 cellranger 软件中找到，例如

ls /path/to/cellranger-8.0.0/lib/python/cellranger/barcodes

参数配置，需要根据所使用数据的信息进行调整

```bash
# 白名单
BC=737K-august-2016.txt
# barcode 的长度
CBLEN=16
# UMI 的长度
UMILEN=10
# 文库方向
STRAND=Forward  # Forward/Reverse/UnstrandedFQDIR="$(pwd)/raw"
TAG="GSM4812353"
R1=""
R2=""
if [[ `find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_1\.f.*q"` != "" ]]
then R1=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_1\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`R2=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_2\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`
elif [[ `find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "R1\.f.*q"` != "" ]]
thenR1=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "R1\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`R2=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "R2\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`
elif [[ `find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_R1_.*\.f.*q"` != "" ]]
thenR1=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_R1_.*\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`R2=`find $FQDIR/* | grep -P "\/$TAG[\/\._]" | grep "_R2_.*\.f.*q" | sort | tr '\n' ',' | sed "s/,$//g"`
else >&2 echo "ERROR: No appropriate fastq files were found! Please check file formatting, and check if you have set the right FQDIR."exit 1
fi

比对和定量分析，注意指定文件夹时末尾需要带上路径分隔符（/），例如 output/

STAR --runThreadN 16 \--genomeDir mmu \--runDirPerm All_RWX \--readFilesCommand gunzip -c\--outSAMtype BAM Unsorted \  # 最好不要在这里排序--readFilesIn $R2 $R1 \--soloType CB_UMI_Simple \--soloCBwhitelist $BC --soloBarcodeReadLength 0 \--soloCBlen $CBLEN \--soloUMIstart $((CBLEN+1)) \--soloUMIlen $UMILEN \--soloStrand $STRAND \--soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \--soloUMIfiltering MultiGeneUMI_CR \--soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4 \--outFilterScoreMin 30 \--soloFeatures Gene GeneFull Velocyto \--soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx \--soloMultiMappers EM --outReadsUnmapped Fastx

如果出现类似的错误

EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: Expected bin size=2750819958 ; size on disk=310984704 ; bin number=48

不要设置排序选项 "outSAMtype": "BAM SortedByCoordinate"，可以跑完之后，再用 samtools sort

结果文件的结果如下，包含 Gene、GeneFull 和 Velocyto 三个文件夹，它们的文件结构是一模一样的。

inDrops

如果跑过前面的流程，可以使用前面下载好的数据。

GSE111672 数据集中包含 6 例原发性胰腺癌组织的单细胞 RNA 测序和空间转录组学，

我们随便选择一个样本 GSM3036909 下载原始数据

prefetch SRR6825055

解压

fastq-dump SRR6825055/SRR6825055.sra --split-files --gzip -O SRR6825055

其中 R1 长度为 35，R2 长度为 51，是 Indrop-seq 的 V2 版，与 V1 相反，所以需要调换文件顺序

barcode 文件可以从 inDrops barcode lists 中下载。

比对和定量

STAR --runThreadN 8 \--genomeDir human \--readFilesIn SRR6825055/SRR6825055_1.fastq.gz SRR6825055/SRR6825055_2.fastq.gz \--readFilesCommand gunzip -c --outReadsUnmapped None \--outSAMattrRGline ID:indrop SM:indrop \--runDirPerm All_RWX --outSAMtype BAM Unsorted \--soloFeatures Gene GeneFull \--soloCBwhitelist gel_barcode1_list.txt gel_barcode2_list.txt \--soloAdapterSequence GAGTGATTGCTTGTGACGCCTT \--soloType CB_UMI_Complex --soloAdapterMismatchesNmax 3 \--soloCBmatchWLtype 1MM --soloCBposition 0_0_2_-1 3_1_3_8 \--soloUMIposition 3_9_3_14 --outTmpDir /tmp/tmplm9atu4a/STARtmp \--outFileNamePrefix solo/indrop/ \--soloOutFileNames solo/ features.tsv barcodes.tsv matrix.mtx

Drop-seq

数据集中 GSE178612 研究的是 FoxM1 与 Rb 基因在小鼠乳腺癌中的相互作用。

我们选择其中一个样本 GSM5394388，下载其原始数据

prefetch SRR14872449

fastq-dump SRR14872449/SRR14872449.sra --split-files --gzip -O SRR14872449

比对和定量

STAR --runThreadN 8 \--genomeDir mmu \--readFilesIn SRR14872449/SRR14872449_2.fastq.gz SRR14872449/SRR14872449_1.fastq.gz \--readFilesCommand gunzip -c \--outReadsUnmapped None \--outSAMattrRGline ID:drop SM:drop --runDirPerm All_RWX \--outSAMtype BAM Unsorted \--soloFeatures Gene GeneFull \--soloType CB_UMI_Simple --soloCBwhitelist None -\-soloCBstart 1 --soloCBlen 12 \--soloUMIstart 13 --soloUMIlen 8 \--soloBarcodeReadLength 0 --outTmpDir /tmp/tmpkyaiuqwz/STARtmp \--outFileNamePrefix solo/drop/ \--soloOutFileNames solo/ features.tsv barcodes.tsv matrix.mtx

SMART-seq2

我们从研究人类皮质球体内星形胶质细胞的成熟数据 GSE99951 中，下载 GSM2665701 样本进行分析

prefetch SRR5676730

解压

fastq-dump SRR5676730/SRR5676730.sra --split-files --gzip -O SRR5676730

可以先对数据进行去接头处理，可以选择常用的 trimmomatic 软件

conda install -c bioconda trimmomatic

去接头

trimmomatic PE -threads 8 \SRR5676730/SRR5676730_1.fastq.gz SRR5676730/SRR5676730_2.fastq.gz \SRR5676730/SRR5676730_1.paired.fastq.gz SRR5676730/SRR5676730_1.unpaired.fastq.gz \SRR5676730/SRR5676730_2.paired.fastq.gz SRR5676730/SRR5676730_2.unpaired.fastq.gz \ILLUMINACLIP:/path/to/adapters/TruSeq3-PE-2.fa:2:30:10 \LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

表达定量

STAR --runThreadN 8 \--genomeDir human \--readFilesIn SRR5676730/SRR5676730_1.paired.fastq.gz SRR5676730/SRR5676730_2.paired.fastq.gz \--readFilesCommand gunzip \-c --outReadsUnmapped None \--outSAMattrRGline ID:smart2 SM:smart2 \--runDirPerm All_RWX --outSAMtype BAM Unsorted \--soloFeatures Gene GeneFull \--soloType SmartSeq --soloUMIdedup Exact \--soloStrand Unstranded --limitOutSJcollapsed 10000000 \--soloCellFilter None \--outTmpDir /tmp/tmp8n8c3_75/STARtmp \--outFileNamePrefix solo/smart2/ \--soloOutFileNames solo/ features.tsv barcodes.tsv matrix.mtx

CEL-Seq2

使用出生后第 4 天小鼠结节神经节胶质细胞单细胞水平的基因数据集 GSE237947。

选择样本 GSM7656793，下载

prefetch SRR25386888

解压数据

fastq-dump SRR25386888/SRR25386888.sra --split-files --gzip -O SRR25386888

定量分析

STAR --runThreadN 8 \--genomeDir mmu \--readFilesIn SRR25386888/SRR25386888_2.fastq.gz SRR25386888/SRR25386888_1.fastq.gz \--readFilesCommand gunzip -c \--outReadsUnmapped None --runDirPerm All_RWX \--outSAMattrRGline ID:cel SM:cel \--soloBarcodeMate 2 \  # 指定 barcode 在哪个序列上--outSAMtype BAM Unsorted \--soloFeatures Gene GeneFull \--soloCBwhitelist None --soloType CB_UMI_Simple \--clip5pNbases 12 0 \--soloUMIstart 1 --soloUMIlen 6 \--soloCBstart 7 --soloCBlen 6 \--soloUMIdedup Exact \--outTmpDir /tmp/tmp4fqrnsfm/STARtmp \--outFileNamePrefix solo/cel/ \--soloOutFileNames solo/ features.tsv barcodes.tsv matrix.mtx

BD Rhapsody

GSE241462 收集了小鼠胚胎发育几个阶段种约 200~250 个囊胚和 50 个 E5.5 阶段的胚胎。

选择 GSM7729310 样本进行分析

prefetch SRR25734060

解压数据

fastq-dump SRR25734060/SRR25734060.sra --split-files --gzip -O SRR25734060

定量分析

STAR --runThreadN 8 \ --genomeDir mmu \--readFilesIn SRR25734060/SRR25734060_2.fastq.gz SRR25734060/SRR25734060_1.fastq.gz \--readFilesCommand gunzip \-c --outReadsUnmapped None \--outSAMattrRGline ID:bd_rhapsody SM:bd_rhapsody \--runDirPerm All_RWX --outSAMtype BAM Unsorted \--soloFeatures Gene GeneFull \--soloCBwhitelist bd_rhapsody_barcode1.txt bd_rhapsody_barcode2.txt bd_rhapsody_barcode3.txt \--soloType CB_UMI_Complex \--soloUMIlen 8 --soloCBmatchWLtype 1MM \--soloCBposition 0_0_0_8 0_21_0_29 0_43_0_51 \--soloUMIposition 0_52_0_59 \--outTmpDir /tmp/tmpsdqwfkpz/STARtmp \--outFileNamePrefix solo/bd_rhapsody/ \--soloOutFileNames solo/ features.tsv barcodes.tsv matrix.mtx