image.png
wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz grep -v "#" Homo_sapiens.GRCh38.87.chr.gtf.gz >hg38.gtf.gz gunzip hg38.gtf.gz
做题之前我们先看看文件的内容是什么
image.png
gtf有9列
1.染色体名
2.注释信息的来源,比如”Genescan”、”Genbank”
等,可以为空,为空用”.”点号代替
3.注释信息的类型,比如Gene、cDNA、mRNA等,或者是SO对应的编号
4、5.开始和结束位置
6.得分,数字,是注释信息可能性的说明,可以是序列相似性比对时的E-values值或者基因预测是的P-values值。”.”表示为空。
7.序列的方向,
+表示正义链, -反义链 , ? 表示未知
8.阅读框:有数字0、1和2。0代表序列的第一个碱基为密码子的第一个碱基,1代表是密码子第二个,2代表第三个。
9.以多个键值对组成的注释信息描述,键与值之间用”=“;
以上图内容简单的理解一下
红线圈出来的范围一个gene信息,范围是11869-14409bp
在这个范围内有两个转录本(黄色部分):
transcript1 来自基因的11869-14409
这个范围包括3个exon
transcript2来自基因的12010-13670
这个范围包括6个exon
那么本次要做的事情就主要包括一下几个方面吧
1.统计各个染色体基因数量的分布,包括总的基因和各中类型基因数量的分布
2.统计一下基因转录本数量的分布
就这2个吧,依然是分为python,R,shell三种编程方法
1.python
1.统计各个染色体基因数量的分布,包括总的基因和各中类型基因数量的分布
给出我写的脚本
import sysimport collections args=sys.argv filename=args[1] count_gene=collections.OrderedDict()with open (filename) as fh: for line in fh: lineL=line.strip().split("\t") chr_num="chr"+lineL[0] type_info=lineL[2] if type_info=="gene": All_gene="all"+"_"+type_info if chr_num not in count_gene: count_gene[chr_num]={} count_gene[chr_num][All_gene]=0 count_gene[chr_num][All_gene]+=1 #计算all gene个数 descrip_info=lineL[8] descrip_info=descrip_info[:-1] #主要是有的biotype就是最后一部分了,这时候后面得到的b_type包含“;”,所以一开始就把最后的";"去掉 descrip_infoL=descrip_info.split("; ") biotype=descrip_infoL[4] biotypeL=biotype.split(" ") b_type=biotypeL[1] b_type=eval(b_type) if b_type not in count_gene[chr_num]: count_gene[chr_num][b_type]=0 count_gene[chr_num][b_type]+=1for chr_num,countAll in count_gene.items(): for k,v in countAll.items(): print(chr_num,k,v)
结果如下
chr1 protein_coding 2052 chr1 TEC 28 chr1 polymorphic_pseudogene 6 chr1 snRNA 219 chr1 antisense 505 chr1 pseudogene 5 chr1 sense_intronic 66 chr1 processed_transcript 31 chr1 all_gene 5194 chr1 sense_overlapping 17 chr1 processed_pseudogene 897 chr1 rRNA 69 chr1 snoRNA 68 chr1 miRNA 126 chr1 lincRNA 576 chr1 IG_V_pseudogene 1 chr1 unprocessed_pseudogene 206 chr1 transcribed_unitary_pseudogene 4 chr1 unitary_pseudogene 26 chr1 scaRNA 14 chr1 misc_RNA 192 chr1 3prime_overlapping_ncRNA 4 chr1 transcribed_processed_pseudogene 33 chr1 bidirectional_promoter_lncRNA 1 chr1 transcribed_unprocessed_pseudogene 48 chr2 protein_coding 1255 chr2 IG_C_gene 1 chr2 TEC 51 chr2 polymorphic_pseudogene 2 chr2 snRNA 157 chr2 antisense 364 chr2 IG_V_gene 46 chr2 pseudogene 1 chr2 transcribed_unitary_pseudogene 3 chr2 processed_transcript 42 chr2 miRNA 105 chr2 sense_overlapping 10 chr2 processed_pseudogene 751 chr2 rRNA 41 chr2 snoRNA 64 chr2 transcribed_unprocessed_pseudogene 39 chr2 all_gene 3971 chr2 lincRNA 577 chr2 IG_J_gene 5 chr2 IG_V_pseudogene 45 chr2 unprocessed_pseudogene 155 chr2 sense_intronic 37 chr2 unitary_pseudogene 7 chr2 scaRNA 7 chr2 misc_RNA 176 chr2 3prime_overlapping_ncRNA 7 chr2 scRNA 1 chr2 transcribed_processed_pseudogene 22 chr3 protein_coding 1076 chr3 TEC 26 chr3 polymorphic_pseudogene 2 chr3 snRNA 133 chr3 antisense 299 chr3 transcribed_unitary_pseudogene 9 chr3 processed_transcript 22 chr3 miRNA 77 chr3 sense_overlapping 7 chr3 processed_pseudogene 616 chr3 rRNA 29 chr3 snoRNA 56 chr3 sRNA 1 chr3 all_gene 3010 chr3 lincRNA 353 chr3 unprocessed_pseudogene 87 chr3 sense_intronic 29 chr3 unitary_pseudogene 9 chr3 scaRNA 2 chr3 misc_RNA 134 chr3 3prime_overlapping_ncRNA 1 chr3 transcribed_processed_pseudogene 19 chr3 transcribed_unprocessed_pseudogene 23 chr4 protein_coding 751 chr4 TEC 37 chr4 polymorphic_pseudogene 1 chr4 snRNA 116 chr4 antisense 207 chr4 pseudogene 1 chr4 sense_intronic 18 chr4 processed_transcript 16 chr4 miRNA 57 chr4 sense_overlapping 9 chr4 processed_pseudogene 561 chr4 ribozyme 1 chr4 rRNA 24 chr4 snoRNA 30 chr4 all_gene 2505 chr4 lincRNA 405 chr4 unprocessed_pseudogene 113 chr4 transcribed_unitary_pseudogene 2 chr4 unitary_pseudogene 2 chr4 scaRNA 1 chr4 misc_RNA 104 chr4 transcribed_processed_pseudogene 31 chr4 bidirectional_promoter_lncRNA 1 chr4 transcribed_unprocessed_pseudogene 17 chr5 protein_coding 876 chr5 all_gene 2868 chr5 lincRNA 490 chr5 rRNA 25 chr5 miRNA 66 chr5 unprocessed_pseudogene 57 chr5 transcribed_unitary_pseudogene 4 chr5 sense_intronic 24 chr5 unitary_pseudogene 4 chr5 snRNA 103 chr5 transcribed_unprocessed_pseudogene 22 chr5 processed_transcript 23 chr5 misc_RNA 119 chr5 TEC 73 chr5 sense_overlapping 10 chr5 transcribed_processed_pseudogene 9 chr5 processed_pseudogene 625 chr5 vaultRNA 1 chr5 antisense 297 chr5 snoRNA 40 chr6 protein_coding 1045 chr6 TEC 36 chr6 polymorphic_pseudogene 3 chr6 snRNA 107 chr6 antisense 238 chr6 non_coding 1 chr6 transcribed_unitary_pseudogene 2 chr6 processed_transcript 8 chr6 miRNA 60 chr6 sense_overlapping 7 chr6 processed_pseudogene 639 chr6 rRNA 27 chr6 snoRNA 39 chr6 all_gene 2863 chr6 lincRNA 360 chr6 unprocessed_pseudogene 105 chr6 sense_intronic 23 chr6 unitary_pseudogene 10 chr6 scaRNA 1 chr6 misc_RNA 105 chr6 3prime_overlapping_ncRNA 2 chr6 transcribed_processed_pseudogene 23 chr6 transcribed_unprocessed_pseudogene 22 chr7 protein_coding 906 chr7 TEC 37 chr7 polymorphic_pseudogene 2 chr7 snRNA 84 chr7 antisense 252 chr7 pseudogene 3 chr7 transcribed_unprocessed_pseudogene 71 chr7 processed_transcript 40 chr7 miRNA 60 chr7 sense_overlapping 12 chr7 processed_pseudogene 577 chr7 TR_D_gene 1 chr7 rRNA 24 chr7 snoRNA 39 chr7 sRNA 1 chr7 all_gene 2867 chr7 lincRNA 286 chr7 unprocessed_pseudogene 189 chr7 sense_intronic 15 chr7 unitary_pseudogene 13 chr7 TR_J_gene 19 chr7 TR_V_pseudogene 17 chr7 TR_C_gene 4 chr7 transcribed_unitary_pseudogene 2 chr7 misc_RNA 143 chr7 TR_V_gene 57 chr7 transcribed_processed_pseudogene 13 chrX protein_coding 841 chrX pseudogene 3 chrX polymorphic_pseudogene 1 chrX snRNA 84 chrX misc_RNA 100 chrX transcribed_unitary_pseudogene 2 chrX processed_transcript 10 chrX TEC 14 chrX sense_overlapping 6 chrX processed_pseudogene 723 chrX rRNA 22 chrX snoRNA 46 chrX all_gene 2359 chrX lincRNA 136 chrX miRNA 105 chrX unprocessed_pseudogene 108 chrX sense_intronic 17 chrX unitary_pseudogene 3 chrX scaRNA 1 chrX antisense 101 chrX 3prime_overlapping_ncRNA 1 chrX transcribed_processed_pseudogene 12 chrX transcribed_unprocessed_pseudogene 23 chr8 protein_coding 676 chr8 TEC 32 chr8 IG_V_pseudogene 1 chr8 snRNA 84 chr8 antisense 253 chr8 transcribed_unitary_pseudogene 5 chr8 processed_transcript 16 chr8 miRNA 68 chr8 sense_overlapping 8 chr8 processed_pseudogene 468 chr8 rRNA 29 chr8 polymorphic_pseudogene 1 chr8 all_gene 2353 chr8 lincRNA 412 chr8 unprocessed_pseudogene 114 chr8 sense_intronic 45 chr8 unitary_pseudogene 4 chr8 misc_RNA 82 chr8 3prime_overlapping_ncRNA 1 chr8 transcribed_processed_pseudogene 14 chr8 snoRNA 33 chr8 transcribed_unprocessed_pseudogene 7 chr9 protein_coding 781 chr9 TEC 18 chr9 polymorphic_pseudogene 2 chr9 snRNA 65 chr9 misc_RNA 96 chr9 IG_C_pseudogene 1 chr9 pseudogene 1 chr9 transcribed_unprocessed_pseudogene 28 chr9 processed_transcript 17 chr9 miRNA 74 chr9 sense_overlapping 9 chr9 processed_pseudogene 461 chr9 ribozyme 2 chr9 rRNA 19 chr9 snoRNA 31 chr9 all_gene 2242 chr9 lincRNA 275 chr9 IG_V_pseudogene 4 chr9 unprocessed_pseudogene 132 chr9 sense_intronic 22 chr9 unitary_pseudogene 5 chr9 TR_V_pseudogene 4 chr9 scaRNA 1 chr9 transcribed_unitary_pseudogene 2 chr9 antisense 165 chr9 3prime_overlapping_ncRNA 1 chr9 transcribed_processed_pseudogene 23 chr9 TR_V_gene 3 chr11 protein_coding 1276 chr11 TEC 76 chr11 polymorphic_pseudogene 22 chr11 snRNA 71 chr11 antisense 326 chr11 transcribed_unprocessed_pseudogene 30 chr11 processed_transcript 20 chr11 pseudogene 2 chr11 sense_overlapping 17 chr11 all_gene 3235 chr11 processed_pseudogene 473 chr11 ribozyme 1 chr11 rRNA 24 chr11 snoRNA 52 chr11 sRNA 2 chr11 miRNA 81 chr11 lincRNA 305 chr11 unprocessed_pseudogene 279 chr11 sense_intronic 40 chr11 unitary_pseudogene 20 chr11 scaRNA 3 chr11 transcribed_unitary_pseudogene 3 chr11 misc_RNA 97 chr11 3prime_overlapping_ncRNA 1 chr11 transcribed_processed_pseudogene 14 chr10 protein_coding 732 chr10 pseudogene 1 chr10 IG_V_pseudogene 1 chr10 snRNA 86 chr10 misc_RNA 89 chr10 transcribed_unprocessed_pseudogene 39 chr10 processed_transcript 29 chr10 TEC 31 chr10 sense_overlapping 9 chr10 processed_pseudogene 422 chr10 ribozyme 2 chr10 rRNA 29 chr10 snoRNA 28 chr10 polymorphic_pseudogene 1 chr10 all_gene 2204 chr10 lincRNA 280 chr10 miRNA 61 chr10 unprocessed_pseudogene 85 chr10 sense_intronic 33 chr10 unitary_pseudogene 5 chr10 transcribed_unitary_pseudogene 2 chr10 antisense 226 chr10 transcribed_processed_pseudogene 13 chr12 protein_coding 1034 chr12 TEC 99 chr12 snRNA 99 chr12 antisense 304 chr12 non_coding 1 chr12 transcribed_unprocessed_pseudogene 18 chr12 processed_transcript 22 chr12 miRNA 62 chr12 sense_overlapping 5 chr12 processed_pseudogene 490 chr12 rRNA 27 chr12 snoRNA 37 chr12 all_gene 2940 chr12 lincRNA 438 chr12 unprocessed_pseudogene 69 chr12 sense_intronic 75 chr12 unitary_pseudogene 6 chr12 scaRNA 2 chr12 transcribed_unitary_pseudogene 5 chr12 misc_RNA 115 chr12 3prime_overlapping_ncRNA 2 chr12 macro_lncRNA 1 chr12 transcribed_processed_pseudogene 29 chr13 protein_coding 327 chr13 unprocessed_pseudogene 52 chr13 all_gene 1304 chr13 lincRNA 241 chr13 snRNA 41 chr13 miRNA 31 chr13 antisense 105 chr13 sense_intronic 39 chr13 unitary_pseudogene 1 chr13 transcribed_unitary_pseudogene 1 chr13 transcribed_unprocessed_pseudogene 15 chr13 processed_transcript 12 chr13 misc_RNA 75 chr13 TEC 29 chr13 snoRNA 17 chr13 processed_pseudogene 295 chr13 rRNA 15 chr13 transcribed_processed_pseudogene 8 chr14 protein_coding 623 chr14 IG_C_gene 9 chr14 TEC 20 chr14 polymorphic_pseudogene 3 chr14 TR_J_pseudogene 4 chr14 snRNA 64 chr14 IG_pseudogene 1 chr14 antisense 179 chr14 IG_V_gene 49 chr14 IG_C_pseudogene 2 chr14 non_coding 1 chr14 pseudogene 1 chr14 transcribed_unprocessed_pseudogene 14 chr14 processed_transcript 21 chr14 miRNA 91 chr14 ribozyme 2 chr14 IG_J_pseudogene 3 chr14 processed_pseudogene 324 chr14 TR_D_gene 3 chr14 rRNA 10 chr14 snoRNA 72 chr14 bidirectional_promoter_lncRNA 1 chr14 all_gene 2224 chr14 lincRNA 297 chr14 IG_J_gene 6 chr14 IG_V_pseudogene 84 chr14 unprocessed_pseudogene 47 chr14 sense_intronic 21 chr14 unitary_pseudogene 2 chr14 TR_J_gene 60 chr14 TR_V_pseudogene 9 chr14 TR_C_gene 2 chr14 scaRNA 2 chr14 transcribed_unitary_pseudogene 2 chr14 misc_RNA 79 chr14 IG_D_gene 27 chr14 transcribed_processed_pseudogene 30 chr14 sense_overlapping 11 chr14 TR_V_gene 48 chr15 protein_coding 597 chr15 TEC 47 chr15 IG_V_pseudogene 6 chr15 snRNA 64 chr15 misc_RNA 93 chr15 IG_V_gene 6 chr15 pseudogene 1 chr15 sense_intronic 79 chr15 processed_transcript 25 chr15 all_gene 2152 chr15 sense_overlapping 8 chr15 processed_pseudogene 287 chr15 rRNA 12 chr15 snoRNA 114 chr15 miRNA 57 chr15 lincRNA 300 chr15 unprocessed_pseudogene 118 chr15 IG_D_gene 10 chr15 unitary_pseudogene 1 chr15 scaRNA 3 chr15 transcribed_unitary_pseudogene 3 chr15 antisense 225 chr15 3prime_overlapping_ncRNA 2 chr15 transcribed_processed_pseudogene 29 chr15 transcribed_unprocessed_pseudogene 65 chr16 protein_coding 866 chr16 TEC 136 chr16 IG_V_pseudogene 8 chr16 snRNA 52 chr16 antisense 307 chr16 IG_V_gene 6 chr16 pseudogene 1 chr16 transcribed_unprocessed_pseudogene 53 chr16 processed_transcript 31 chr16 all_gene 2511 chr16 sense_overlapping 12 chr16 processed_pseudogene 252 chr16 rRNA 32 chr16 snoRNA 33 chr16 polymorphic_pseudogene 1 chr16 miRNA 69 chr16 lincRNA 360 chr16 unprocessed_pseudogene 112 chr16 sense_intronic 83 chr16 unitary_pseudogene 12 chr16 scaRNA 1 chr16 transcribed_unitary_pseudogene 2 chr16 misc_RNA 51 chr16 3prime_overlapping_ncRNA 5 chr16 bidirectional_promoter_lncRNA 1 chr16 transcribed_processed_pseudogene 25 chr17 transcribed_unprocessed_pseudogene 64 chr17 protein_coding 1197 chr17 unprocessed_pseudogene 105 chr17 TEC 99 chr17 lincRNA 338 chr17 snRNA 86 chr17 miRNA 75 chr17 misc_RNA 99 chr17 transcribed_unitary_pseudogene 3 chr17 unitary_pseudogene 6 chr17 scaRNA 5 chr17 sense_intronic 69 chr17 processed_transcript 48 chr17 antisense 376 chr17 all_gene 2995 chr17 sense_overlapping 3 chr17 transcribed_processed_pseudogene 43 chr17 processed_pseudogene 310 chr17 rRNA 17 chr17 snoRNA 52 chr18 protein_coding 270 chr18 TEC 50 chr18 snRNA 46 chr18 misc_RNA 41 chr18 IG_C_pseudogene 1 chr18 transcribed_unitary_pseudogene 1 chr18 processed_transcript 11 chr18 miRNA 30 chr18 sense_overlapping 1 chr18 processed_pseudogene 207 chr18 rRNA 13 chr18 snoRNA 17 chr18 sRNA 1 chr18 all_gene 1170 chr18 lincRNA 246 chr18 unprocessed_pseudogene 14 chr18 sense_intronic 49 chr18 unitary_pseudogene 3 chr18 scaRNA 2 chr18 antisense 146 chr18 transcribed_processed_pseudogene 14 chr18 transcribed_unprocessed_pseudogene 7 chr20 protein_coding 544 chr20 all_gene 1386 chr20 lincRNA 214 chr20 snRNA 46 chr20 miRNA 39 chr20 unprocessed_pseudogene 32 chr20 sense_intronic 20 chr20 unitary_pseudogene 5 chr20 transcribed_unitary_pseudogene 3 chr20 scaRNA 1 chr20 transcribed_unprocessed_pseudogene 8 chr20 processed_transcript 9 chr20 misc_RNA 68 chr20 TEC 10 chr20 antisense 138 chr20 transcribed_processed_pseudogene 3 chr20 processed_pseudogene 198 chr20 sense_overlapping 3 chr20 rRNA 20 chr20 snoRNA 25 chr19 protein_coding 1470 chr19 TEC 74 chr19 lincRNA 236 chr19 snRNA 29 chr19 miRNA 115 chr19 unprocessed_pseudogene 159 chr19 polymorphic_pseudogene 2 chr19 sense_intronic 55 chr19 misc_RNA 61 chr19 unitary_pseudogene 7 chr19 transcribed_unprocessed_pseudogene 53 chr19 processed_transcript 36 chr19 antisense 293 chr19 3prime_overlapping_ncRNA 2 chr19 all_gene 2926 chr19 sense_overlapping 6 chr19 snoRNA 22 chr19 processed_pseudogene 263 chr19 rRNA 13 chr19 transcribed_processed_pseudogene 30 chrY protein_coding 53 chrY unprocessed_pseudogene 216 chrY all_gene 523 chrY lincRNA 52 chrY snRNA 17 chrY antisense 5 chrY sense_intronic 1 chrY transcribed_unprocessed_pseudogene 27 chrY processed_transcript 1 chrY misc_RNA 7 chrY snoRNA 3 chrY processed_pseudogene 130 chrY rRNA 7 chrY transcribed_processed_pseudogene 4 chr22 protein_coding 438 chr22 IG_C_gene 4 chr22 TEC 9 chr22 IG_V_pseudogene 43 chr22 snRNA 26 chr22 antisense 134 chr22 IG_V_gene 37 chr22 IG_C_pseudogene 5 chr22 transcribed_unprocessed_pseudogene 36 chr22 processed_transcript 18 chr22 miRNA 32 chr22 sense_overlapping 9 chr22 processed_pseudogene 156 chr22 rRNA 6 chr22 snoRNA 11 chr22 polymorphic_pseudogene 2 chr22 all_gene 1318 chr22 lincRNA 165 chr22 IG_J_gene 7 chr22 unprocessed_pseudogene 75 chr22 sense_intronic 31 chr22 unitary_pseudogene 3 chr22 scaRNA 3 chr22 misc_RNA 62 chr22 transcribed_processed_pseudogene 6 chr21 protein_coding 233 chr21 all_gene 835 chr21 lincRNA 191 chr21 snRNA 21 chr21 miRNA 26 chr21 unprocessed_pseudogene 32 chr21 IG_V_gene 1 chr21 sense_intronic 17 chr21 pseudogene 1 chr21 transcribed_unprocessed_pseudogene 6 chr21 processed_transcript 7 chr21 misc_RNA 24 chr21 3prime_overlapping_ncRNA 1 chr21 TEC 16 chr21 antisense 81 chr21 transcribed_processed_pseudogene 3 chr21 processed_pseudogene 143 chr21 sense_overlapping 8 chr21 rRNA 9 chr21 snoRNA 15 chrMT Mt_rRNA 2 chrMT protein_coding 13 chrMT Mt_tRNA 22 chrMT all_gene 37
可视化一下吧
因为all_gene值太高,所以把它单独拿出来
grep "all_gene" count_all.txt >count_gene.txt
library(ggplot2) data <- read.table("C:/Users/pc/Desktop/count_gene.txt") chr_name <- data$V1type <- data$V2 number <- data$V3p <- ggplot(data,aes(x=chr_name,y=number))+theme_bw()+theme(axis.text.x=element_text(angle=90,size =8,vjust = 0.5,hjust = 0.5),panel.grid = element_blank())+geom_bar(aes(fill=chr_name),stat = "identity",position="dodge",width=0.8)
image.png
所有类型基因的可视化如下
看一下一号染色体:
grep "chr1 " count_else.txt >count_else.chr1.txt
image.png
编码蛋白的基因数量还是最多的
最后show一下怎么把多个染色体的图堆在一起,用到ggplot2中的facet
grep "chr2" count_else.txt >count_else.chr2_all.txt#取出2,20,22号染色体
image.png
这里只是想展示一下这种画图方法,但是如果把所有染色体都放上就压缩了,图看不清楚。
2.统计一下基因转录本数量的分布
脚本如下:
import sysimport collections args=sys.argv filename=args[1] count_transcript=collections.OrderedDict()with open (filename) as fh: for line in fh: lineL=line.strip().split("\t") chr_num="chr"+lineL[0] type_info=lineL[2] if type_info=="transcript": if chr_num not in count_transcript: count_transcript[chr_num]={} descrip_info=lineL[8] descrip_infoL=descrip_info.split("; ") gene_ID=descrip_infoL[0] gene_IDL=gene_ID.split(" ") gene_ID=gene_IDL[1] gene_ID=eval(gene_ID) if gene_ID not in count_transcript[chr_num]: count_transcript[chr_num][gene_ID]=0 count_transcript[chr_num][gene_ID]+=1for chr_num,countAll in count_transcript.items(): for k,v in countAll.items():
作者:天秤座的机器狗
链接:https://www.jianshu.com/p/5ffc2b487109
共同学习,写下你的评论
评论加载中...
作者其他优质文章