snp关联分析
概念
SNP:不同个体DNA序列上的单个碱基差异称为单核苷酸多态性(SNP)
最小等位基因频率(MAF):某个研究群体中出现较少的等位频率称为最小等位频率,以此将SNP分为常见和罕见。
链锁块:染色体上距离越近的SNP越倾向于以一个整体遗传给后代,这样位于某一区域的一组相互关联的SNP成为一个链锁块
链锁不平衡:指相邻基因位点上等位基因的非随机性相关 某一位点上的特定等位基因与同一条染色体另一基因位点上的某等位基因同时出现的几率大于人群中因随机分布而使两等位基因同时出现的几率。
关联研究通过检验某个特定的等位在疾病组和对照组中出现的频率差异来判断此等位是否是疾病的易感等位,发现与该等位呈链锁不平衡的致病基因
拷贝数变异(CNV):指基因组拷贝数在不同个体间存在差异的现象,主要包括缺失、插入和重复等,长度较长,100nt以上,可以覆盖整条染色体臂或者整条染色体,只有1%左右的插入和缺失超过100nt,但还是会对表型产生较大的影响
indel:缺失和插入长度在11-100nt之间的这些短片段的拷贝数变化被称为插入/缺失变异。
简单重复序列(1-220bp)在基因组上大量存在,并且易出现变异,一段连续的重复序列成为一个重复排列,100bp以下的称为==微卫星DNA==,又称短串联重复多态性,常分布于常染色质。
结构变异:倒置、易位
调控区的突变可以干扰DNA结合蛋白的结合能力,从而影响基因的表达水平,影响基因表达水平的突变称为表达定量性状位点(eQTL)。
位于外显子/内含子交接处、或者剪接相关的位点发生突变,可能导致剪接过程的改变,从而产生不同的可变剪接过程,称为剪接位点。
单倍体不足:多数的功能丢失通突变是隐性性状,但是机体不能承受杂合子中约50%蛋白质活性的下降。
ped 和 map文件详解
ped
ped 文件必须accompanied by a map file
ped文件没有header line,每行有6+2v fileds.V是SNP的数目。
前6行:
-
Family ID (‘FID’)
-
Within-family ID (sample ID) (‘IID’; cannot be ‘0’)
-
Within-family ID of father (Paternal ID)(‘0’ if father isn’t in dataset)
- Within-family ID of mother (Maternal ID)(‘0’ if mother isn’t in dataset)
- Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)
-
Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/’0’/non-numeric = missing data if case/control)
- 第7th 8th 是第一个SNP的alleles. 9th, 10th 是第二个SNP的alleles. 以此类推。。。 0 0代表missing data
map
map文件没有header
Each line corresponds to a SNP.
one line per SNP with 4 fields
- Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
- Variant identifier
- Position in morgans or centimorgans (optional; also safe to use dummy value of ‘0’)
- Base-pair coordinate
snp plink分析
Making a binary PED file
The first thing we will do is to make a binary PED file. This more compact representation of the data saves space and speeds up subsequent analysis. To make a binary PED file, use the following command.
plink --file hapmap1 --make-bed --out hapmap1
-
When using the –make-bed option, the threshold filters for missing rates and allele frequency were automatically set to exclude nobody. Although these filters can be specified manually (using –mind, –geno and –maf) to exclude people, this default tends to be wanted when creating a new PED or binary PED file. The commands –extract / –exclude and –keep / –remove can also be applied at this stage.
- Family存在fam格式中
- 保留、删除样本 plink –keep/remove mylist.txt
- 保留删除SNP –extract –exclude
- 提取位于特定位置 chr5:1253287-1295162(某一范围)
提取位于某个list 下(提取与TERT基因)
awk '{print $2}' demo_tmp.map >demo_tmp.list
转二进制
/home/courses/cw/software/plink/plink --file demo_tmp --make-bed --out demo_tmp
awk '$6==2{print $1,$2}' demo_tmp.fam > demo_tmp_ID
#
/home/courses/cw/software/plink/plink --bfile demo --extract demo_tmp.list --keep demo_tmp_ID --recode --out demo_tmp1
提取EGFR、ERBB3等基因上的遗传变异
EGFR chr7:55086714-55324313
/home/courses/cw/software/plink/plink --bfile demo --chr 7 --from-bp 55086714 --to-bp 55324313 --out demo_tmp_2 --recode
/home/courses/cw/software/plink/plink --file demo_tmp --merge demo_tmp_2.ped demo_tmp_2.map --out demo_tmp3 --recode
ln -s /home/courses/cw/data_toy/plink/demo_* ./
paste <(ls demo_*.ped | grep -v tmp) <(ls demo_*.map | grep -v tmp) | awk 'NR>1' > merge.list
/home/courses/cw/software/plink/plink --file merge.list --out demo_tmp3 –make-bed
/home/courses/cw/software/plink/plink --file demo_10 --merge-list merge.list --out demo_tmp4 --make-bed
计算样本亲属关系
Step1
stu14230117@bioinfo:~/plink$ ln -s /home/courses/cw/data_toy/plink/all1.bed ./
stu14230117@bioinfo:~/plink$ ln -s /home/courses/cw/data_toy/plink/all1.bim ./
stu14230117@bioinfo:~/plink$ ln -s /home/courses/cw/data_toy/plink/all1.fam ./
/home/courses/cw/software/plink/plink --bfile all1 --check-sex --out all_sex
awk '$5!="OK"{print $1,$2}' all_sex.sexcheck > all1.ID
/home/courses/cw/software/plink/plink --bfile all1 --remove all1.ID --out all2 --make-bed
Step2质量控制
awk '{if ($1<1||$1>22) print $2}' all2.bim > SNP_noauto.list
/home/courses/cw/software/plink/plink --bfile all2 --exclude SNP_noauto.list --out all3 --make-bed
Step3:删除不合格的SNP
/home/courses/cw/software/plink/plink --bfile all3 --geno 0.05 --maf 0.05 --hwe 0.001 --out all4 --make-bed
Step4 #删除不合格的样本
/home/courses/cw/software/plink/plink --bfile all4 --mind 0.05 --make-bed --out all5
Step5:亲缘关系估计
/home/courses/cw/software/plink/plink --bfile all5 --exclude high-LD-regions.txt --range --genome --min 0.25 --out all5_gen
/home/courses/cw/software/plink/plink --bfile all5 --missing --out all5_mis
perl run-IBD-QC.pl all5
/home/courses/cw/software/plink/plink --bfile all5 --remove fail-IBD-QC.txt --make-bed --out all6
Step6:将亲属关联起来
ln -s /home/courses/cw/data_toy/plink/high-LD-regio.txt
/home/courses/cw/software/plink/plink --bfile all6 --logistic --ci 0.95 --out all6_all
Working with the binary PED file
plink --bfile hapmap1
Summary statistics: missing rates
Next, we shall generate some simple summary statistics on rates of missing data in the file, using the –missing option:
plink --bfile hapmap1 --missing --out miss stat