CNV

baseq-CNV is a toolkit for inferring and visualizing copy number from high-throughput DNA sequencing data. It is designed for whole-genome sequencing data from both bulk and single-cell experiments.

Copy number is inferred from the read counts per genomic region (bin). The regions are predefined so that low-complexity parts of the genome are excluded or discounted. The workflow has four steps:

  • Read alignment using Bowtie2
  • Bin counting of uniquely mapped reads
  • Normalization by GC content
  • CBS to reduce noise

Result

Name                        Description
sample.bowtie.bam           Aligned BAM file
sample.bin.counts.txt       Read counts for each bin in the dynamic_bin_file
sample.CNV_plot_[size].png  CNV plot for each bin size
sample.GC.png               GC content data

Pipeline

Run the complete pipeline with a single command:

baseq-CNV run_pipeline ./Tn5_S1.fq.gz -g hg19

Alignment

Reads are aligned with Bowtie2.

baseq-CNV align -1 Tn5_S1.fq.gz -r 4000000 -g hg19 -t 10

BinCounting

According to dynamicbin … The command is:

baseq-CNV bincount -g hg19 -i ./sample.bam -o ./bincounts.txt

Normalize

Normalize the raw read counts by bin length and GC content.

baseq-CNV normalize -g hg19 -i ./bincounts.txt -o bincounts_norm.txt

CBS

Segment the normalized bin counts with CBS (circular binary segmentation).

baseq-CNV cbs -i ./bincounts_norm.txt  -o ./out.file
http://p8v379qr8.bkt.clouddn.com/CNV_normalize.png

Plot

Plot genomic…

baseq-CNV plotgenome -i ./bincounts_norm.txt -c ./out.file
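
The per-step commands in this section can be chained from a script. The following is a minimal sketch that only assembles the argument lists shown above. The intermediate file names and the BAM path handed to bincount are assumptions, because the text does not state which file the align step writes.

```python
def pipeline_commands(fastq, bam, genome="hg19", reads=4000000, threads=10):
    """Return the per-step baseq-CNV commands as argument lists.

    `bam` is the alignment file passed to bincount; its name is an
    assumption, since the align step's output path is not documented here.
    """
    counts = "bincounts.txt"        # assumed intermediate file name
    norm = "bincounts_norm.txt"     # assumed intermediate file name
    segments = "out.file"           # name used in the cbs example
    return [
        ["baseq-CNV", "align", "-1", fastq, "-r", str(reads),
         "-g", genome, "-t", str(threads)],
        ["baseq-CNV", "bincount", "-g", genome, "-i", bam, "-o", counts],
        ["baseq-CNV", "normalize", "-g", genome, "-i", counts, "-o", norm],
        ["baseq-CNV", "cbs", "-i", norm, "-o", segments],
        ["baseq-CNV", "plotgenome", "-i", norm, "-c", segments],
    ]


if __name__ == "__main__":
    # Print the commands; to execute them, pass each list to
    # subprocess.run(cmd, check=True) instead.
    for cmd in pipeline_commands("Tn5_S1.fq.gz", "sample.bam"):
        print(" ".join(cmd))
```

Building the commands as lists keeps each step's output name tied to the next step's input, so a renamed file cannot silently break the chain.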

Config

[CNV]
bowtie2 = /mnt/gpfs/Database/softs/anaconda2/bin/bowtie2
samtools = /mnt/gpfs/Database/softs/anaconda2/bin/samtools

[CNV_ref_hg19]
bowtie2_index = /mnt/gpfs/Database/ref/hg19/hg19
dynamic_bin = /mnt/gpfs/Users/zhangxiannian/basematic/cnv/hg19.dynabin.txt
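
The configuration above is in INI format, so it can be inspected with Python's standard configparser module. This is a sketch of how the settings for one genome might be looked up, not how baseq itself reads the file; the helper name load_cnv_config is hypothetical.

```python
import configparser

CONFIG_TEXT = """
[CNV]
bowtie2 = /mnt/gpfs/Database/softs/anaconda2/bin/bowtie2
samtools = /mnt/gpfs/Database/softs/anaconda2/bin/samtools

[CNV_ref_hg19]
bowtie2_index = /mnt/gpfs/Database/ref/hg19/hg19
dynamic_bin = /mnt/gpfs/Users/zhangxiannian/basematic/cnv/hg19.dynabin.txt
"""


def load_cnv_config(text, genome):
    """Return tool paths plus the reference paths for one genome as a dict.

    Hypothetical helper: it assumes reference sections are named
    CNV_ref_<genome>, following the sample configuration.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    ref_section = "CNV_ref_" + genome
    if ref_section not in parser:
        raise KeyError("no [%s] section in config" % ref_section)
    settings = dict(parser["CNV"])
    settings.update(parser[ref_section])
    return settings


if __name__ == "__main__":
    for key, value in sorted(load_cnv_config(CONFIG_TEXT, "hg19").items()):
        print(key, "=", value)
```

A missing reference section raises KeyError, which makes a misspelled genome name visible before any alignment starts.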

Quality Control

Alignment information and MAD

  • Alignment: total reads and mapping ratio
  • MAD: median absolute deviation, which indicates the technical noise level of the sample
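
For reference, the median absolute deviation is defined as MAD = median(|x_i - median(x)|). The sketch below computes it with the standard library only. Which values baseq-CNV feeds into the statistic (for example normalized bin counts, or differences between neighbouring bins) is not stated here, so the input is left to the caller.

```python
from statistics import median


def median_absolute_deviation(values):
    """Return median(|x_i - median(x)|) for a non-empty sequence."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    center = median(values)
    return median(abs(v - center) for v in values)


if __name__ == "__main__":
    # Illustrative numbers only, not output from a real sample.
    print(median_absolute_deviation([2, 2, 3, 4, 14]))
```

Because it uses medians rather than means, a single extreme bin (14 in the example) barely moves the result, which is why MAD is a common noise measure for count profiles.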

Dynamic Bins

The dynamic bin file can be downloaded from GitHub.

The file contains columns.

APIs

Align

baseq.align.bowtie2.bowtie2_sort(fq1, fq2, bamfile, genome, reads=5000000, thread=8)

Align the FASTQ reads with Bowtie2 and sort the resulting SAM file.

from baseq.align.bowtie2 import bowtie2_sort

# single-end reads: pass an empty string for fq2
bowtie2_sort("read.1.fq.gz", "", "sample.bam", "hg19")

# paired-end reads
bowtie2_sort("read.1.fq.gz", "read.2.fq.gz", "sample.bam", "hg19")

Results:

sample.bam
sample.bam.stats

Bincount

baseq.cnv.bincount.counting(genome, bamfile, out)

Count reads per bin, using bisect to locate each read position in the dynamic bins.

from baseq.cnv.bincount import counting
counting("hg19", "aligned.bam", "bincount.txt")

This will generate:

bincount.txt
# A TSV file with two columns: index and counts

Process:

  • Read the dynamic bin file;
  • Read the BAM file using the samtools view command;
  • Keep only reads with mapping quality >= 40;
  • Map each read's genomic position to a bin ID and sum the reads per bin;
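
The steps above can be sketched with Python's bisect module. This is an illustration of the technique, not the baseq implementation. The bin representation (a list of (chrom, start) pairs in genome order) and the helper names are assumptions; the field positions in parse_sam_line follow the SAM format (RNAME, POS and MAPQ are columns 3, 4 and 5).

```python
from bisect import bisect_right
from collections import Counter

MIN_MAPQ = 40  # threshold taken from the process list above


def parse_sam_line(line):
    """Extract (chrom, pos, mapq) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return fields[2], int(fields[3]), int(fields[4])


def build_index(bins):
    """Group bin starts per chromosome; the bin ID is the position in `bins`.

    Assumes `bins` is sorted by start within each chromosome.
    """
    index = {}
    for bin_id, (chrom, start) in enumerate(bins):
        starts, ids = index.setdefault(chrom, ([], []))
        starts.append(start)
        ids.append(bin_id)
    return index


def count_reads(reads, index, min_mapq=MIN_MAPQ):
    """Count reads per bin ID for (chrom, pos, mapq) tuples."""
    counts = Counter()
    for chrom, pos, mapq in reads:
        if mapq < min_mapq or chrom not in index:
            continue
        starts, ids = index[chrom]
        i = bisect_right(starts, pos) - 1  # last bin whose start <= pos
        if i >= 0:
            counts[ids[i]] += 1
    return counts


if __name__ == "__main__":
    # Toy bins and reads, for illustration only.
    bins = [("chr1", 0), ("chr1", 1000), ("chr1", 2500), ("chr2", 0)]
    reads = [("chr1", 10, 42), ("chr1", 2499, 42), ("chr2", 5, 40)]
    for bin_id, n in sorted(count_reads(reads, build_index(bins)).items()):
        print(bin_id, n, sep="\t")
```

The binary search makes each lookup logarithmic in the number of bins on a chromosome, which matters because the bins have variable width and cannot be addressed by simple integer division.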

Normalize

baseq.cnv.normalize.normalize(genome, bincount, name)

Normalize the raw bin counts by bin length and GC content, and estimate the ploidy.

from baseq.cnv.normalize import normalize
normalize("hg19", "bincounts.txt", "CNVsample")

This will generate two files:

Norm.Counts.CNVsample.txt
# columns: 'chr', 'start', 'absstart', 'norm_by_GC', 'norm_by_GC_Ploidy'
Norm.CNVsample.png

Process:

  • Read the dynamicbin;
  • Aggregate the Bins into 500kb;
  • Normalize by bin length;
  • Normalize by GC;
  • Detect the Ploidy;

Output:

  • GC_content_image: image of the normalized bin counts (1M)
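
To make the length and GC steps concrete, here is a minimal sketch. It is not the baseq algorithm: for the GC step it uses a simple stratum-median correction (bins with similar GC content are divided by their group median), which is swapped in because the fitting method baseq uses is not described here. Bin aggregation and ploidy detection are left out, and all names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def normalize_counts(counts, lengths, gc, gc_step=0.01):
    """Return per-bin values corrected for bin length and GC content.

    counts, lengths and gc are parallel sequences (one entry per bin);
    gc holds GC fractions between 0 and 1.
    """
    # 1. Length normalization: reads per base, scaled to a mean of 1.
    density = [c / l for c, l in zip(counts, lengths)]
    overall = mean(density)
    if overall == 0:
        raise ValueError("all bins are empty")
    relative = [d / overall for d in density]

    # 2. GC normalization: divide by the median of the bin's GC stratum.
    strata = defaultdict(list)
    for value, g in zip(relative, gc):
        strata[round(g / gc_step)].append(value)
    stratum_median = {key: median(vals) for key, vals in strata.items()}

    corrected = []
    for value, g in zip(relative, gc):
        m = stratum_median[round(g / gc_step)]
        corrected.append(value / m if m > 0 else 0.0)
    return corrected


if __name__ == "__main__":
    # Toy data: the high-GC bins are under-covered by a factor of two.
    print(normalize_counts([100, 200, 50, 100],
                           [1000, 2000, 1000, 2000],
                           [0.40, 0.40, 0.60, 0.60]))
```

In the toy data the coverage difference is entirely explained by GC content, so the corrected profile is flat. A real copy number change would survive the correction, because it affects only some of the bins in a GC stratum.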

Segmentation

baseq.cnv.segment.CBS(infile, path_out)

Run the DNACopy.R script.

Usage:

from baseq.cnv.segment import CBS
CBS("bincounts_norm.txt", "outfile.txt")
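
The segmentation itself is delegated to the R script, so no Python implementation is described here. Purely to illustrate what segmentation does, the sketch below finds the single best change point in a profile by maximizing a weighted difference of means. This is a simplified stand-in and not CBS: it tests one breakpoint rather than a circular pair, and it performs no significance test. The function name is hypothetical.

```python
from math import sqrt


def best_split(values):
    """Return (k, score) for the split values[:k] | values[k:] that
    maximizes |mean_left - mean_right| * sqrt(k * (n - k) / n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(values)
    best_k, best_score = None, -1.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[k - 1]
        left_mean = left_sum / k
        right_mean = (total - left_sum) / (n - k)
        # The weight favors balanced splits over splits near the ends.
        score = abs(left_mean - right_mean) * sqrt(k * (n - k) / n)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


if __name__ == "__main__":
    # Toy profile with a step between the fourth and fifth bin.
    k, score = best_split([1, 1, 1, 1, 3, 3, 3, 3])
    print("split before index", k)
```

Applying such a split recursively to each resulting segment, and stopping when the score is no longer significant, is the general idea behind binary segmentation methods.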

Visualize

Whole genome

baseq.cnv.plots.genome.plot_genome(bincount, cbs_path, name)

Usage:

plot_genome("sample.norm.txt", "segment.txt", "sample")
# generates CNV.genome.sample.png

http://p8v379qr8.bkt.clouddn.com/Genome12.png

baseq.cnv.plots.genome.plot_genome_multiple(bincount, cbs_path, path_out)

Plot multiple genomes in the same figure.

plot_genome_multiple("sample.norm.txt", "segment.txt", "sample")
http://p8v379qr8.bkt.clouddn.com/Genome_20.png

Region

baseq.cnv.plots.region.plot_region(bincount, cbs_path, path_out)

Plot the region of genome…

ToDo: …….