Expert answer: Please read the attached paper and answer each rubric question in 100 words or more. The rubric lists the questions, which can be answered from the bioinformatics reading.
f17_chipseq_paper_rubric.pdf

bioinformatic_comp.gen.pdf


Points
Excellent (10-9): Student coherently answers the question with little to no grammatical errors.
Fair (8-5): Student somewhat answers the question with a few grammatical errors.
Poor (4-0): Student fails to answer the question or demonstrates a complete lack of understanding of the proposed discussion point.

Discussion points (each scored 10-9, 8-5, or 4-0)
In your own words, describe the methodology of ChIP sequencing.
Describe how this paper approaches the NGS analytical flow. Give details for each step.
Describe the information portrayed in Figure 3 of the paper. What is the data's purpose in the context of the experiment?
What role do peaks serve in ChIP-seq analysis? How can Figure 5 be interpreted in the context of peaks?
Describe the purpose of differential analysis and the methodology behind this type of analysis.
HHS Public Access
Author manuscript
Bio Protoc. Author manuscript; available in PMC 2017 May 15.
Published in final edited form as:
Bio Protoc. 2017 February 5; 7(3): . doi:10.21769/BioProtoc.2123.
Bioinformatic Analysis for Profiling Drug-induced Chromatin
Modification Landscapes in Mouse Brain Using ChIP-seq Data
Yong-Hwee Eddie Loh, Jian Feng, Eric Nestler, and Li Shen*
Fishberg Department of Neuroscience and Friedman Brain Institute, Icahn School of Medicine at
Mount Sinai, New York, USA
Abstract
Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is a
powerful technology to profile genome-wide chromatin modification patterns and is increasingly
being used to study the molecular mechanisms of brain diseases such as drug addiction. This
protocol discusses the typical procedures involved in ChIP-seq data generation, bioinformatic
analysis, and interpretation of results, using a chronic cocaine treatment study as a template. We
describe an experimental design that induces significant chromatin modifications in mouse brain,
and the use of ChIP-seq to derive novel information about the chromatin regulatory mechanisms
involved. We describe the bioinformatic methods used to preprocess the sequencing data, generate
global enrichment profiles for specific histone modifications, identify enriched genomic loci, find
differential modification sites, and perform functional analyses. These ChIP-seq analyses provide
many details about the chromatin changes that are induced in brain by chronic exposure to cocaine,
and generate an invaluable source of information to understand the molecular mechanisms
underlying drug addiction. Our protocol provides a standardized procedure for data analysis and
can serve as a starting point for any other ChIP-seq projects.
Keywords
Chromatin immunoprecipitation (ChIP); Next generation sequencing (NGS); ChIP-seq; Cocaine;
Bioinformatics; Epigenetics; Histone modifications
Background
Chromatin modification has been implicated in the molecular mechanisms of drug addiction
and may hold the key to understanding multiple aspects of addictive behaviors (Robison and
Nestler, 2011). Chromatin Immuno-Precipitation (ChIP) followed by massively parallel
sequencing (ChIP-seq) is the current state-of-the-art technology to profile the chromatin
landscape. The typical ChIP-seq procedure involves three steps: 1) an antibody against a
protein of interest is used to pull down the bound DNA, which has been fixed to the protein
and broken into smaller fragments; 2) the immunoprecipitated DNA is purified and
constructed into a library for high-throughput sequencing of short reads (usually 50–100 bp)
from the ends of the insert DNA fragments; 3) the short reads are aligned to the genome and
put through data analysis. Compared with its predecessor, ChIP-chip, ChIP-seq has
unparalleled advantages such as unbiased coverage of the entire genome, single-base
resolution, and significantly improved signal-to-noise ratio (Park, 2009). It has proven to be
an invaluable tool to understand numerous types of chromatin modifications.
* For correspondence: li.shen@mssm.edu.
Brief overview of ChIP-seq experiment: Please see original research articles for greater
experimental details (Renthal et al., 2009; Lee et al., 2006). Briefly, adult mice received a
standard regimen of repeated cocaine (7 daily IP injections of cocaine [20 mg/kg] or saline)
and were used 24 h after the last injection (Robison and Nestler, 2011). The nucleus
accumbens, a major brain reward region, was obtained by punch dissection and used for
ChIP-seq. Chromatin IP was performed as described previously (Renthal et al., 2009; Lee et
al., 2006; http://jura.wi.mit.edu/young_public/hES_PRC/ChIP.html) with minor
modifications, using two antibodies, anti-H3K4me3 (tri-methylation of Lys4 in histone H3)
(Abcam #ab8580) and anti-H3K9me3 (Abcam #ab8898). Sequencing libraries for each
experimental condition were generated in triplicate, and were then sequenced on an Illumina
sequencer.
Bioinformatic analysis
The sequencing data generated from libraries under treatment and corresponding control
conditions will be used to identify the specific genomic regions that have undergone
chromatin changes. We can then associate these regions with biological functions and select
the regions of interest for further study. The basic procedure for this kind of bioinformatic
analysis can be laid out as a pipeline of eight steps (Figure 1):
1. Perform quality control analyses on the sequencing data files;
2. Align the sequences to the genome;
3. Remove PCR duplicates;
4. Perform cross-correlation quality analysis;
5. Generate coverage files to be loaded into a genome browser;
6. Generate global coverage plots;
7. Perform peak detection and annotation;
8. Perform differential analysis and functional association.
This protocol shall explain each of these steps in more detail, including some of the
principles and rationale underlying the bioinformatics analyses, and provide the necessary
commands for execution in a Unix-like command-line environment.
Equipment
1. Personal computer or high performance computing cluster. All software
mentioned here can be run on a Unix-like system, such as Linux or
Mac. For Windows users, a terminal emulator, such as Cygwin
(http://cygwin.com/), can be used.
Software
1. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
2. Bowtie2 (Langmead and Salzberg, 2012; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
3. Samtools (Li et al., 2009; http://samtools.sourceforge.net/)
4. phantompeakqualtools (http://code.google.com/archive/p/phantompeakqualtools/)
5. IGV and IGVtools (Robinson et al., 2011; http://software.broadinstitute.org/software/igv/)
6. ngs.plot (Shen et al., 2014; http://github.com/shenlab-sinai/ngsplot)
7. MACS (Zhang et al., 2008; http://github.com/taoliu/MACS)
8. diffReps (Shen et al., 2013; http://github.com/shenlab-sinai/diffreps)
9. bedtools (Quinlan and Hall, 2010; http://bedtools.readthedocs.io/en/latest/)
10. region-analysis (Shao et al., 2016; http://github.com/shenlab-sinai/region_analysis)
11. ChIPseqRUs (Loh et al., 2016; https://github.com/shenlab-sinai/chipseq_preprocess)
12. David (Dennis et al., 2003; http://david.ncifcrf.gov/)
13. IPA (http://www.ingenuity.com/)
14. GREAT (McLean et al., 2010; http://bejerano.stanford.edu/great/public/html/)
Procedure
1. Perform quality control analyses on the sequencing data files
We strongly recommend checking the quality of the ChIP-seq sequencing data
before moving on to the next steps. We use FastQC, a program that
performs a series of quality control checks on a next generation sequencing
(NGS) dataset. Some of the more important quality reports to take note of in the
evaluation of sequencing quality include ‘Per base sequence quality’, which
should show consistently high quality values across the reads; ‘Per base
sequence content’, which should show a roughly constant proportion of each
nucleotide across read positions; and ‘Sequence duplication levels’, which should not be
excessive. The FastQC documentation at the web link above provides some
examples of what reports for good and bad sequencing data may look like.
fastqc -o path_to_output_directory -t number_of_threads_to_use sequence_file.fastq
Note: In addition to command line execution, FastQC can also be run as an
interactive graphical application. Also, definitions for command line parameters
(e.g., ‘-o’, ‘-t’) for FastQC and all other programs used in this protocol can be
found in their respective weblinks as provided in the ‘Software’ section above.
2. Alignment of the sequences to the genome
a. For each sample, we performed the alignment of the raw fastq sequence
reads to the reference mouse genome using the Bowtie2 program.
Bowtie2 generates alignment results in the Sequence Alignment/Map
(SAM) format (Li et al., 2009), which contains alignment information
including the short read ID, chromosome name, start position, strand,
mismatch information, and the raw nucleotides of the read, among
others.
bowtie2 -p number_of_compute_cores_to_use -x path_to_mm9_genome_index -1 sequence_file_forward_reads.fastq -2 sequence_file_reverse_reads.fastq -S alignment_file.sam
Note: In cases of single-end sequencing alignments, replace ‘-1
sequence_file_forward_reads.fastq -2 sequence_file_reverse_reads.fastq’ with
‘-U sequence_file.fastq’ in the above command, because Bowtie2 expects
unpaired reads to be supplied with the -U option.
b. From the SAM file generated, we need to extract only the reads that
mapped uniquely to the genome. As multi-mapped reads are
represented by Bowtie2 in the SAM file only by way of an ‘XS:i:value’
tag:value pair, we will use the Linux ‘grep’ command to extract all
lines that do not include ‘XS:i:’ in them. The pattern must be enclosed in plain
(straight) quotes on the command line.
grep -v "XS:i:" alignment_file.sam > alignment_unique.sam
c. The SAM format files are text files which typically end up very large
and unwieldy for the tens of millions of reads aligned per sample. A
companion to SAM, the Binary Alignment/Map (BAM) format (Li et
al., 2009), contains the exact same information as in SAM, but enables
efficient compression and storing of the alignment data, and also allows
for efficient retrieval of the aligned reads. The BAM format is now
widely accepted by most programs processing alignment data. Using
the samtools program, we shall convert the SAM file into the
BAM format and sort it for use in subsequent steps.
samtools view -Sb alignment_unique.sam > alignment_unique.bam
samtools sort alignment_unique.bam alignment_sorted
Note: The sort command above uses the legacy samtools syntax, in which the second
argument is an output prefix (producing alignment_sorted.bam). With newer samtools 1.x
releases, the equivalent command is ‘samtools sort -o alignment_sorted.bam alignment_unique.bam’.
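The grep filter in step 2b matches the text ‘XS:i:’ anywhere on a line, so it also passes through header lines and unaligned reads (which, as far as we are aware, Bowtie2 writes without an XS tag). The following is a minimal Python sketch of a stricter variant of the same idea, not part of the original protocol: it looks for the XS tag only in the optional fields (column 12 onward) and additionally drops reads with the unmapped flag (0x4) set. The function name and the example records are hypothetical and for illustration only.

```python
def keep_unique_alignments(sam_lines):
    """Yield SAM header lines and aligned reads that carry no XS:i: optional field."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            # Bit 0x4 of the FLAG column marks an unmapped read.
            continue
        # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
        optional_fields = fields[11:]
        if any(tag.startswith("XS:i:") for tag in optional_fields):
            # An XS score means a second-best alignment was found (multi-mapped read).
            continue
        yield line


# Hypothetical records: one header, one unique read, one multi-mapped read,
# and one unaligned read.
example = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tYT:Z:UU\n",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXS:i:0\tYT:Z:UU\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tYT:Z:UU\n",
]
kept = list(keep_unique_alignments(example))
for line in kept:
    print(line, end="")
```

On real data the function could be applied to an open file handle in place of the example list; for the protocol as written, the grep command remains the simpler route.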
3. Removing PCR duplicates
a. The presence of potentially redundant short reads, caused by PCR
amplification, represents an intrinsic limitation of second generation
sequencing technology. Because PCR preferentially amplifies DNA
fragments with certain nucleotide compositions, it makes some
genomic locations overrepresented in a ChIP-seq library. While it may
be argued that this overrepresentation bias applies equally to treatment
and control conditions and can therefore be normalized, it actually
perturbs the distribution of the underlying data and may inflate
statistical significance. Take a hypothetical example (Table 1), where
the actual numbers of fragments in libraries A and B are 1 and 3,
respectively. If we simply multiply these numbers by ten, then we will
have 10 fragments in library A and 30 in library B. Both cases give us
the same fold change of 3.0 but dramatically different p-values if
assessed by a Chi-square test.
b. To remove the PCR duplicates, we utilize the fact that, during the
preparation of a ChIP-seq library, a sonicator breaks the whole genome
almost uniformly so that most fragments come from unique positions
(because a mammalian genome is ~2–3 billion bp long). Reads that map to
the same position and strand are therefore most likely PCR duplicates, and
only one of them is kept. This process
may remove ‘innocent’ fragments that truly come from the same
position and strand. But this happens less commonly in comparison
with PCR duplication, and, as a general rule in statistical inference, it is
better to be conservative than to make false claims. Therefore,
we have made removal of PCR duplicates a part of our ChIP-seq
processing pipeline (Loh et al., 2016).
samtools rmdup -s alignment_sorted.bam alignment_rmdup.bam
Note: In cases of paired-end alignments, use the -S option in place of
the -s option.
Special note: When micrococcal nuclease is used in place
of sonication during the fragmentation step in ChIP preparation, the
redundancy removing procedure should not be applied. This is because
the strong preference of micrococcal nuclease to cut a DNA sequence
at particular positions (with specific nucleotide compositions) can lead
to many genuinely duplicated fragments. We are not aware of any
available approach to specifically remove PCR-based duplication in
such a library.
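The hypothetical example in step 3a can be checked numerically. Table 1 is not reproduced here, so the sketch below makes an assumption about the test: a one-degree-of-freedom Chi-square goodness-of-fit test of the two counts against an equal split. The function name is hypothetical, and the exact test behind Table 1 may differ.

```python
import math


def chi_square_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit test of two counts against a 50/50 split.

    Returns the test statistic and the p-value for one degree of freedom.
    """
    expected = (count_a + count_b) / 2.0
    statistic = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For one degree of freedom, the upper-tail probability of the Chi-square
    # distribution equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Original counts (1 vs 3) and the same counts inflated tenfold (10 vs 30).
for count_a, count_b in [(1, 3), (10, 30)]:
    statistic, p_value = chi_square_equal_split(count_a, count_b)
    fold_change = count_b / count_a
    print(f"A={count_a} B={count_b} fold change={fold_change:.1f} "
          f"chi-square={statistic:.1f} p={p_value:.4f}")
```

Under this assumption the fold change stays at 3.0 in both cases, while the p-value drops from roughly 0.32 to below 0.01, which illustrates how unremoved duplicates can inflate statistical significance.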
4. Perform cross-correlation quality analysis
The analysis of strand cross-correlation can be used as a ChIP-seq quality metric
(Kharchenko et al., 2008; Landt et al., 2012). The principle behind this analysis
is that high-quality ChIP-seq experiments are expected to generate clusters of
mapped reads on the forward and reverse strands, with the ChIP binding site
centered between them. By increasingly shifting the reads in the direction of the
strand they map to and then calculating the between-strand Pearson correlation
of read depth at all positions, the plot of cross-correlation coefficient against
strand-shift usually yields two peaks, one corresponding to the read length (the
so called ‘phantom’ peak) and the other to the average fragment length of the
library. The absolute and relative heights of these peaks are then used to calculate
two quality metrics, the normalized strand coefficient (NSC) and the relative
strand correlation (RSC). The ENCODE Consortium ChIP-seq guidelines state that
high quality ChIP-seq datasets typically have NSC > 1.1 and RSC > 1, and
recommend the generation of additional replicates when NSC < 1.05 and RSC < 0.8 (Landt et
al., 2012). Here, we use the run_spp_nodups.R script of the phantompeakqualtools package to
calculate the NSC and RSC values. An example of the NSC and RSC metrics and the
cross-correlation plot generated by the script is shown in Figure 2.
run_spp_nodups.R -savp -c=alignment_rmdup.bam -out=phantompeak_output.txt
Notes:
a. We recommend assessing the NSC and RSC values with some caution. They tend to be
more informative for chromatin marks and transcription factors with punctate peaks than
for those with broad peaks. If your sequencing target is a chromatin mark with broad
peaks, such as H3K9me2, the NSC and RSC values may fall below the ENCODE suggested
criteria. On the other hand, we have found some input control samples to have very high
NSC and RSC values.
b. The ‘phantom’ peak is a phenomenon caused by biased mappability of the genome. When a
genomic region has good mappability, so does its reverse complement. Therefore, both
strands of the genomic region will get an equal amount of aligned reads. Because the read
depth is calculated based on the 5′ ends of the aligned reads, a strand-shift of the read
length will yield a local maximum, the ‘phantom’ peak.
5. Generation of coverage files to be loaded into a genome browser
The effective visualization of a ChIP-seq sample provides valuable information to a bench
biologist (Figure 3). Through the use of a genome browser, an investigator can easily see
which genomic regions are enriched by ChIP. In Figure 3, the y-axis gives the number of
short reads aligned to that location, which can also be normalized by library size for
comparing two or more ChIP-seq samples. Such visualization requires generating a so-called
coverage file for each ChIP-seq sample. Many options are available in choosing a genome browser.
Most genome browsers support their own file format that best suits their implementation,
but often also support some other common formats. The UCSC genome browser (Kent et al.,
2002) is one of the earliest genome browsers and probably still the most comprehensive one.
It is a web-based application and usually requires a user to upload the coverage files to a
remote server. As sequencing technology advances rapidly, the coverage files generated from
ChIP-seq samples keep growing in size, and it has become very inconvenient to upload those
large files to a remote server. We therefore recommend using a genome browser that can work
on a local machine. A very good choice is the IGV genome browser, which is a Java-based
cross-platform application. It supports a binary format called TDF and provides a utility
program (igvtools) for users to generate a TDF file from a BAM format alignment file or
several other commonly used formats.
igvtools count alignment_rmdup.bam alignment_rmdup.tdf mm9
6. Generation of global coverage plots
A global coverage plot that shows the averaged coverage over all instances of annotated
genomic features, such as transcription start sites (TSS), is often extremely helpful in
determining enrichment, assessing IP efficiency, or correlating with gene groups. An example
is shown in Figure 4, which demonstrates that the enrichment of H3K4me3 correlates
positively with gene expression levels. The generation of such a plot is straightforward.
The basic principle is first to calculate the genomic coverage vector for a ChIP-seq sample.
One then retrieves the genomic coordinates for the selected genomic feature from an
annotation database. With the genomic coordinates, the coverage values are then extracted
and averaged.
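The basic principle of extracting and averaging coverage around a genomic feature can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not a description of how any particular package is implemented: the coverage vector, the TSS positions, and the function name are hypothetical, a single chromosome is assumed, and strand is handled simply by reversing the window for minus-strand genes.

```python
def average_profile(coverage, tss_sites, flank):
    """Average per-base coverage in a window of +/- flank bases around each TSS.

    coverage  : list of per-base read depths for one chromosome
    tss_sites : list of (position, strand) tuples with 0-based positions
    flank     : number of bases to include on each side of the TSS
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for position, strand in tss_sites:
        start, end = position - flank, position + flank + 1
        if start < 0 or end > len(coverage):
            # Skip windows that would run off the end of the chromosome.
            continue
        window = coverage[start:end]
        if strand == "-":
            # Reverse minus-strand windows so every window reads 5' to 3'.
            window = window[::-1]
        for offset, depth in enumerate(window):
            totals[offset] += depth
        used += 1
    if used == 0:
        raise ValueError("no TSS window fits inside the coverage vector")
    return [total / used for total in totals]


# Hypothetical coverage vector with two TSSs, one on each strand.
coverage = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 2, 4, 6, 2, 0, 0]
tss_sites = [(4, "+"), (12, "-")]
profile = average_profile(coverage, tss_sites, flank=2)
print(profile)
```

Plotting the returned list against offsets from -flank to +flank gives the kind of averaged curve shown in a global coverage plot; real data would also call for normalization by library size before samples are compared.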
We have developed an open source software package called ngs.plot, based on the statistical
package R, that generates these global coverage plots. The command shown below executes a
basic ngs.plot analysis with the four mandatory arguments. The analysis and plotting can be
customized with additional arguments (see https://github.com/shenlab-sinai/ngsplot/wiki).
ngs.plot is also available as an add-on to Galaxy (Goecks et al., 2010), a popular web-based
genomic analysis framework.
ngs.plot.r -G mm9 -R tss -C alignment_rmdup.bam -O outputfile_prefix
Note: Steps 1–6 above can be considered pre-processing steps that need to be applied to all
samples. While the individual command lines to run each step have been provided here, we
have also developed an open source automated pipeline (Loh et al., 2016) (available at
http://github.com/shenlab-sinai/chip-seq_preprocess/) that automatically runs all these
steps on all samples. The pipeline also features parallel processing and automatic resume of
interrupted jobs. Please refer to the URL above for instructions on how to install and use
the pipeline, which are not covered here.
7. Peak detection and annotation
Determining binding enrichments for a ChIP-seq library with rigorous statistical evaluation
(peak calling) is often extremely useful. Global statistics can be obtained on the peaks to
determine their d ...