NOTE: We are working on migrating this site away from MediaWiki, so editing pages will be disabled for now.
Difference between revisions of "GBrowse NGS Tutorial"
m (reformatting, removing #icon stuff) |
m (Text replace - "<python>" to "<syntaxhighlight lang="python">") |
||
Line 802: | Line 802: | ||
Create a file called <tt>relocate.py</tt> and populate it with this code: | Create a file called <tt>relocate.py</tt> and populate it with this code: | ||
− | <python>#!/usr/bin/env /usr/bin/python | + | <syntaxhighlight lang="python">#!/usr/bin/env /usr/bin/python |
# -*- coding: iso-8859-1 -*- | # -*- coding: iso-8859-1 -*- | ||
#------------------------------------------------------------------- | #------------------------------------------------------------------- |
Revision as of 18:11, 9 October 2012
This GBrowse Next Generation Sequencing tutorial was presented by Dave Clements at the Bioinformatics Australia 2009 (BA2009), October 2009. The most recent GBrowse tutorial can be found at the GBrowse Tutorial page.
This tutorial walks you through how to configure the GBrowse genome browser to display Next Generation Sequencing (NGS) data using the SAMtools GBrowse adaptor, Bio::DB::Sam.
Contents
VMware
This tutorial was taught using a VMware system image as a starting point. If you want to start with the same system, download and install the Starting image. See VMware for what software you need to use a VMware system image, and for directions on how to get the image setup and running on your machine.
The starting image used for this tutorial is a modified copy of the one used at the start of the 2009 GMOD Summer School - Europe. See below for what was added to the starting image for this tutorial.
Download |
---|
Starting Image Username: gmod |
Caveats
Important Note
This tutorial describes the world as it existed on the day the tutorial was given. Please be aware that things like CPAN modules, Java libraries, and Linux packages change over time, and that the instructions in the tutorial will slowly drift over time. Newer versions of tutorials will be posted as they become available.
Introduction
The tutorial will show how to display next generation sequence (NGS) data in GBrowse 2 using SAMtools. We'll use the example human data that comes with SAMtools to do that. First we'll load the reference sequence and some gene models.
How the Starting Image was created
The starting VMware image was created by starting with the starting image used in the 2009 GMOD Summer School - Europe. Several key prerequisites, including GBrowse 2 and SAMtools, were then installed on that, and then the starting image was created.
GBrowse 2 is already up and running on the image. This tutorial tells you how to add human data to the default installation, and then how to add NGS data to it, in the form of a Bio::DB::Sam/SAMtools database.
Passwords
- Login: gmod
- Password: gmod
- MySQL root password: none
Fluxbox
The VMware uses the minimal Fluxbox GUI. Some details will be explained here.
Prerequisites
This sections describes what additional prerequisites were installed on the starting image to enable GBrowse 2 and SAMtools.
Everything described in this section has already been installed on the starting image. This section is merely provided for your information.
Misc
This section describes the installation Synaptic is a graphical version of the apt-get package manager for Debian based systems. Emacs is my preferred Linux Text Editor, but any will do.
sudo apt-get install synaptic sudo apt-get install emacs
Now that we have synaptic, I'll use it to upgrade every package in the system. This is necessary because several months have passed since the Europe summer school, and I want to get the latest version of everything I have:
startx
This launches the Fluxbox graphical windows manager. Now start a terminal by right clicking anywhere on the screen and selecting
- Shell → Bash
In the newly launched shell, run synaptic:
sudo synaptic
and update everything:
- Click on Reload
- Click on Mark all Upgrades
- Click on Apply
And the apply step runs for ages. Once that finishes, restart the virtual machine.
Bio::Graphics Prerequisites
sudo perl -MCPAN -e shell install Text::ParseWords quit
Don't install Bio::SCF. It has to do with Staden trace visualization. We won't be showing any.
GBrowse 2 Prerequisites
From the shell
sudo apt-get install build-essential sudo apt-get install libgd-gd2-perl sudo apt-get install libjson-perl libjson-any-perl sudo apt-get install mysql-client libmysqlclient15-dev
Bio::Graphics
Download the latest Bio::Graphics from CPAN.
cd mkdir BA2009 cd BA2009 mkdir BioGraphics cd BioGraphics wget http://mirrors.gossamer-threads.com/CPAN/authors/id/L/LD/LDS/Bio-Graphics-1.982.tar.gz tar xzvf Bio-Graphics-1.982.tar.gz cd Bio-Graphics-1.982/ perl Build.PL ./Build ./Build test sudo ./Build install
Ignoring Bio:SCF warnings. This install process takes 1-2 minutes.
GBrowse 2
cd ~/BA2009/ mkdir GBrowse2 cd GBrowse2 wget http://mirror.its.uidaho.edu/pub/cpan/authors/id/L/LD/LDS/GBrowse-1.9971.tar.gz tar xzvf GBrowse-1.9971.tar.gz cd GBrowse-1.9971/ perl Makefile.PL ... Checking prerequisites... - ERROR: Cache::MemoryCache is not installed ...
Oops.
sudo perl -MCPAN -e shell install Cache::MemoryCache ... Appending installation info to /usr/local/lib/perl/5.8.8/perllocal.pod JSWARTZ/Cache-Cache-1.06.tar.gz /usr/bin/make install UNINST=1 -- OK exit perl Makefile.PL ... Installing /usr/local/bin/gbrowse_clean.pl Writing /usr/local/lib/perl/5.8.8/auto/GBrowse/.packlist ***INSTALLATION COMPLETE*** Load http://localhost/gbrowse2 for demo and documentation
And the rest are optional prereqs.
./Build test **Beginning interactive configuration** Directory for GBrowse's config and support files? [/etc/gbrowse2] Directory for GBrowse's static images & HTML files? [/var/www/gbrowse2] Directory for GBrowse's temporary data [/var/tmp/gbrowse2] Directory for GBrowse's example databases [/var/www/gbrowse2/databases] Directory for GBrowse's CGI script executables? [/usr/lib/cgi-bin/gb2] Internet port to run demo web site on (for demo)? [8000] Apache loadable module directory (for demo)? [/usr/lib/apache2/modules] User account under which Apache daemon runs? [www-data] Automatically update Apache config files to run GBrowse? [y] Automatically update system config files to run gbrowse-slave? [y] **Interactive configuration done. Run './Build reconfig' to reconfigure** Writing config notes to blib/lib/GBrowse/ConfigData.pm ... All tests successful. Files=8, Tests=341, 23 wallclock secs ( 0.00 usr 0.11 sys + 3.87 cusr 2.27 csys = 6.25 CPU) Result: PASS sudo ./Build install
Test it at http://172.16.108.129/gbrowse2/
GBrowse 2 is installed and the default datasources appear to work.
SAMtools
Need to have libbam.a which means we need to a source install of SAMtools.
cd ~/BA2009/ mkdir SAMtools cd SAMtools wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.6/samtools-0.1.6.tar.bz2?use_mirror=softlayer tar xvfj samtools-0.1.6.tar.bz2 cd samtools-0.1.6/
The INSTALL file says:
System Requirements
SAMtools depends on the zlib library <http://www.zlib.net>. The latest version 1.2.3 is preferred and with the latest version you can compile razip and use it to compress a FASTA file. SAMtools' faidx is able to index a razip-compressed FASTA file to save diskspace. Older zlib also works with SAMtools, but razip cannot be compiled.
The text-based viewer (tview) requires the GNU ncurses library <http://www.gnu.org/software/ncurses/>, which comes with Mac OS X and most of the modern Linux/Unix distributions. If you do not have this library installed, you can still compile the rest of SAMtools by manually modifying one line in Makefile.
Compilation
Type `make' to compile samtools. If you have zlib >= 1.2.2.1, you can compile razip with `make razip'.
Installation
Simply copy `samtools' and other executables/scripts in `misc' to a location you want (e.g. a directory in your $PATH). No further configurations are required.
So
make ... gcc -c -g -Wall -O2 -D_FILE_OFFSET_BITS=64 -D_USE_KNETFILE -D_CURSES_LIB=1 bam_tview.c -o bam_tview.o bam_tview.c:5:20: error: curses.h: No such file or directory bam_tview.c:7:2: warning: #warning "_CURSES_LIB=1 but NCURSES_VERSION not defined; tview is NOT compiled" bam_tview.c:409:2: warning: #warning "No curses library is available; tview is disabled." make[1]: *** [bam_tview.o] Error 1 make[1]: Leaving directory `/home/gmod/BA2009/SAMtools/samtools-0.1.6' make: *** [all-recur] Error 1
So, get ncurses. libncurses5 is already installed.
sudo apt-get libncurses5-dev make clean make
Make got a few warnings, but no errors. So, it appears to have made. bam.h and libbam.a are both in the home/gmod/BA2009/SAMtools/samtools-0.1.6 directory.
Bio::DB::Sam Adaptor
cd ~/BA2009/ mkdir GBrowse-Adaptors cd GBrowse-Adaptors/ svn co https://gmod.svn.sourceforge.net/svnroot/gmod/gbrowse-adaptors/ cd gbrowse-adaptors/trunk/Bio-SamTools/
From the README file:
Then set the environment variable SAMTOOLS to point to this directory and run:
perl Build.PL ./Build ./Build test (sudo) ./Build install
Where is SAMTOOLS?
$ export SAMTOOLS=/home/gmod/BA2009/SAMtools/samtools-0.1.6 $ perl ./Build.PL Found /home/gmod/BA2009/SAMtools/samtools-0.1.6/bam.h and /home/gmod/BA2009/SAMtools/samtools-0.1.6/libbam.a. Checking whether your kit is complete... Looks good Checking prerequisites... Looks good Creating new 'Build' script for 'Bio-SamTools' version '1.06' $ ./Build Copying lib/Bio/DB/Bam/Pileup.pm -> blib/lib/Bio/DB/Bam/Pileup.pm Copying lib/Bio/DB/Bam/AlignWrapper.pm -> blib/lib/Bio/DB/Bam/AlignWrapper.pm Copying lib/Bio/DB/Sam.pm -> blib/lib/Bio/DB/Sam.pm Copying lib/Bio/DB/Sam/Segment.pm -> blib/lib/Bio/DB/Sam/Segment.pm Copying lib/Bio/DB/Sam/Constants.pm -> blib/lib/Bio/DB/Sam/Constants.pm Copying lib/Bio/DB/Bam/PileupWrapper.pm -> blib/lib/Bio/DB/Bam/PileupWrapper.pm Copying lib/Bio/DB/Bam/Alignment.pm -> blib/lib/Bio/DB/Bam/Alignment.pm Copying lib/Bio/DB/Bam/Target.pm -> blib/lib/Bio/DB/Bam/Target.pm Copying lib/Bio/DB/Bam/ReadIterator.pm -> blib/lib/Bio/DB/Bam/ReadIterator.pm Copying lib/Bio/DB/Bam/FetchIterator.pm -> blib/lib/Bio/DB/Bam/FetchIterator.pm Copying lib/Bio/DB/Bam/Query.pm -> blib/lib/Bio/DB/Bam/Query.pm lib/Bio/DB/Sam.xs -> lib/Bio/DB/Sam.c cc -I/home/gmod/BA2009/SAMtools/samtools-0.1.6 -I/usr/lib/perl/5.8/CORE -DXS_VERSION="1.06" -DVERSION="1.06" -fPIC -D_IOLIB=2 -D_FILE_OFFSET_BITS=64 -Wformat=0 -c -D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -o lib/Bio/DB/Sam.o lib/Bio/DB/Sam.c ExtUtils::Mkbootstrap::Mkbootstrap('blib/arch/auto/Bio/DB/Sam/Sam.bs') cc -shared -L/usr/local/lib -o blib/arch/auto/Bio/DB/Sam/Sam.so lib/Bio/DB/Sam.o -L/home/gmod/BA2009/SAMtools/samtools-0.1.6 -lbam -lz Manifying blib/lib/Bio/DB/Sam/Constants.pm -> blib/libdoc/Bio::DB::Sam::Constants.3pm Manifying blib/lib/Bio/DB/Bam/AlignWrapper.pm -> blib/libdoc/Bio::DB::Bam::AlignWrapper.3pm Manifying blib/lib/Bio/DB/Bam/Alignment.pm -> blib/libdoc/Bio::DB::Bam::Alignment.3pm Manifying blib/lib/Bio/DB/Sam.pm -> blib/libdoc/Bio::DB::Sam.3pm Manifying blib/lib/Bio/DB/Bam/PileupWrapper.pm -> blib/libdoc/Bio::DB::Bam::PileupWrapper.3pm Manifying blib/lib/Bio/DB/Bam/Pileup.pm -> blib/libdoc/Bio::DB::Bam::Pileup.3pm Manifying blib/lib/Bio/DB/Bam/Query.pm -> blib/libdoc/Bio::DB::Bam::Query.3pm $ ./Build test t/01sam.t .. 1/93 [fai_load] build FASTA index. t/01sam.t .. ok All tests successful. Files=1, Tests=93, 2 wallclock secs ( 0.01 usr 0.01 sys + 0.81 cusr 0.87 csys = 1.70 CPU) Result: PASS $ sudo ./build install nstalling /usr/local/lib/perl/5.8.8/auto/Bio/DB/Sam/Sam.bs Installing /usr/local/lib/perl/5.8.8/auto/Bio/DB/Sam/Sam.so Files found in blib/arch: installing files in blib/lib into architecture dependent library tree Installing /usr/local/lib/perl/5.8.8/Bio/DB/Sam.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/FetchIterator.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/Pileup.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/Target.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/Query.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/Alignment.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/AlignWrapper.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/ReadIterator.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Bam/PileupWrapper.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Sam/Constants.pm Installing /usr/local/lib/perl/5.8.8/Bio/DB/Sam/Segment.pm Installing /usr/local/man/man3/Bio::DB::Bam::PileupWrapper.3pm Installing /usr/local/man/man3/Bio::DB::Bam::Pileup.3pm Installing /usr/local/man/man3/Bio::DB::Sam::Constants.3pm Installing /usr/local/man/man3/Bio::DB::Bam::Query.3pm Installing /usr/local/man/man3/Bio::DB::Sam.3pm Installing /usr/local/man/man3/Bio::DB::Bam::Alignment.3pm Installing /usr/local/man/man3/Bio::DB::Bam::AlignWrapper.3pm Writing /usr/local/lib/perl/5.8.8/auto/Bio/SamTools/.packlist
And the adaptor is now in.
All prerequisite software for this tutorial is now installed. The starting VMware image for this tutorial was taken at this point.
On some system on error occurs like this
$ ./Build cc -shared -O2 -g -L/usr/local/lib -o blib/arch/auto/Bio/DB/Sam/Sam.so lib/Bio/DB/Sam.o -L/home/pgeller/BA2009/SAMtools/samtools-0.1.6/ -lbam -lz /usr/bin/ld: /home/pgeller/BA2009/SAMtools/samtools-0.1.6//libbam.a(bgzf.o): relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC /home/pgeller/BA2009/SAMtools/samtools-0.1.6//libbam.a: could not read symbols: Bad value collect2: ld returned 1 exit status error building blib/arch/auto/Bio/DB/Sam/Sam.so from lib/Bio/DB/Sam.o at /usr/share/perl/5.10/ExtUtils/CBuilder/Base.pm line 213.
If this is the case go back to SAM tools folder and recompile with the -fPIC parameter
make clean make CXXFLAGS=-fPIC CFLAGS=-fPIC CPPFLAGS=-fPIC
Start
Everything above this point describes how the starting image was created. You can now start the tutorial.
If you haven't done so already, download the starting image for this image, unpack it, and start it.
Adding Human Data
SAMtools comes with some example data covering small parts of human chromosomes 2 and 20. This section describes how to add a GBrowse instance for viewing human chromosomes 2 and 20, in addition to the default GBrowse instances.
We are going to use that data to demonstrate how GBrowse visualizes short read data using SAMtools.
Getting Human Data
Note: The FASTA and GFF3 files described here have already been downloaded into the starting VMware image.
It would be nice to show context, such as the reference sequence and predicted genes, with our short reads. To do this we will need to get GFF3 with gene models, and FASTA for the sequence. We can do that.
FASTA
At UCSC, build 36.1 is available by chromosome:
- http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/
- http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr2.fa.gz
- http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr20.fa.gz
According to http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/:
- The chr*_random sequences are unplaced sequence on those reference chromosomes.
So, we will ignore those.
cd ~/BA2009 mkdir HumanData cd HumanData wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr20.fa.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr2.fa.gz gunzip chr20.fa.gz gunzip chr2.fa.gz
Take a look at the first line of each file
head -1 chr2.fa chr20.fa
This should say the names of the sequences are "chr2" and "chr20". This matters because the chromosome names occur in several places and serve to tie everything together.
GFF3
GFF3 is a widely used standard format for genomic annotation.
When you are looking for GFF3 for an organism is a good place to start is GBrowse.org, which lists GBrowse related resources for many organisms.
However, I know that the CBRG at the University of Oxford does a lot with GBrowse and human data, so I asked Simon McGowan if he could make the GFF3 for human chromosomes 2 and 3 available, and he did:
wget http://www.compbio.ox.ac.uk/data/HUMAN_HG18/ensembl/chr2_ens_annots.gff wget http://www.compbio.ox.ac.uk/data/HUMAN_HG18/ensembl/chr20_ens_annots.gff
- Everything up to here has already been done in the starting VMware image.
- Now we start doing things.
Past experience has taught that whenever you get a GFF3 file from elsewhere it is a darn fine idea to check that the reference sequence that all features are positioned on is defined at the top of the file.
head -5 chr2_ens_annots.gff chr20_ens_annots.gff
Nope.
To add reference sequence you need to know how long it is. Since I got the files from CBRG, I went to their GBrowse instance (https://gbrowse.molbiol.ox.ac.uk/cgi-bin/gbrowse/HUMAN_HG18/) and typed in "chr2" and then "chr20" in the search box, and let GBrowse tell me how big they were:
- Showing 242.8 Mbp from chr2, positions 1 to 242,751,149
- Showing 62.44 Mbp from chr20, positions 1 to 62,435,964
I could have also done some quick calculation using the size of the FASTA files.
Edit chr2_ens_annots.gff and add this line, separate each item by tabs:
chr2 Ensembl chromosome 1 242751149 . . . Name=chr2
Edit chr20_ens_annots.gff and add this line, separate each item by tabs:
chr20 Ensembl chromosome 1 62435964 . . . Name=chr20
Gene Definitions in these GFF3 Files
An example gene definition from the GFF3
chr20 Ensembl mRNA 56701315 56724307 . + . ID=NPEPL1;Name=NPEPL1;description=Probable aminopeptidase NPEPL1 (EC 3.4.11.-) (Aminopeptidase-like 1).;ensgene_id=ENSG00000215440 chr20 Ensembl exon 56701315 56701513 . + . ID=ENSE00001542441;Parent=NPEPL1 chr20 Ensembl exon 56702200 56702385 . + . ID=ENSE00001542440;Parent=NPEPL1 ...
Columns
- 1: Reference sequence
- 2: Source: A comment that identifies where this feature came from or what it means
- 3: Type: Sequence Ontology term describing this feature
- 4, 5: Start position and end positions, 1 relative
- 6, 7, 8: Score, Strand, Phase
- 9: Attributes
- ID: Uniquely identifies feature
- Name: used for searches and display
- Parent: Who this feature belongs to in a multi-part feature.
Genes are defined as mRNAs containing one or more exons, in this file.
Displaying Human Data in GBrowse
Now that we have our reference sequence and the Ensembl gene models for chromosomes 2 and 20, lets show them.
Create human.conf
First we need to create a new configuration file for our human data. GBrowse configuration files on Debian based systems go in
/etc/gbrowse2/
To do this, I copied yeast_chr1+2.conf and edited it. Here are the initial contents of the file:
[GENERAL] description = Human Chromosomes 2 and 20 database = annotations initial landmark = chr20:67960..69550 default tracks = PredictedGenes # examples to show in the introduction examples = chr20:67960..69559 chr2:2043960..2045540 # bring in the special Submitter plugin plugins = FastaDumper RestrictionAnnotator SequenceDumper TrackDumper Submitter # "automatic" classes to try when an unqualified identifier is given # automatic classes = Symbol Gene Clone ################################# # database definitions ################################# [annotations:database] db_adaptor = Bio::DB::SeqFeature::Store db_args = -adaptor DBI::mysql -dsn human search options = default +autocomplete # Advanced feature: custom balloons custom balloons = [balloon] delayTime = 500 [balloon500] maxWidth = 500 delayTime = 50 # Advanced feature: an example of callbacks to be run remotely # by gbrowse_details for AJAX/iframe balloons [TOOLTIPS] intro = sub { my $args = shift; my $feat = $args->{feature}; my $name = $feat->display_name; my $type = $feat->primary_tag; my $class = $feat->class; my $extra = join(' ',$feat->each_tag_value('Note')) if $feat->has_tag('Note'); my $n = $type =~ /^[AEIOU]/i ? 'n' : ''; my $msg = "Hello, I am $name, a$n $type of class $class"; $msg .= "<br>I am described as a <i><b>$extra</b></i>" if $extra; $msg .= "<br>Click to see the sequence of $name"; return "<table><tr><th bgcolor='lightsteelblue'>$name</th><tr>" . "<tr><td>$msg</td></tr></table>"; } full_sequence = sub { my $args = shift; my $feat = $args->{feature}; my $name = $feat->display_name; my $seq = $feat->seq->seq; $seq =~ s/(\S{75})/$1\n/g; return "<pre style='font-size:8pt'>>$name\n$seq</pre>"; } # Advanced feature: # Pop up rubberband menus for submitting selected region to search engines... #include "detail_select_menu.conf" #include "submitter_plugin.conf" # Default glyph settings [TRACK DEFAULTS] glyph = generic database = annotations height = 8 bgcolor = cyan fgcolor = black label density = 25 bump density = 100 # default pop-up balloon balloon hover = <b>$name</b> is a $type spanning $ref from $start to $end. Click for more details. ### TRACK CONFIGURATION #### # the remainder of the sections configure individual tracks [DNA] glyph = dna global feature = 1 height = 40 do_gc = 1 category = Basic features gc_window = auto strand = both fgcolor = red axis_color = blue discoverable = 0 key = DNA [Translation] glyph = translation global feature = 1 height = 40 category = Basic features fgcolor = purple strand = +1 translation = 6frame key = 6-frame translation [PredictedGenes] feature = mRNA glyph = so_transcript description = 1 bgcolor = beige category = Basic features key = Ensembl predicted genes
What it Means
Lets go over some of the more important sections.
[GENERAL] description = Human Chromosomes 2 and 20 database = annotations initial landmark = chr20:67960..69550 default tracks = PredictedGenes # examples to show in the introduction examples = chr20:67960..69559 chr2:2043960..2045540
- The title of the GBrowse instance in users' browsers.
- The default DB will be "annotations" - more on that in a bit
- What region and tracks GBrowse will show when it first comes up.
- What example regions will be displayed as hot links.
[annotations:database] db_adaptor = Bio::DB::SeqFeature::Store db_args = -adaptor DBI::mysql -dsn human search options = default +autocomplete
- Which adaptor we are going to use
- GBrowse has lots of adaptors; we'll use 2 in this example
- Which DBMS (MySQL) and database ("human") to get data from.
# Default glyph settings [TRACK DEFAULTS] glyph = generic database = annotations height = 8 bgcolor = cyan fgcolor = black label density = 25 bump density = 100 # default pop-up balloon balloon hover = <b>$name</b> is a $type spanning $ref from $start to $end. Click for more details.
- Track default values. These can be overridden in each track.
- Glyph determine how data is shown
- as a box, linked boxes, xyplot, ...
[DNA] glyph = dna global feature = 1 height = 40 do_gc = 1 category = Basic features gc_window = auto strand = both fgcolor = red axis_color = blue discoverable = 0 key = DNA [Translation] glyph = translation global feature = 1 height = 40 category = Basic features fgcolor = purple strand = +1 translation = 6frame key = 6-frame translation
- Tracks will show the DNA and codon translation.
- Both the "dna" and "translation" glyphs are smart.
- They use semantic zooming to display information in different ways at high zoom and low zoom.
[PredictedGenes] feature = mRNA glyph = so_transcript description = 1 bgcolor = beige category = Basic features key = Ensembl predicted genes
- Show the predicted Ensembl genes.
- The "so_transcript" is also smart and knows how to display multi-part features.
- Genes are defined in the GFF as mRNAs containing one or more exons.
- This track definition is more typical then the DNA and Translation tracks.
- Of particular importance is:
- feature = mRNA
- Tells GBrowse what subset of features from the GFF files to display in this track.
- GBrowse scans column 3 (the Sequence Ontology term) and only displays items with this feature type.
Our example gene definition from the GFF3
chr20 Ensembl mRNA 56701315 56724307 . + . ID=NPEPL1;Name=NPEPL1; ... chr20 Ensembl exon 56701315 56701513 . + . ID=ENSE00001542441;Parent=NPEPL1 chr20 Ensembl exon 56702200 56702385 . + . ID=ENSE00001542440;Parent=NPEPL1 ...
The file also contains information on popups and hovering. This is leftover from the original yeast file. I'm not going to talk about it, but we'll leave it there because it is nice to have.
Create the Human Database
In the human.conf file, we told GBrowse that the data would come from a MySQL database called "human":
[annotations:database] db_adaptor = Bio::DB::SeqFeature::Store db_args = -adaptor DBI::mysql -dsn human
We now have to create it:
mysql -uroot create database human; grant all privileges on human.* to gmod@localhost; grant select on human.* to 'www-data'@'localhost'; quit
"www-data" is the Apache user on Ubuntu. When GBrowse runs, by default it will run as the same user as Apache.
Load the Gene Models and Sequence
The database now exists, but it's empty.
GBrowse has many adaptors that it can use to get data with. We aren't going to use it, but one of the simplest is the memory adaptor. This adaptor just reads files from a directory into memory and uses them there.
The default location for these directories on Ubuntu is
/var/www/gbrowse2/databases/
If you look at that directory you'll notice that the default installation has already created several directories there.
Since we are going to store the data in a database, it doesn't really matter where we put the files. However, to make it easier to keep track of what data was loaded, I like to create a new directory in the default location, and place data files there before loading them. It helps me remember a week later what was uploaded.
sudo mkdir /var/www/gbrowse2/databases/human_annotations sudo chown www-data:www-data /var/www/gbrowse2/databases/human_annotations sudo chmod 755 /var/www/gbrowse2/databases/human_annotations cd /var/www/gbrowse2/databases/human_annotations sudo cp /home/gmod/BA2009/HumanData/chr2_ens_annots.gff . sudo cp /home/gmod/BA2009/HumanData/chr2.fa . sudo cp /home/gmod/BA2009/HumanData/chr20_ens_annots.gff . sudo cp /home/gmod/BA2009/HumanData/chr20.fa .
We told GBrowse to use the Bio::DB::SeqFeature::Store Perl module to retrieve the data. That means will load the data with a BioPerl script that uses that module to load it.
bp_seqfeature_load.pl -c -f -d human chr2_ens_annots.gff chr2.fa 2>&1 | tee /tmp/bpsl.log1 bp_seqfeature_load.pl -f -d human chr20_ens_annots.gff chr20.fa 2>&1 | tee /tmp/bpsl.log2
Tell GBrowse About the new Human Data
GBrowse data sources (that is, different instances of GBrowse on your server) are defined in
/etc/gbrowse2/GBrowse.conf
And it is currently write protected:
sudo chmod 644 /etc/gbrowse2/GBrowse.conf
Edit that file and add this stanza at the end:
[human] description = Human Chromosomes 2 and 20 path = human.conf
Test It
You can either do this from Firefox inside your VMware image, or you can do it from a browser in your host operating system.
From inside VMware, in Firefox type:
To use a browser on your host system first type this command inside the VMware system:
ifconfig ... inet addr:<b>172.16.108.129</b> Bcast:172.16.108.255 Mask:255.255.255.0 ...
Then type the inet addr into your browser. In this case:
You should see something like this:
Play around with it a little. Enable different tracks, scroll and zoom in and out.
Add Short Read Data
We now have a GBrowse instance that shows human chromosomes 2 and 20. If we had other annotation we wanted to show, we could add it incrementally from here.
And we do! Lets show the example data that comes with SAMtools. This can be found in
/home/gmod/BA2009/SAMtools/samtools-0.1.6/examples/
The 00README.txt says:
File ex1.fa contains two sequences cut from the human genome build36. They were extracted with command:
samtools faidx human_b36.fa 2:2043966-2045540 20:67967-69550
Sequence names were changed manually for simplicity. File ex1.sam.gz contains MAQ alignments extracted with:
(samtools view NA18507_maq.bam 2:2044001-2045500; samtools view NA18507_maq.bam 20:68001-69500)
and processed with samtools fixmate to make it self-consistent as a standalone alignment.
To try samtools, you may run the following commands:
samtools faidx ex1.fa # index the reference FASTA samtools import ex1.fa.fai ex1.sam.gz ex1.bam # SAM->BAM samtools index ex1.bam # index BAM samtools tview ex1.bam ex1.fa # view alignment samtools pileup -cf ex1.fa ex1.bam # pileup and consensus samtools pileup -cf ex1.fa -t ex1.fa.fai ex1.sam.gz
Well, crap!
If you read that carefully, it looks like the short read data has been munged to make our lives easier so we don't have to go find a complete FASTA and genome annotations to see the data.
Lets take a look at the SAM file.
cd /home/gmod/BA2009/SAMtools/samtools-0.1.6/examples/ gunzip ex1.sam.gz head -2 ex1.sam B7_591:4:96:693:509 73 seq1 1 99 36M * 0 0 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 MF:i:18 Aq:i:73 NM:i:0 UQ:i:0 H0:i:1 H1:i:0 EAS54_65:7:152:368:113 73 seq1 3 99 35M * 0 0 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): MF:i:18 Aq:i:66 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
Sure enough the RNAME column (col 3) is "seq1" when we want it to be "chr2" and the POS and MPOS columns (cols 4 & 8) are 1 and 0, when we want them to be 2043966 and 2043965 (or something like that).
Unmunge the data
So, we write a quick and dirty script to fix the file:
cd ~/BA2009/HumanData/
Create a file called relocate.py and populate it with this code:
#!/usr/bin/env /usr/bin/python # -*- coding: iso-8859-1 -*- #------------------------------------------------------------------- """ SAMtools comes with example data from human chromosomes 2 and 20, but it has been relocated from its place in those chroms to start at position 1. This script moves it back. """ import sys CHR2_OFFSET = 2043965 CHR20_OFFSET = 67966 # ------------------------------------------------------------------ # MAIN # ------------------------------------------------------------------ config = { "SAM_INPUT_FILE": "/home/gmod/BA2009/SAMtools/samtools-0.1.6/examples/ex1.sam", "SAM_OUTPUT_FILE": "ex1.sam" } # File has 12 or 17 tab separate columns. Update 3 of them. samIn = open(config["SAM_INPUT_FILE"], "r") samOut = open(config["SAM_OUTPUT_FILE"], "w") line = samIn.readline() while (line): # split it by tabs cols = line.split("\t") if len(cols) != 17 and len(cols) != 12: # sanity check print "Warning: Line has " + str(len(cols)) + " columns\n" + line + "\n" # Fix sequence name if cols[2] == "seq1": cols[2] = "chr2" elif cols[2] == "seq2": cols[2] = "chr20" else: print "unknown seq: " + cols[2] + "\n" sys.exit(0) # Fix Pos & mate pos if cols[2] == "chr2": cols[3] = str(int(cols[3]) + CHR2_OFFSET) cols[7] = str(int(cols[7]) + CHR2_OFFSET) else: cols[3] = str(int(cols[3]) + CHR20_OFFSET) cols[7] = str(int(cols[7]) + CHR20_OFFSET) # all happy now. outLine = "\t".join(cols) samOut.write(outLine) line = samIn.readline() sys.exit(0)</python> And run it: <pre>chmod 755 relocate.py ./relocate.py</pre> === Convert Munged SAM to a BAM === <div class="bullets"> * SAM is a human readable text format. * GBrowse expects to find SAM data in BAM format, a binary, indexed and highly compressed format. </div> Now we invoke some SAMtools magic. First create a single FASTA file from the two: cat chr2.fa chr20.fa > chr2_20.fa '''Note:''' There may be a way to use two separate FASTA files, but I don't know how. Second, create a FASTA index file on the combined file: export SAMTOOLS=/home/gmod/BA2009/SAMtools/samtools-0.1.6 $SAMTOOLS/samtools faidx chr2_20.fa Which produces the <tt>chr2_20.fa.fai</tt> file. Now lets try to convert the SAM to BAM: $SAMTOOLS/samtools import chr2_20.fa.fai ex1.sam ex1.bam [sam_header_read2] 2 sequences loaded. === Copy the BAM files to where GBrowse can see them === The GBrowse SAMtools adaptor does expect to find the files in a particular directory. Create one in the usual place: <pre>sudo mkdir /var/www/gbrowse2/databases/humansam sudo chown www-data:www-data /var/www/gbrowse2/databases/humansam sudo chmod 755 /var/www/gbrowse2/databases/humansam sudo cp chr2_20.fa /var/www/gbrowse2/databases/humansam/ sudo cp ex1.bam /var/www/gbrowse2/databases/humansam/</pre> === Tell GBrowse About the SAM Files === To do this update the config file for human at: /etc/gbrowse2/human.conf There are several changes to make. First add the new database section right below the existing one: <pre>[humansam:database] db_adaptor = Bio::DB::Sam db_args = -fasta /var/www/gbrowse2/databases/humansam/chr2_20.fa -bam /var/www/gbrowse2/databases/humansam/ex1.bam search options = default</pre> This tells GBrowse that there is a second database to get things from, where to find it, and to use the Bio::DB::Sam adaptor to read it in. ==== Displaying Summary Data ==== Next, add in a track to show read coverage as an XY plot. At the bottom of the file add: <pre>[CoverageXyplot] feature = coverage glyph = wiggle_xyplot database = humansam height = 50 fgcolor = black bicolor_pivot = 20 pos_color = blue neg_color = red key = Coverage (xyplot) category = Reads label = 0 # Labels on wiggle tracks are redundant.</pre> Save your changes, and hit the "'''[Reset]'''" link on your GBrowse window. This should return you to where you first started except that there is now a "'''Coverage (xyplot)'''" track you can turn on. Turn it on. '''Also, turn off "Cache Tracks"'''. You want that off whenever you are experimenting with the configuration file. You should see something like: [[File:GBrowseSAMtoolsCoverageXY.png|600px|center]] This shows the read coverage. We told GBrowse to flag any read depth below 20 as red. You can also show read coverage with color intensity. Add this stanza at the end of the config file: <pre>[CoverageDensity] feature = coverage glyph = wiggle_density database = humansam height = 30 bgcolor = blue bicolor_pivot = 20 pos_color = blue neg_color = red key = Coverage (density plot) category = Reads label = 0 # labels on wiggle tracks are redundant</pre> Save your changes, and hit the "'''[Reset]'''" link again. You will now have a "Coverage (density plot)" track you can turn on. Turn it on. ==== Showing Individual Reads ==== You can also show the individual reads. Add this clause at the end of <tt>/etc/gbrowse2/human.conf</tt>: <pre>[Reads] feature = match glyph = segments draw_target = 1 show_mismatch = 1 mismatch_color = red database = humansam bgcolor = blue fgcolor = white height = 5 label density = 50 bump = fast key = Reads category = Reads</pre> Save your changes, hit the "'''[Reset]'''" link, and turn on the "Reads" track. It should look like: [[File:GBrowseSAMtoolsReads1.png|600px|center]] Zoom in. When you get close to about 100bp resolution, you'll see the individual base pairs, and mismatches will be highlighted in red. ==== Showing Mapping Quality ==== A lot of people want to show individual base read quality. We can't do that yet, but we do have programmatic access to the individual reads and can use that to display some things, such as mapping quality. We do this with a Perl ''callback'', a short (usually) snippet of code that sets a parameter based on values in the feature. The popup and hover settings in the file already use this. Here, we'll do something simple: setting the color intensity based on mapping quality. In the config file, <tt>/etc/gbrowse2/human.conf</tt>, in the <tt>[Reads]</tt> stanza, replace this lines: <pre>bgcolor = blue fgcolor = white</pre> with <pre>bgcolor = sub { my $feature = shift; my $blueness = 255 - int($feature->qual * 2.40); my $colour = chr(35) . sprintf("%X", $blueness) . sprintf("%X", $blueness) . "FF"; return $colour; } fgcolor = black</pre> High quality mappings will be dark blue, lower quality ones will be light blue. Give it a go. ==== Showing Paired End Reads ==== Finally, the example data is for paired end reads. To visualize the pairings add this stanza at the end of the config file: <pre>[Pair] feature = read_pair glyph = segments database = humansam draw_target = 1 show_mismatch = 1 bgcolor = sub { my $f = shift; return $f->attributes('M_UNMAPPED') ? 'red' : 'green'; } fgcolor = green height = 3 label = sub {shift->display_name} label density = 50 bump = fast connector = dashed balloon hover = sub { my $f = shift; return '' unless $f->type eq 'match'; return 'Read: '.$f->display_name.' : '.$f->flag_str; } key = Read Pairs category = Reads</pre> Reset GBrowse and test this. [[Category:GBrowse 2]] [[Category:Tutorials]]