Load GenBank into Chado
Authors: Don Gilbert, Brian Osborne
Latest revision as of 21:49, 30 December 2008
Abstract
This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see http://eugenes.org/gmod/genbank2chado/
Summary
- Install the prerequisites: the latest versions of Chado and GBrowse
- Fetch GenBank genomes or chromosomes
- Run the genbank2gff3 script from BioPerl (important: use a version of the script created April 2007 or later)
- Run the gmod_bulk_load_gff3.pl script (from GMOD)
- View the genome(s) with GBrowse (see an example at http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/)
In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, do:
curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
| perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
| perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
Fetch GenBank Genome Files
GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/
mkdir data; cd data
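Fetch from NCBI, or from this Indiana mirror. The first command lists the mirror directory; the second downloads the yeast chromosome X file. The final cd returns to the parent directory, because the later commands refer to the files as data/...
curl ftp://bio-mirror.net/biomirror/ncbigenomes/
curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz
cd ..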
Other sample genomes of interest:
- Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
- Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
- Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
- M_musculus/CHR_19/mm_ref_chr19.gbk.gz
- H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
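Several of the sample genomes listed above can be fetched in one pass. The following is a minimal sketch, assuming the mirror keeps the directory layout shown in this HOWTO. The function names genome_url and fetch_genomes and the MIRROR variable are hypothetical helpers introduced here for illustration; they are not part of GMOD or BioPerl.

```shell
#!/bin/sh
# Base URL of the mirror named in the text; set MIRROR to use NCBI instead.
MIRROR="${MIRROR:-ftp://bio-mirror.net/biomirror/ncbigenomes}"

# genome_url PATH: print the full URL for a path relative to the mirror root.
# A trailing slash on MIRROR is dropped so the result has exactly one separator.
genome_url() {
    printf '%s/%s\n' "${MIRROR%/}" "$1"
}

# fetch_genomes DIR PATH...: download each file into DIR.
# curl -R keeps the remote timestamp, -O saves under the remote file name.
fetch_genomes() {
    dest=$1; shift
    mkdir -p "$dest" || return 1
    for path in "$@"; do
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$dest" && curl -RO "$(genome_url "$path")" ) || return 1
    done
}
```

For example: fetch_genomes data Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz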
Create GFF3 from the GenBank Files
The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF3 suited to Chado loading. Important: use a version of the script created April 2007 or later. The new -noCDS flag is required for this. Use the -s flag to summarize the features found.
bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
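When several chromosomes have been downloaded, the same conversion can be repeated per file. This is a rough sketch using only the flags shown above. The helpers gff_name and convert_all and the DRY_RUN variable are hypothetical, and the output naming (input base name plus ".gff") is an assumption inferred from the file name used in the load command later in this HOWTO.

```shell
#!/bin/sh
# gff_name FILE: print the name the converter is assumed to write for FILE
# when run with "-o .": the base name of the input plus ".gff".
gff_name() {
    printf '%s.gff\n' "$(basename "$1")"
}

# convert_all FILE...: convert each GenBank file with the flags from the text.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_all() {
    for gbk in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "bp_genbank2gff3.pl -noCDS -s -o . $gbk"
        else
            bp_genbank2gff3.pl -noCDS -s -o . "$gbk" || return 1
        fi
    done
}
```

Running DRY_RUN=1 convert_all data/*.gbk.gz first shows what would be executed before any file is converted.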
Load GFF3 into Chado
Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.
gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
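Because the loader handles one organism at a time, loading several genomes means one run per GFF3 file. The sketch below illustrates that loop with the options used in the command above. The functions dbxref_for and load_each are hypothetical, and the pairing of genome directory names with dbxref values is an illustrative assumption; which value fits best depends on what the GenBank annotations actually contain.

```shell
#!/bin/sh
# dbxref_for ORGANISM_DIR: suggest a --dbxref value from a genome directory name.
# The pairing is an illustrative assumption; GeneID is the general fallback.
dbxref_for() {
    case "$1" in
        Saccharomyces_*)  echo SGD ;;
        Caenorhabditis_*) echo WormBase ;;
        Drosophila_*)     echo FLYBASE ;;
        M_musculus*)      echo MGI ;;
        *)                echo GeneID ;;
    esac
}

# load_each DBNAME DBXREF GFF...: run the loader once per GFF3 file,
# stopping at the first failure so a bad file is not silently skipped.
load_each() {
    db=$1; dbxref=$2; shift 2
    for gff in "$@"; do
        gmod_bulk_load_gff3.pl --dbname "$db" --dbxref "$dbxref" \
            --organism fromdata --gff "$gff" || return 1
    done
}
```

For example: load_each dev_chado_01c GeneID data/NC_004353.gbk.gz.gff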
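Check the loaded data. The first query counts all features per organism; the second counts only features that have sequence (seqlen > 0). Each command is written on a single line so that it works unchanged in both bash and tcsh:
psql -d dev_chado_01c -c 'select count(f.*), (select common_name from organism where organism_id = f.organism_id) as species from feature f group by f.organism_id;'
psql -d dev_chado_01c -c 'select count(f.*), (select common_name from organism where organism_id = f.organism_id) as species from feature f where f.seqlen>0 group by f.organism_id;'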
Possible Errors
It's possible that you'll run into some errors coming from the input data itself. Some of the errors, and their fixes, are described below.
couldn't open /var/lib/gmod/conf directory for reading:No such file or directory
Solution: Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:
setenv GMOD_ROOT /usr/local/gmod/ # tcsh
or
export GMOD_ROOT=/usr/local/gmod/ # bash
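A quick check before running the loader can catch this problem early. The following is a minimal sketch; check_gmod_root is a hypothetical helper, and the assumption that the configuration lives in a conf subdirectory of GMOD_ROOT is based only on the path in the error message above.

```shell
#!/bin/sh
# check_gmod_root: succeed only if GMOD_ROOT is set and contains a readable
# conf directory. The conf subdirectory is an assumption taken from the
# path shown in the loader error message.
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ] || [ ! -r "$GMOD_ROOT/conf" ]; then
        echo "no readable conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    return 0
}
```

Calling check_gmod_root && gmod_bulk_load_gff3.pl ... runs the loader only when the check passes.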
Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table
Solution: This error message will be followed by SQL statements that insert the term in the correct way; execute them. One explanation for this error is that the source sequence was curated, but not with terms from the Sequence Ontology (http://sequenceontology.org).
DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...
Solution: The CONTEXT line above tells you what the offending data is. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
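Candidate duplicates can be located before editing the file by hand. The sketch below is one possible approach, not part of the GMOD tools: the hypothetical function find_duplicate_ids uses awk to report every feature type and ID pair that occurs on more than one line of a GFF3 file. Multi-line features, such as a CDS split across several lines, may legitimately share an ID, so the output is a list of places to look rather than a list of confirmed errors.

```shell
#!/bin/sh
# find_duplicate_ids GFF: print "count<TAB>type<TAB>ID" for every
# (feature type, ID) pair that appears on more than one feature line.
find_duplicate_ids() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }      # skip comments, directives and sequence lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) {
                    key = $3 "\t" substr(attrs[i], 4)
                    seen[key]++
                }
        }
        END {
            for (key in seen)
                if (seen[key] > 1) print seen[key] "\t" key
        }
    ' "$1" | sort
}
```

For example: find_duplicate_ids data/NC_004353.gbk.gz.gff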