
Difference between revisions of "Load GenBank into Chado"

==Authors==
 
 
* [[Don Gilbert]]
 
* [[bp:Brian_Osborne|Brian Osborne]]
 
 
 
==Copyright==
 
 
This document is copyright Don Gilbert, 2007. For reproduction other than personal use please contact <gilbertd@cricket.bio.indiana.edu>
 
 
==Revision History==
 
 
{| border="1" cellspacing="0" cellpadding="4"
 
|-
 
| Revision 1.0 2007-04-16 BIO
 
| First version
 
|-
 
|}
 
  
 
Latest revision as of 21:49, 30 December 2008

Abstract

This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:

http://eugenes.org/gmod/genbank2chado/


Summary

  • Install prerequisites: latest versions of Chado and GBrowse
  • Fetch GenBank genomes/chromosomes
  • Run the genbank2gff3 script from BioPerl (Important: use a version of the script created April 2007 or later)
  • Run the gmod_bulk_load_gff3.pl script (from GMOD)
  • View the genome(s) with GBrowse (see an example at http://server3.eugenes.org/cgi-bin/gmod01/gbrowse/dev_chado_ggb/).

In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, run:

 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata
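
In a single pipeline, a failed download or conversion can go unnoticed because only the last command's exit status is reported. The following is a minimal sketch of the same three steps run one at a time with intermediate files, so that each step can be checked. The function name load_chromosome and the temporary file names are hypothetical; the script names and flags are the ones used in the pipeline above, and both Perl scripts are assumed to be in the current directory.

```shell
#!/bin/sh
# load_chromosome URL DBNAME
# Hypothetical helper: fetch one GenBank file, convert it to GFF3, and
# load it into Chado, stopping at the first step that fails.
load_chromosome() {
    url=$1
    db=$2
    tmp=$(mktemp -d) || return 1

    # -f makes curl return a non-zero status on server errors.
    curl -fsS "$url" -o "$tmp/in.gbk" || return 1

    # Same converter flags as in the one-line pipeline.
    perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
        < "$tmp/in.gbk" > "$tmp/out.gff" || return 1

    # Same loader flags as in the one-line pipeline.
    perl gmod_bulk_load_gff3.pl -dbname "$db" -organism fromdata \
        < "$tmp/out.gff" || return 1

    # Intermediate files are removed only after a successful load;
    # on failure they stay in $tmp for inspection.
    rm -rf "$tmp"
}

# Example (requires network access and a Chado database):
#   load_chromosome \
#     ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
#     mychado
```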

Fetch GenBank Genome Files

GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

 mkdir data; cd data

Fetch from NCBI, or this Indiana mirror

 curl ftp://bio-mirror.net/biomirror/ncbigenomes/
 curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

Other sample genomes of interest:

  • Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
  • Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
  • Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
  • M_musculus/CHR_19/mm_ref_chr19.gbk.gz
  • H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
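
Several of the sample genomes listed above can be fetched in one loop. This is a sketch only: genome_url and fetch_genome are hypothetical helper names, and the mirror address is the one given in this HOWTO, which may no longer be reachable.

```shell
#!/bin/sh
# Base URL of the mirror named in this HOWTO.
MIRROR="ftp://bio-mirror.net/biomirror/ncbigenomes"

# genome_url (hypothetical helper): join the mirror base URL and a
# relative path such as those in the list of sample genomes.
genome_url() {
    printf '%s/%s\n' "$MIRROR" "$1"
}

# fetch_genome (hypothetical helper): download one file into the
# current directory. -R keeps the remote timestamp, -O keeps the
# remote file name.
fetch_genome() {
    curl -RO "$(genome_url "$1")"
}

# Example (requires network access):
#   for g in Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz \
#            Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz; do
#       fetch_genome "$g" || echo "failed: $g" >&2
#   done
```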


Create GFF3 from the GenBank Files

The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF3 suitable for loading into Chado. Important: use a version of the script created April 2007 or later.

The new -noCDS flag is required for this. Use the -s flag to summarize the features found.

 bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
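
Before loading, it can help to confirm that the converter's output really is GFF3 and to see which feature types it contains. The sketch below uses only standard Unix tools. The function names are hypothetical, and it assumes the output file begins with the standard "##gff-version 3" line and is not compressed.

```shell
#!/bin/sh
# is_gff3 FILE (hypothetical helper): succeed if the first line of
# FILE is the GFF3 version pragma.
is_gff3() {
    head -n 1 "$1" | grep -q '^##gff-version[[:space:]]*3'
}

# count_feature_types FILE (hypothetical helper): print "count type"
# for each value of column 3, skipping comment and non-feature lines.
count_feature_types() {
    awk -F '\t' '
        !/^#/ && NF >= 9 { n[$3]++ }
        END { for (t in n) print n[t], t }' "$1" | sort -k2
}

# Example, with the output file name used later in this HOWTO:
#   is_gff3 data/NC_004353.gbk.gz.gff && count_feature_types data/NC_004353.gbk.gz.gff
```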


Load GFF3 into Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

 gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
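
One way to decide which --dbxref value fits a genome is to tally the database names that appear in the /db_xref qualifiers of the GenBank file. The sketch below does this with standard Unix tools. The function name count_dbxref_dbs is hypothetical, and it assumes qualifiers of the usual form /db_xref="DB:ID", one per line.

```shell
#!/bin/sh
# count_dbxref_dbs FILE (hypothetical helper): count how often each
# database name occurs in /db_xref="DB:ID" qualifiers. Handles plain
# and gzip-compressed GenBank files.
count_dbxref_dbs() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac |
    sed -n 's|.*/db_xref="\([^:"]*\):.*|\1|p' |
    sort | uniq -c | sort -rn
}

# Example:
#   count_dbxref_dbs data/NC_001142.gbk.gz
```

The most frequent database name that identifies genes is a reasonable first candidate for --dbxref; this is a heuristic, not a rule of the loader.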

Check data:

 psql -d dev_chado_01c -c 'select count(f.*), (select common_name from organism where organism_id = f.organism_id) as species from feature f group by f.organism_id;'
 psql -d dev_chado_01c -c 'select count(f.*), (select common_name from organism where organism_id = f.organism_id) as species from feature f where f.seqlen>0 group by f.organism_id;'


Possible Errors

It's possible that you'll run into some errors coming from the input data itself. Some of the errors, and their fixes, are described below.


couldn't open /var/lib/gmod/conf directory for reading:No such file or directory

Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:

 setenv GMOD_ROOT /usr/local/gmod/ # tcsh

or

 export GMOD_ROOT=/usr/local/gmod/ # bash
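
A quick check before running the loader can catch this problem early. The sketch below is an illustration only: check_gmod_root is a hypothetical name, and the assumption that the configuration lives in a conf directory directly under GMOD_ROOT is inferred from the path in the error message above.

```shell
#!/bin/sh
# check_gmod_root (hypothetical helper): verify that GMOD_ROOT is set
# and that it contains a conf directory (assumed layout).
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ]; then
        echo "no conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    echo "GMOD_ROOT looks usable: $GMOD_ROOT"
}

# Example:
#   check_gmod_root && gmod_bulk_load_gff3.pl --dbname dev_chado_01c ...
```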


Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table

Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One explanation for this error is that the source sequence was curated, but not with terms from the Sequence Ontology.
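
To see in advance which attribute tags a GFF3 file uses, and therefore which ones might trigger this message, the tags in column 9 can be listed. This is a sketch with a hypothetical function name, list_gff3_tags; it does not query Chado, so it cannot say which tags are actually missing from the cvterm table.

```shell
#!/bin/sh
# list_gff3_tags FILE (hypothetical helper): print "count tag" for
# every attribute tag found in column 9 of the feature lines.
list_gff3_tags() {
    awk -F '\t' '
        !/^#/ && NF >= 9 {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] != "") tags[kv[1]]++
            }
        }
        END { for (t in tags) print tags[t], t }' "$1" | sort -k2
}

# Example:
#   list_gff3_tags data/NC_004353.gbk.gz.gff
```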


DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...

Solution: The CONTEXT line above tells you what the offending data is. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
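
Candidate duplicates can be located before reloading. The sketch below, with the hypothetical function name find_duplicate_features, reports each combination of feature type and ID that occurs on more than one line. Note that GFF3 allows a discontinuous feature to repeat its ID across lines, so the output is a list of candidates to inspect, not a list of confirmed errors.

```shell
#!/bin/sh
# find_duplicate_features FILE (hypothetical helper): print
# "type<TAB>ID" once for every (type, ID) pair seen on two or more
# feature lines.
find_duplicate_features() {
    awk -F '\t' '
        !/^#/ && NF >= 9 {
            id = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) id = substr(attrs[i], 4)
            # Report the pair when it is seen for the second time.
            if (id != "" && seen[$3 SUBSEP id]++ == 1)
                print $3 "\t" id
        }' "$1"
}

# Example:
#   find_duplicate_features data/NC_004353.gbk.gz.gff
```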

Authors