Difference between revisions of "Load GenBank into Chado"

Latest revision as of 21:49, 30 December 2008

Abstract

This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:

http://eugenes.org/gmod/genbank2chado/

Summary

Install prerequisites: latest versions of Chado and GBrowse
Fetch GenBank genome/chromosome files
Run the genbank2gff3 script from BioPerl (Important: use a version of the script created April 2007 or later)
Run the gmod_bulk_load_gff3.pl script (from GMOD)
View the genome(s) with GBrowse (see an example at eugenes.org).

In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, run:

 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

Fetch GenBank Genome Files

GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

 mkdir data; cd data

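Fetch from NCBI or from this Indiana mirror. The first command lists the mirror's directory; the second downloads the yeast chromosome X file: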

 curl ftp://bio-mirror.net/biomirror/ncbigenomes/
 curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

Other sample genomes of interest:

Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
M_musculus/CHR_19/mm_ref_chr19.gbk.gz
H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
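
To fetch several of these files in one pass, a small shell helper can build each URL from the mirror root and hand it to curl. This is only a sketch: the function names and the DRY_RUN switch are hypothetical conveniences, not part of any GMOD or BioPerl tool, and it assumes the mirror keeps the directory layout shown above.

```shell
# Mirror root named in this HOWTO; use ftp://ftp.ncbi.nih.gov/genomes for NCBI instead.
MIRROR="ftp://bio-mirror.net/biomirror/ncbigenomes"

# genome_url REL: print the full URL for a path relative to the mirror root.
genome_url() {
  printf '%s/%s\n' "$MIRROR" "$1"
}

# fetch_genomes REL...: download each file into the current directory.
# DRY_RUN=1 (a hypothetical switch) prints the URLs instead of downloading.
fetch_genomes() {
  for rel in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      genome_url "$rel"
    else
      # -R keeps the remote timestamp, -O saves under the remote file name
      curl -RO "$(genome_url "$rel")" || return 1
    fi
  done
}
```

For example, after defining the functions, run this to preview the URLs before downloading:

 DRY_RUN=1 fetch_genomes Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz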

Create GFF3 from the GenBank Files

The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank records to GFF3 suitable for loading into Chado. Important: use a version of the script created April 2007 or later.

The new -noCDS flag is required for this. Use the -s flag to summarize the features found.

 bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
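
As an independent check on the converted file, the feature types can also be tallied directly from the GFF3 with awk. The sketch below is not part of BioPerl; the function name is hypothetical, and it assumes a standard nine-column, tab-separated GFF3 file in which any sequence data follows a ##FASTA line.

```shell
# count_feature_types FILE: tally column 3 (feature type) of a GFF3 file.
# Comment and directive lines are skipped; reading stops at ##FASTA.
count_feature_types() {
  awk -F '\t' '
    /^##FASTA/ { exit }            # sequence section: no more features
    /^#/ || NF < 9 { next }        # skip comments and malformed lines
    { n[$3]++ }
    END { for (t in n) print n[t] "\t" t }
  ' "$1" | sort -k2,2
}
```

Each output line is a count followed by a feature type, sorted by type. Comparing the counts with the -s summary is a quick way to spot features that were dropped or unexpectedly typed.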

Load GFF3 into Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

 gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
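
Because the loader handles one organism at a time, several genomes mean several separate runs, each with its own --dbxref. The following sketch only prints the commands so they can be reviewed first; the function names and the DBXREF=FILE argument convention are hypothetical, and only the flags already shown above are used.

```shell
# load_cmd DB DBXREF GFF: print the loader command for one organism's GFF3 file.
load_cmd() {
  printf 'gmod_bulk_load_gff3.pl --dbname %s --dbxref %s --organism fromdata --gff %s\n' "$1" "$2" "$3"
}

# load_plan DB DBXREF=FILE...: print one loader command per GFF3 file.
# Each argument pairs a dbxref name with a file (hypothetical convention).
load_plan() {
  db=$1
  shift
  for pair in "$@"; do
    load_cmd "$db" "${pair%%=*}" "${pair#*=}"
  done
}
```

For example, the following prints a single command identical to the one above; once the printed plan looks right, it can be piped to sh to run the loads in sequence:

 load_plan dev_chado_01c GeneID=data/NC_004353.gbk.gz.gff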

Check data:

 psql -d dev_chado_01c -c "select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;"
 psql -d dev_chado_01c -c "select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;"

Possible Errors

You may run into errors, some of them caused by the input data itself. Several of these errors, and their fixes, are described below.

couldn't open /var/lib/gmod/conf directory for reading:No such file or directory

Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:

 setenv GMOD_ROOT /usr/local/gmod/ # tcsh

or

 export GMOD_ROOT=/usr/local/gmod/ # bash
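
A quick sanity check before running the loader can confirm that the variable points somewhere sensible. This sketch assumes, based only on the path in the error message above, that the loader expects a conf directory directly under GMOD_ROOT; the function name is hypothetical.

```shell
# check_gmod_root: succeed only if GMOD_ROOT is set and contains a conf directory.
# Assumption: conf sits directly under GMOD_ROOT, as the error message suggests.
check_gmod_root() {
  if [ -z "${GMOD_ROOT:-}" ]; then
    echo "GMOD_ROOT is not set" >&2
    return 1
  fi
  if [ ! -d "$GMOD_ROOT/conf" ]; then
    echo "no conf directory under $GMOD_ROOT" >&2
    return 1
  fi
  echo "GMOD_ROOT looks usable: $GMOD_ROOT"
}
```

Calling check_gmod_root in the same shell session that will run gmod_bulk_load_gff3.pl shows whether the setting is visible there.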

Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table

Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One explanation for this error is that the source sequence was curated, but not with terms from the Sequence Ontology.

DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...

Solution: The CONTEXT line above tells you which data is causing the problem. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
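
Candidate duplicates can be located before reloading by scanning the GFF3 file for ID values that occur more than once with the same feature type. This is a sketch under the explanation given above, not a diagnostic from the loader itself: the function name is hypothetical, it looks only at the ID attribute, and a legitimately multi-line feature will also show up in its output.

```shell
# find_duplicate_ids FILE: print "count<TAB>type<TAB>ID" for every (type, ID)
# pair that appears on more than one feature line of a GFF3 file.
find_duplicate_ids() {
  awk -F '\t' '
    /^##FASTA/ { exit }            # stop at the sequence section
    /^#/ || NF < 9 { next }        # skip comments and malformed lines
    {
      n = split($9, attrs, ";")    # column 9 holds key=value attributes
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) seen[$3 "\t" substr(attrs[i], 4)]++
    }
    END { for (k in seen) if (seen[k] > 1) print seen[k] "\t" k }
  ' "$1" | sort
}
```

An empty result means no (type, ID) pair repeats; otherwise each line names a pair worth inspecting by hand.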

Authors

Don Gilbert
Brian Osborne
