Load GenBank into Chado

Abstract

This HOWTO describes how to load GenBank format files into Chado. For a thorough discussion of this topic, including all the files needed to set up a complete test environment, see:

http://eugenes.org/gmod/genbank2chado/

Summary

Install prerequisites: latest versions of Chado and GBrowse
Fetch GenBank genome/chromosome files
Run genbank2gff3 script from BioPerl (Important: use a version of the script created April 2007 or later)
Run gmod_bulk_load_gff3.pl script (from GMOD)
View genome(s) with GBrowse (see an example at eugenes.org).

In summary, to load Saccharomyces chromosome X into the Chado database 'mychado' from a Unix command line, run:

 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
 | perl bp_genbank2gff3.pl -noCDS -in stdin -out stdout \
 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

Fetch GenBank Genome Files

GenBank genome data is available from the NCBI genomes section, ftp://ftp.ncbi.nih.gov/genomes, or from a current mirror at ftp://bio-mirror.net/biomirror/ncbigenomes/

 mkdir data; cd data

Fetch from NCBI or from this Indiana mirror:

 curl ftp://bio-mirror.net/biomirror/ncbigenomes/
 curl -RO ftp://bio-mirror.net/biomirror/ncbigenomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk.gz

Other sample genomes of interest:

Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz
Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz
Arabidopsis_thaliana/CHR_IV/NC_003075.gbk.gz
M_musculus/CHR_19/mm_ref_chr19.gbk.gz
H_sapiens/CHR_19/hs_ref_chr19.gbk.gz
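
To fetch several of these genomes in one pass, the paths above can be appended to the mirror URL in a loop. The following is a minimal sketch, not part of the GMOD or BioPerl tools; the function names genome_url and fetch_genomes are made up for illustration, and it assumes the mirror layout shown above is still current.

```shell
#!/bin/sh
# Mirror root taken from the text above.
BASE_URL="ftp://bio-mirror.net/biomirror/ncbigenomes"

# Print the full URL for a path relative to the mirror root.
# (genome_url is a hypothetical helper name.)
genome_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# Fetch each listed genome into the current directory.
# -R keeps the remote timestamp, -O keeps the remote file name.
# (fetch_genomes is a hypothetical helper name.)
fetch_genomes() {
    for path in "$@"; do
        curl -RO "$(genome_url "$path")" || echo "failed: $path" >&2
    done
}
```

For example, `fetch_genomes Drosophila_melanogaster/CHR_4/NC_004353.gbk.gz Caenorhabditis_elegans/CHR_III/NC_003281.gbk.gz` would download both files, reporting any path that fails without stopping the loop.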

Create GFF3 from the GenBank Files

The BioPerl script bp_genbank2gff3.pl (scripts/Bio-DB-GFF/genbank2gff3.PLS) converts GenBank files to GFF3 suitable for loading into Chado. Important: use a version of the script created April 2007 or later.

The new -noCDS flag is required for this. Use the -s flag to summarize the features found.

 bp_genbank2gff3.pl -noCDS -s -o . data/NC_001142.gbk.gz
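
Before loading, it can help to check independently which feature types ended up in the GFF3 output. The sketch below counts features by type (column 3 of GFF3) with awk. The function name gff3_type_counts is made up for illustration, and the script assumes a tab-delimited, nine-column GFF3 file that may end with a ##FASTA section.

```shell
#!/bin/sh
# Count GFF3 features by type (column 3).
# Skips comment and directive lines, and stops at a ##FASTA directive
# so that embedded sequence is never parsed as features.
# (gff3_type_counts is a hypothetical helper name.)
gff3_type_counts() {
    awk -F'\t' '
        /^##FASTA/     { exit }          # END block still runs after exit
        /^#/ || NF < 9 { next }          # not a feature line
                       { n[$3]++ }
        END { for (t in n) print n[t] "\t" t }
    ' "$1" | LC_ALL=C sort -k2
}
```

Called as `gff3_type_counts FILE.gff`, it prints one line per feature type with the count first, sorted by type name.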

Load GFF3 into Chado

Use the GMOD script gmod_bulk_load_gff3.pl for this. Note that gmod_bulk_load_gff3.pl will only handle one organism at a time. Choose the best --dbxref per organism (WormBase, SGD, MGI, FLYBASE), depending on the contents of the GenBank annotations. The 'GeneID' dbxref is standard for most GenBank genomes.

 gmod_bulk_load_gff3.pl  --dbname dev_chado_01c --dbxref GeneID --organism fromdata --gff data/NC_004353.gbk.gz.gff
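
Because the loader handles one organism per run, loading several genomes means one invocation per file, each with its own --dbxref. The sketch below shows one way to organize that. The function names and the organism-to-dbxref mapping are illustrative assumptions, not something the loader provides; check the db_xref tags actually present in each GenBank file before relying on the mapping, and fall back to GeneID when unsure.

```shell
#!/bin/sh
# Pick a --dbxref value from the organism directory name used on the
# NCBI mirror. ASSUMPTION: this mapping is illustrative only; verify it
# against the annotations in your own files.
# (dbxref_for is a hypothetical helper name.)
dbxref_for() {
    case "$1" in
        Saccharomyces_cerevisiae*) echo SGD ;;
        Caenorhabditis_elegans*)   echo WormBase ;;
        Drosophila_melanogaster*)  echo FLYBASE ;;
        M_musculus*)               echo MGI ;;
        *)                         echo GeneID ;;   # standard fallback
    esac
}

# Load a single GFF3 file; run once per organism.
# usage: load_gff3 DBNAME ORGANISM_DIR GFF_FILE
# (load_gff3 is a hypothetical helper name.)
load_gff3() {
    gmod_bulk_load_gff3.pl --dbname "$1" --dbxref "$(dbxref_for "$2")" \
        --organism fromdata --gff "$3"
}
```

For example, `load_gff3 dev_chado_01c H_sapiens data/hs_ref_chr19.gbk.gz.gff` would run the loader with --dbxref GeneID.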

Check data:

 psql -d dev_chado_01c -c 'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f group by f.organism_id;'
 psql -d dev_chado_01c -c 'select count(f.*), \
  (select common_name from organism where organism_id = f.organism_id) as species \
  from feature f where f.seqlen>0 group by f.organism_id;'

Possible Errors

You may run into errors, some of them caused by the input data itself. Some of these errors, and their fixes, are described below.

couldn't open /var/lib/gmod/conf directory for reading:No such file or directory

Make sure the environment variable GMOD_ROOT is set to the directory where GMOD was installed, for example:

 setenv GMOD_ROOT /usr/local/gmod/ # tcsh

or

 export GMOD_ROOT=/usr/local/gmod/ # bash
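
A quick check before running the loader can confirm that the variable is set and points at a usable installation. This is a minimal sketch: the function name check_gmod_root is made up for illustration, and the assumption that the loader reads a conf directory directly under GMOD_ROOT is inferred from the error message above rather than taken from the GMOD documentation.

```shell
#!/bin/sh
# Return 0 if GMOD_ROOT is set and contains a readable conf directory.
# ASSUMPTION: the loader looks for $GMOD_ROOT/conf, as the error
# message above suggests.
# (check_gmod_root is a hypothetical helper name.)
check_gmod_root() {
    if [ -z "${GMOD_ROOT:-}" ]; then
        echo "GMOD_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$GMOD_ROOT/conf" ] || [ ! -r "$GMOD_ROOT/conf" ]; then
        echo "no readable conf directory under $GMOD_ROOT" >&2
        return 1
    fi
    return 0
}
```

Running `check_gmod_root && echo ok` in the same shell session that will run the loader shows whether the setting has taken effect.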

Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that its value can be inserted into the featureprop table

Solution: This error message is followed by SQL statements that insert the term correctly; execute them. One possible explanation for this error is that the source sequence was curated with terms that are not from the Sequence Ontology.

DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...

Solution: The CONTEXT line above tells you what the offending data is. This error probably means that two features in the GFF3 file share the same name or ID and the same feature type. Correct these errors by hand and reload.
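
Candidate duplicates can be located in the GFF3 file before editing it by hand. The sketch below lists every ID and feature type pair that appears on more than one feature line. The function name gff3_duplicate_ids is made up for illustration, and the script assumes tab-delimited GFF3 with ID=... in the ninth column. Note that GFF3 allows a single feature to span several lines with the same ID, so a reported pair is something to inspect, not necessarily an error.

```shell
#!/bin/sh
# Print "ID<TAB>type<TAB>count" for each ID/type pair that occurs on
# more than one feature line. Stops at a ##FASTA directive.
# (gff3_duplicate_ids is a hypothetical helper name.)
gff3_duplicate_ids() {
    awk -F'\t' '
        /^##FASTA/     { exit }
        /^#/ || NF < 9 { next }
        {
            # Column 9 holds semicolon-separated tag=value attributes.
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/)
                    seen[substr(attrs[i], 4) "\t" $3]++
        }
        END { for (k in seen) if (seen[k] > 1) print k "\t" seen[k] }
    ' "$1" | LC_ALL=C sort
}
```

Called as `gff3_duplicate_ids FILE.gff`, it prints nothing when every ID and type pair is unique.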

Authors
