Load GFF Into Chado
Revision as of 18:17, 18 March 2007
Abstract
This HOWTO describes a method for loading sequence annotation data in GFF format into the Chado database.
Authors
Copyright
This document is copyright Scott Cain, 2007. For reproduction other than personal use, please contact <cain@cshl.edu>.
Revision History
Revision 1.0 | 2007-03-16 | BIO | First version
Download the GFF Files
An easy way to load data into the database is to use a GFF3 file and the script load/bin/gmod_bulk_load_gff3.pl. A good set of sample data is the GFF3 file prepared by the nice folks at the Saccharomyces Genome Database:
ftp://ftp.yeastgenome.org/pub/yeast/data_download/chromosomal_feature/saccharomyces_cerevisiae.gff
This file contains Gene Ontology (GO) annotations, so if you didn't load GO when you executed make ontologies, you will get many warning messages about being unable to find entries in the dbxref table. If you want to load GO, you should be able to execute make ontologies and select Gene Ontology for installation.
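Before loading, it can help to check whether a GFF3 file carries GO annotations at all. The sketch below assumes they appear as Ontology_term attributes with GO: identifiers (the GFF3 specification reserves that attribute for ontology cross-references). The function name count_go_lines is hypothetical and not part of GMOD.

```shell
# Hypothetical helper (not part of GMOD): count the lines of a GFF3 file
# whose Ontology_term attribute mentions a GO identifier.
# Assumption: GO annotations are written as Ontology_term=GO:NNNNNNN[,...].
count_go_lines() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at zero when the count is 0.
    grep -c 'Ontology_term=[^;]*GO:' "$1" || true
}
```

For example, count_go_lines saccharomyces_cerevisiae.gff prints the number of annotated lines. A non-zero count suggests loading GO first to avoid the dbxref warnings.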
Add an Entry for Your Organism
You will need an entry for your species in the Chado organism table. If you are unsure whether this entry exists, log into your database and execute this SQL command:
<sql>
select common_name from organism;
</sql>
If you do not see your organism listed, execute a command equivalent to this:
<sql>
insert into organism (abbreviation, genus, species, common_name, organism_id) values ('S.cerevisiae', 'Saccharomyces', 'cerevisiae', 'yeast', 4932);
</sql>
Substitute in the appropriate values for your own organism if it's not yeast.
Load the GFF
Then execute gmod_bulk_load_gff3.pl:
>gmod_bulk_load_gff3.pl --organism yeast --gfffile saccharomyces_cerevisiae.gff
This loads the GFF3 file. The loading script requires GFF3 because that format has tighter control of the syntax and requires the use of a controlled vocabulary (from Sequence Ontology Feature Annotation (SOFA)), which allows mapping to the relational schema. The --gfffile flag supplies the location of the file, and the --organism flag takes the common name (common_name field) from the Chado organism table. Run perldoc gmod_bulk_load_gff3.pl for more information on adding other organisms and databases, as well as other available command line flags.
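Because the loader requires GFF3, a quick check of the version directive can catch an older GFF file before a load is attempted. This is a minimal sketch that assumes the file begins with the ##gff-version 3 directive, which the GFF3 specification requires on the first line. The name is_gff3 is hypothetical and not part of GMOD.

```shell
# Hypothetical pre-flight check (not part of GMOD): succeed only if the
# first line of the file is a GFF version 3 directive.
is_gff3() {
    head -n 1 "$1" | grep -Eq '^##gff-version[[:space:]]+3([^0-9]|$)'
}
```

It could guard the load command shown above, for example:
is_gff3 saccharomyces_cerevisiae.gff && gmod_bulk_load_gff3.pl --organism yeast --gfffile saccharomyces_cerevisiae.gff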
Note that gmod_load_gff3.pl is also available, but it has received limited support and is currently less flexible. It is a good example of how to write code using the Class::DBI classes that are created at install time. For more information on using these classes, see http://sourceforge.net/projects/gmod-ware for a Class::DBI-based middleware/API.
Creating GFF3
GFF3 can also be generated using a script provided by Bioperl, scripts/Bio-DB-GFF/genbank2gff3.pl, which Bioperl installs as bp_genbank2gff3.pl (this script is currently preferred over the script of the same name found in the GMOD package). If your working directory contains a GenBank file you could use it like this:
>bp_genbank2gff3.pl --dir . --outdir .
This method for generating GFF3 files is not completely satisfactory and development is ongoing to provide better translation. However, by proceeding carefully you should be able to get it to produce GFF3 that can be loaded. Possible errors and their fixes are described below.
Unable to find srcfeature <some feature> in the database
Solution: Edit the '##sequence-region' line (the 2nd line) of the GFF3 output. Changing it to '# sequence-region' is enough; alternatively, remove the line.
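The same edit can be scripted rather than done by hand. The following is a minimal sketch, not part of GMOD or Bioperl: the function name comment_out_sequence_region is hypothetical, and it rewrites every line beginning with '##sequence-region', not only the 2nd line.

```shell
# Hypothetical helper (not part of GMOD or Bioperl): turn every
# '##sequence-region' directive into an ordinary '# sequence-region' comment.
# Writes the result to standard output; the input file is left untouched.
comment_out_sequence_region() {
    sed 's/^##sequence-region/# sequence-region/' "$1"
}
```

Redirect the output to a new file and load that file instead of the original.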
Your GFF3 file uses a tag called <term>, but this term is not already in the cvterm and dbxref tables so that it's value can be inserted into the featureprop table
This error message will be followed by SQL statements that insert the term in the correct way - execute them.
DBD::Pg::db pg_endcopy failed: ERROR: duplicate key violates unique constraint "featureprop_c1"
CONTEXT: COPY featureprop, line ...
The CONTEXT: line above identifies the offending data. This error means that two features in the GFF file share the same name or ID. Correct these errors by hand and reload.
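To locate candidates for this error before reloading, the attribute column can be scanned for repeated values. This is a rough sketch, not part of GMOD. It assumes standard nine-column, tab-separated GFF3 with semicolon-separated attributes, and the name find_duplicate_attr is hypothetical. GFF3 allows one discontinuous feature to span several lines with the same ID, so treat the output as a list to inspect, not a list of definite errors.

```shell
# Hypothetical helper (not part of GMOD): print each value of one attribute
# (ID by default) that appears on more than one feature line.
# Usage: find_duplicate_attr file.gff [attribute_key]
find_duplicate_attr() {
    awk -F '\t' -v key="${2:-ID}" '
        /^#/ { next }                      # skip directives and comments
        NF >= 9 {
            n = split($9, attrs, ";")      # column 9 holds key=value pairs
            for (i = 1; i <= n; i++) {
                if (index(attrs[i], key "=") == 1) {
                    val = substr(attrs[i], length(key) + 2)
                    # report a value once, the second time it is seen
                    if (++seen[val] == 2) print val
                }
            }
        }
    ' "$1"
}
```

Passing Name as the second argument checks the Name attribute instead of ID.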
This code needs to be tested. Please help improve this section with your tests.
More Information
See the related HOWTO Load RefSeq Into Chado.
Please send questions to the GMOD developers list:
gmod-devel@lists.sourceforge.net
Or contact the GMOD Help Desk