Instructions
What | Comment |
Principle | The "build a model" (BUILD) method expects an annotated genome as one of its arguments. Two input formats are supported: EMBL and GenBank.
- If a genome consists of one or several entities (contigs, scaffolds, chromosomes, or plasmids), a single file containing the concatenated entries can be supplied.
- Uploading zipped or gzipped genome files is supported, but without any guarantee.
|
Data sources | Genomes from the following public resources have been tested; together they cover numerous species:
- EMBL-formatted genomes can be obtained from the Genomes Pages of ENA for many species of bacteria and archaea (column: Sequence / Plain).
- EMBL-formatted genomes can be obtained from the Ensembl Genomes FTP site.
- GenBank-formatted genomes can be obtained from the Assembly pages of NCBI for many species of bacteria and archaea (follow the link "Download the GenBank assembly" or "Download the RefSeq assembly" and choose the file with the _genomic.gbff.gz extension).
|
Extracted fields | The following fields are extracted from the uploaded genome:
- The ID of every entry is read from the ID or LOCUS field.
- The protein sequences are obtained from the /translation sub-field of the CDS field.
- The location and strand of each gene are retrieved from the gene field.
- A protein identifier is created from the CDS field by looking for one of the /locus_tag, /gene and /protein_id qualifiers.
- All other information is currently ignored.
|
Nota Bene |
- Many software tools produce pseudo-EMBL or pseudo-GenBank formats: these nonstandard variants may not be accepted here, and no support will be given for them.
- For a given gene, the translated CDS sequence from an entry and the protein sequence from the corresponding UniProt entry are not necessarily identical! The EMBL and GenBank databases act as archives that preserve the genome annotations as originally deposited, while UniProt is a curated resource that is regularly updated.
|
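
To check beforehand what a GenBank file would yield for the fields listed under "Extracted fields", the following minimal Python sketch may help. It is not the parser used by the BUILD method. The names open_genome, parse_genbank, _ft, _store and SAMPLE are hypothetical, and so is the sample data. The sketch assumes the standard GenBank column layout (feature key in columns 6-20, location and qualifiers from column 22), handles GenBank only (not EMBL), reads gzip but not zip archives, ignores locations that wrap over several lines, and assumes that /locus_tag takes precedence over /gene and /protein_id, which the text above does not specify.

```python
import gzip
import io


def open_genome(path):
    """Open a plain or gzipped flat file as text (hypothetical helper)."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def _store(feature, entry_id, proteins, genes):
    """File a finished feature under genes or proteins; ignore everything else."""
    if feature is None:
        return
    quals = feature["quals"]
    if feature["key"] == "gene":
        loc = feature["location"]
        strand = "-" if loc.startswith("complement(") else "+"
        genes.append({"entry": entry_id, "location": loc, "strand": strand})
    elif feature["key"] == "CDS" and "translation" in quals:
        # ASSUMPTION: the first qualifier found in this order becomes the identifier.
        ident = next((quals[q] for q in ("locus_tag", "gene", "protein_id")
                      if q in quals), None)
        proteins.append({"entry": entry_id, "id": ident,
                         "protein": quals["translation"].replace(" ", "")})


def parse_genbank(handle):
    """Return (proteins, genes) from a file holding one or more GenBank entries."""
    proteins, genes = [], []
    entry_id, feature, last, in_features = None, None, None, False
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("LOCUS"):
            entry_id = line.split()[1]
        elif line.startswith("FEATURES"):
            in_features = True
        elif in_features and not line.startswith(" "):
            # ORIGIN, CONTIG or // closes the feature table of this entry.
            _store(feature, entry_id, proteins, genes)
            feature, last, in_features = None, None, False
        elif in_features:
            head, body = line[:21].strip(), line[21:].strip()
            if head:  # a new feature key starts here
                _store(feature, entry_id, proteins, genes)
                feature, last = {"key": head, "location": body, "quals": {}}, None
            elif body.startswith("/") and feature is not None:
                name, _, value = body[1:].partition("=")
                feature["quals"][name] = value.strip('"')
                last = name
            elif last is not None:  # continuation of a multi-line qualifier
                feature["quals"][last] += body.strip('"')
    _store(feature, entry_id, proteins, genes)  # file ended without a terminator
    return proteins, genes


def _ft(key, text):
    """Format one feature-table line: key in columns 6-20, text from column 22."""
    return f"     {key:<16}{text}"


# Invented two-entry sample: a chromosome and a plasmid concatenated in one file.
SAMPLE = "\n".join([
    "LOCUS       CHR1   30 bp   DNA   circular BCT 01-JAN-2000",
    "FEATURES             Location/Qualifiers",
    _ft("gene", "complement(1..30)"),
    _ft("", '/locus_tag="T_0001"'),
    _ft("CDS", "complement(1..30)"),
    _ft("", '/locus_tag="T_0001"'),
    _ft("", '/protein_id="AAA00001.1"'),
    _ft("", '/translation="MKTAYIAKQ'),
    _ft("", 'RQISFVKSH"'),
    "ORIGIN",
    "//",
    "LOCUS       PLASMID1   12 bp   DNA   circular BCT 01-JAN-2000",
    "FEATURES             Location/Qualifiers",
    _ft("CDS", "1..12"),
    _ft("", '/gene="repA"'),
    _ft("", '/translation="MSEL"'),
    "//",
])

proteins, genes = parse_genbank(io.StringIO(SAMPLE))
for p in proteins:
    print(p["entry"], p["id"], p["protein"])
```

For a real file, the same function could be called as parse_genbank(open_genome("my_genome_genomic.gbff.gz")), where the file name is only a placeholder. A CDS without a /translation qualifier yields no protein in this sketch, which is one quick way to spot files that lack the protein sequences described above.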