SWISS-PROT RELEASE 35.0 RELEASE NOTES
1. INTRODUCTION
Release 35.0 of SWISS-PROT contains 69'113 sequence entries, comprising
25'083'768 amino acids abstracted from 59'101 references. This
represents an increase of 18.3% in the number of amino acids over
release 34. The growth of the data bank is summarized below.
Release  Date   Number of entries  Number of amino acids
2.0 09/86 3939 900 163
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
26.0 07/93 31808 10 875 091
27.0 10/93 33329 11 484 420
28.0 02/94 36000 12 496 420
29.0 06/94 38303 13 464 008
30.0 10/94 40292 14 147 368
31.0 02/95 43470 15 335 248
32.0 11/95 49340 17 385 503
33.0 02/96 52205 18 531 384
34.0 10/96 59021 21 210 389
35.0 11/97 69113 25 083 768
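The growth between the last two releases can be recomputed from the
table. The short Python sketch below is only an illustration (the
variable and function names are ours, not part of SWISS-PROT); it uses
the figures of the last two rows and shows that the 18.3% quoted in the
introduction corresponds to the amino acid count, while the number of
entries grew by about 17.1%.

```python
# Figures copied from the last two rows of the growth table above.
releases = {
    "34.0": {"entries": 59021, "amino_acids": 21210389},
    "35.0": {"entries": 69113, "amino_acids": 25083768},
}


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(
    releases["34.0"]["entries"], releases["35.0"]["entries"]
)
amino_acid_growth = percent_increase(
    releases["34.0"]["amino_acids"], releases["35.0"]["amino_acids"]
)

print(f"entries:     {entry_growth:.1f}%")
print(f"amino acids: {amino_acid_growth:.1f}%")
```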
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 34
2.1 Sequences and annotations
10'189 sequences have been added since release 34, the sequence data of
1654 existing entries has been updated and the annotations of 15'683
entries have been revised.
2.2 What's happening with the model organisms
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
. Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates;
. Provide a higher level of annotation;
. Provide cross-references to specialized database(s) that contain,
among other data, some genetic information about the genes that code
for these proteins;
. Provide specific indices or documents.
What was done since the last release or in preparation for the next
release concerning model organisms:
. We have added Methanococcus jannaschii, Helicobacter pylori and
Synechocystis PCC 6803 to the list of model organisms. The genomes of
these organisms have been completely sequenced and we plan to annotate
them fully in SWISS-PROT. Specific documents have been added (see
section 4) for each of these organisms.
. We have also added mouse (Mus musculus) as a model organism. A
significant effort has been made to add new mouse sequences (542 have
been added since the last release); we have added links to MGD (the
Mouse Genome Database; see section 2.5.2) and we have also created a
specific document (MGDTOSP.TXT) that lists the cross-references
between MGD and SWISS-PROT.
. We have continued our effort in catching up with the backlog of
sequences from other model organisms. In particular we added 410
entries from yeast, 644 from human, 89 from S.pombe, 527 from
C.elegans, 95 from A.thaliana and 92 from D.melanogaster.
. We have added to SWISS-PROT all the sequences from yeast chromosome
XIII. We plan to integrate data from the remaining chromosomes (IV,
XII, XV and XVI) very soon so as to have a complete set of annotated
yeast sequences.
. We have finished the annotation of all Mycoplasma genitalium entries.
. We plan to finish as quickly as possible the annotation of the
Escherichia coli and Haemophilus influenzae sequence entries which are
not yet part of SWISS-PROT.
Here is the current status of the model organisms in SWISS-PROT:
Organism        Database          Index file      Number of
                cross-referenced                  sequences
--------------  ----------------  --------------  ---------
A.thaliana      None yet          In preparation        658
B.subtilis      SubtiList         SUBTILIS.TXT         1882
C.albicans      None yet          CALBICAN.TXT          167
C.elegans       Wormpep           CELEGANS.TXT         1735
D.discoideum    DictyDB           DICTY.TXT             272
D.melanogaster  FlyBase           FLY.TXT              1002
E.coli          EcoGene           ECOLI.TXT            4098
H.influenzae    HiDB (TIGR)       HAEINFLU.TXT         1687
H.sapiens       MIM               MIMTOSP.TXT          4644
H.pylori        HpDB (TIGR)       HPYLORI.TXT           257
M.genitalium    MgDB (TIGR)       MGENITAL.TXT          470
M.musculus      MGD               MGDTOSP.TXT          2971
M.jannaschii    MjDB (TIGR)       MJANNASC.TXT         1064
M.tuberculosis  None yet          None yet              796
S.cerevisiae    SGD               YEAST.TXT            4750
S.typhimurium   StyGene           SALTY.TXT             680
S.pombe         None yet          POMBE.TXT            1045
S.solfataricus  None yet          None yet               42
Collectively the entries from the above model organisms represent 35.4%
of all SWISS-PROT entries.
2.3 Changes affecting the accession numbers
With the creation of the TrEMBL database (see section 6) and the rapid
increase in the amount of sequence data, we are faced with a problem of
availability of accession numbers. Currently we use a system based on a
one-letter prefix followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved the prefix
letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
out of space (mainly because of ESTs), have been forced to start using a
new format based on a two-letter prefix followed by 6 digits.
We have used up all possible numbers with 'P' and 'Q' and the only
letter prefix which was not used by the nucleotide databases is 'O'. As
we believe that changing the format of the accession numbers to that
now used by the nucleotide databases would wreak havoc on the numerous
software packages using SWISS-PROT, we have decided to keep a system of
accession numbers based on a six-character code, but with the following
changes:
1) We have started using 'O'. This extra letter should allow the
continuation of the present format (1 prefix letter + 5 digits) for
approximately one year.
2) Once we have used up 'O', we will introduce a system
based on the following format:
   1        2        3            4            5          6
[O,P,Q]   [0-9]  [A-Z, 0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [0-9]
What the above means is that we will keep a six-character code, but that
in positions 3, 4 and 5 of this code any combination of letters and
numbers can be present. This format allows a total of 14 million
accession numbers (up from 300'000 with the current system).
We only allow numbers in positions 2 and 6 so that SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or any other kind of word.
Examples: P0A3S2, Q2ASD4, O13YX2, P9B123
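The extended format can be expressed as a regular expression, and the
capacity figures quoted above can be checked by simple counting. The
following Python sketch is an illustration only; the names EXTENDED_AC,
is_valid_accession, capacity_current and capacity_extended are our own
and are not part of any SWISS-PROT software.

```python
import re

# Extended six-character format described in section 2.3:
#   position 1:     O, P or Q
#   position 2:     a digit
#   positions 3-5:  a letter or a digit
#   position 6:     a digit
# Accession numbers in the current format (one prefix letter followed by
# five digits) also match this pattern, since digits are allowed in
# positions 3 to 5.
EXTENDED_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(ac: str) -> bool:
    """Return True if ac has the form of a SWISS-PROT accession number."""
    return EXTENDED_AC.fullmatch(ac) is not None


def capacity_current() -> int:
    """Three prefix letters, each followed by five digits."""
    return 3 * 10**5


def capacity_extended() -> int:
    """3 letters x 10 digits x 36^3 letter/digit combinations x 10 digits."""
    return 3 * 10 * 36**3 * 10


for example in ("P0A3S2", "Q2ASD4", "O13YX2", "P9B123"):
    print(example, is_valid_accession(example))

print(capacity_current())   # 300000
print(capacity_extended())  # 13996800, i.e. about 14 million
```

Because positions 2 and 6 must be digits, a string such as "P0A3SX"
(letter in position 6) is rejected by the pattern.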
2.4 Introduction of a new CC line-type topic (DATABASE)
There is an increasing number of databases that cater for a specific
protein or for a very limited number of proteins. Most of these
databases are mutation databases, reporting defects linked to a genetic
disease. We want to add cross-references to these databases when they
are available electronically, either by WWW or by FTP. In this release
we have therefore added a new comments (CC) line-type 'topic',
"DATABASE", whose syntax is the following:
CC -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][;
FTP="Address"].
Where
`NAME' is the name of the database;
`NOTE' (optional) is a free text note;
`WWW' (optional) is the WWW address (URL) of the database;
`FTP' (optional) is the anonymous FTP address (including the directory
name) where the database file(s) are stored.
Examples of its usage:
CC -!- DATABASE: NAME=CD40Lbase;
CC NOTE=European CD40L defect database (mutation db);
CC WWW="http://www.expasy.ch/www/cd40lbase.html";
CC FTP="ftp://www.expasy.ch/databases/cd40lbase".
CC -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".
Please note that this topic, along with some forms of the DR lines (see
next section), is the first occurrence in SWISS-PROT of lower case
characters (yes, we plan to go to mixed case soon!).
It is also, currently, the only part of SWISS-PROT where lines longer
than 75 characters can be found, as we do not reformat long URL or FTP
addresses.
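For software developers, the DATABASE topic can be read with a few lines
of code. The Python sketch below is a minimal illustration, not an
official parser: the function name parse_database_topic is hypothetical,
and the code assumes that the lines of one topic have already been
isolated, that quoted addresses contain no double quote, and that
unquoted text contains no semicolon.

```python
import re

# One field: a key, then either a double-quoted value or free text that
# runs up to the next semicolon.
FIELD = re.compile(r'(NAME|NOTE|WWW|FTP)=(?:"([^"]*)"|([^;]*))')


def parse_database_topic(lines):
    """Parse the CC lines of one DATABASE topic into a dictionary."""
    # Drop the "CC" line code and join the continuation lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith("CC"))
    prefix = "-!- DATABASE:"
    if not text.startswith(prefix):
        raise ValueError("not a DATABASE topic")
    text = text[len(prefix):].strip()
    # The topic is terminated by a period.
    if text.endswith("."):
        text = text[:-1]
    fields = {}
    for key, quoted, bare in FIELD.findall(text):
        fields[key] = quoted if quoted else bare.strip()
    return fields


example = [
    'CC   -!- DATABASE: NAME=CD40Lbase;',
    'CC       NOTE=European CD40L defect database (mutation db);',
    'CC       WWW="http://www.expasy.ch/www/cd40lbase.html";',
    'CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".',
]
for key, value in parse_database_topic(example).items():
    print(key, "=", value)
```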
2.5 Changes concerning cross-references (DR line)
2.5.1 TIGR
We have added cross-references from SWISS-PROT to the TIGR database, a
collection of genomic databases for microbes, plants and animals
maintained by The Institute for Genomic Research (TIGR) in Rockville,
Maryland, USA. These cross-references are present in the DR lines:
Data bank identifier : TIGR
Primary identifier : The genome Open Reading Frame (ORF) code
Secondary identifier : Not defined, a dash ("-") is stored in that field
Example : DR TIGR; HP1563; -.
2.5.2 MGD
We have added cross-references from SWISS-PROT to the Mouse Genome
Database (MGD), maintained by The Jackson Laboratory in Bar Harbor,
Maine, USA. These cross-references are present in the DR lines:
Data bank identifier : MGD
Primary identifier : The accession number
Secondary identifier : The gene designation
Example : DR MGD; MGI:109323; HTR2B.
2.5.3 LISTA
We have removed the cross-references from SWISS-PROT to the LISTA
database which is no longer maintained and which has been superseded by
the SGD database to which SWISS-PROT is fully cross-referenced.
2.5.4 PROSITE
The format for cross-references to the PROSITE protein domain and family
database used to be:
DR PROSITE; ACCESSION_NUMBER; ENTRY_NAME.
It has been changed to:
DR PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE
pattern or profile entry, 'ENTRY_NAME' is the name of the entry and
'STATUS' is one of the following:
n
FALSE_NEG
PARTIAL
UNKNOWN_n
Where "n" is the number of hits of the pattern or profile in that
particular protein sequence. The "FALSE_NEG" status indicates that while
the pattern or profile did not detect the protein sequence, it is a
member of that particular family or domain. The "PARTIAL" status
indicates that the pattern or profile did not detect the sequence
because that sequence is not complete and lacks the region on which the
pattern/profile is based. Finally, the "UNKNOWN" status indicates
uncertainty as to whether the sequence is a member of the family
or domain described by the pattern/profile.
Example of PROSITE cross-references:
DR PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
DR PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
DR PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.
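Programs that read PROSITE cross-references need to handle the new
fourth field. The Python sketch below shows one possible way to do so;
it is an illustration under our own assumptions. The function name
parse_prosite_dr and the label "MATCH" (used here for a purely numeric
status) are ours and do not appear in SWISS-PROT.

```python
def parse_prosite_dr(line: str) -> dict:
    """Split a four-field PROSITE DR line and interpret its STATUS field."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip()
    # DR lines are terminated by a period.
    if body.endswith("."):
        body = body[:-1]
    parts = [part.strip() for part in body.split(";")]
    if len(parts) != 4 or parts[0] != "PROSITE":
        raise ValueError("not a four-field PROSITE cross-reference")
    _, accession, entry_name, status = parts

    hits = None
    unknown_prefix = "UNKNOWN_"
    if status.isdigit():
        # STATUS is "n": the pattern or profile hits the sequence n times.
        kind, hits = "MATCH", int(status)
    elif status.startswith(unknown_prefix) and status[len(unknown_prefix):].isdigit():
        kind, hits = "UNKNOWN", int(status[len(unknown_prefix):])
    elif status in ("FALSE_NEG", "PARTIAL"):
        kind = status
    else:
        raise ValueError("unexpected STATUS value: " + status)
    return {
        "accession": accession,
        "entry_name": entry_name,
        "status": kind,
        "hits": hits,
    }


print(parse_prosite_dr("DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6."))
print(parse_prosite_dr("DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL."))
```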
2.5.5 REBASE
Two small changes have been made to the syntax of cross-references to
the REBASE database:
- REBASE has recently changed its accession numbers to add an additional
digit (an extra leading zero).
- We are now using mixed case characters in the secondary identifier
(the name of the restriction system) so as to represent exactly the
information as stored in REBASE.
Example:
DR REBASE; RB0005; ECORI.
has been changed to:
DR REBASE; RB00005; EcoRI.
3. PLANNED CHANGES
3.1 Extension of the accession number system
As already explained in detail under 2.3, we will extend the accession
number system once we have used up the 'O' series of accession
numbers. We expect this to happen in October 1998.
3.2 Switch to the NCBI taxonomy
To standardize the taxonomies used by different databases, we will
change our taxonomy with release 37. We will switch to the NCBI
taxonomy, which is already used as the common taxonomy by the
DDBJ/EMBL/GenBank nucleotide sequence databases.
3.3 Introduction of RT lines
With release 37 we will introduce a new line type, the RT (Reference
Title) line. This optional line will be placed between the RA and RL
lines. The RT line gives the title of the paper (or other work) as
exactly as possible given the limitations of the computer character set.
The form which will be used is that which would be used in a citation
rather than displayed at the top of the published paper. For instance,
where journals capitalize major title words this is not preserved. The
title is enclosed in double quotes, and may be continued over several
lines as necessary. The title lines are terminated by a semicolon. An
example of the use of RT lines is shown below:
RT "Sequence analysis of the genome of the unicellular cyanobacterium
RT Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT region from map positions 64% to 92% of the genome.";
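Since a title may span several RT lines, programs will have to join the
lines before removing the enclosing quotes and the terminating
semicolon. The Python sketch below illustrates this; the function name
join_rt_lines is hypothetical, and the code assumes that continuation
lines are simply joined with a single space.

```python
def join_rt_lines(lines):
    """Rebuild a reference title from the RT lines of one reference."""
    # Drop the "RT" line code and join the pieces with single spaces.
    parts = [line[2:].strip() for line in lines if line.startswith("RT")]
    text = " ".join(parts)
    # The title lines are terminated by a semicolon.
    if not text.endswith(";"):
        raise ValueError("RT lines must be terminated by a semicolon")
    text = text[:-1].strip()
    # The title itself is enclosed in double quotes.
    if len(text) < 2 or not (text.startswith('"') and text.endswith('"')):
        raise ValueError("title must be enclosed in double quotes")
    return text[1:-1]


example = [
    'RT   "Sequence analysis of the genome of the unicellular cyanobacterium',
    'RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb',
    'RT   region from map positions 64% to 92% of the genome.";',
]
print(join_rt_lines(example))
```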
4. STATUS OF THE DOCUMENTATION FILES
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. Since release 34, we have added 15 new
document files. The following table lists all the documents that are
currently available.
USERMAN.TXT User manual
RELNOTES.TXT Release notes
SHORTDES.TXT Short description of entries in SWISS-PROT
JOURLIST.TXT List of abbreviations for journals cited
KEYWLIST.TXT List of keywords in use
SPECLIST.TXT List of organism identification codes
TISSLIST.TXT List of tissues
EXPERTS.TXT List of on-line experts for PROSITE and SWISS-PROT
SUBMIT.TXT Submission of sequence data to SWISS-PROT
ACINDEX.TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
DELETEAC.TXT Deleted accession number index [1]
7TMRLIST.TXT List of 7-transmembrane G-linked receptor entries
AATRNASY.TXT List of aminoacyl-tRNA synthetases
ALLERGEN.TXT Nomenclature and index of allergen sequences
BLOODGRP.TXT List of blood group antigen proteins [1]
CALBICAN.TXT Index of Candida albicans entries and their
corresponding gene designations
CDLIST.TXT CD nomenclature for surface proteins of human
leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene Wormpep cross-references
DICTY.TXT Index of Dictyostelium discoideum entries and
their corresponding gene designations and DictyDb
cross-references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database
entries referenced in SWISS-PROT
ECOLI.TXT Index of Escherichia coli K12 chromosomal entries
and their corresponding EcoGene cross-references
EMBLTOSP.TXT Index of EMBL Database entries referenced in
SWISS-PROT
EXTRADOM.TXT Nomenclature of extracellular domains
FLY.TXT Index of Drosophila entries and FlyBase cross-
references [1]
GLYCOSID.TXT Classification of glycosyl hydrolase families and
index of glycosyl hydrolase entries
HAEINFLU.TXT Index of Haemophilus influenzae RD chromosomal
entries
HOXLIST.TXT Vertebrate homeotic Hox proteins: nomenclature and
index
HPYLORI.TXT Index of Helicobacter pylori strain 26695
chromosomal entries [1]
HUMCHR18.TXT Index of protein sequence entries encoded on human
chromosome 18 [1]
HUMCHR19.TXT Index of protein sequence entries encoded on human
chromosome 19 [1]
HUMCHR20.TXT Index of protein sequence entries encoded on human
chromosome 20
HUMCHR21.TXT Index of protein sequence entries encoded on human
chromosome 21
HUMCHR22.TXT Index of protein sequence entries encoded on human
chromosome 22
HUMCHRX.TXT Index of protein sequence entries encoded on human
chromosome X
HUMCHRY.TXT Index of protein sequence entries encoded on human
chromosome Y
INITFACT.TXT List and index of translation initiation factors [1]
MIMTOSP.TXT Index of MIM entries referenced in SWISS-PROT
METALLO.TXT Classification of metallothioneins and index of
entries in SWISS-PROT [1]
MGDTOSP.TXT Index of MGD entries referenced in SWISS-PROT [1]
MGENITAL.TXT Index of Mycoplasma genitalium chromosomal entries
[1]
MJANNASC.TXT Index of Methanococcus jannaschii entries [1]
NGR234.TXT Table of putative genes in Rhizobium plasmid
pNGR234a [1]
NOMLIST.TXT List of nomenclature related references for
proteins
PCC6803.TXT Index of Synechocystis strain PCC 6803 entries [1]
PDBTOSP.TXT Index of X-ray crystallography Protein Data Bank
(PDB) entries referenced in SWISS-PROT
PEPTIDAS.TXT Classification of peptidase families and index of
peptidase entries
PLASTID.TXT List of chloroplast and cyanelle encoded proteins
POMBE.TXT Index of Schizosaccharomyces pombe entries in
SWISS-PROT and their corresponding gene
designations
RESTRIC.TXT List of restriction enzyme and methylase entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families
on the basis of sequence similarities
SALTY.TXT Index of Salmonella typhimurium LT2 chromosomal
entries and their corresponding StyGene cross-
references
SUBTILIS.TXT Index of Bacillus subtilis 168 chromosomal entries
and their corresponding SubtiList cross-references
UPFLIST.TXT UPF (Uncharacterized Protein Families) list and
index of members [1]
YEAST.TXT Index of Saccharomyces cerevisiae entries and
their corresponding gene designations
YEAST1.TXT Yeast Chromosome I entries
YEAST2.TXT Yeast Chromosome II entries
YEAST3.TXT Yeast Chromosome III entries
YEAST5.TXT Yeast Chromosome V entries
YEAST6.TXT Yeast Chromosome VI entries
YEAST7.TXT Yeast Chromosome VII entries
YEAST8.TXT Yeast Chromosome VIII entries
YEAST9.TXT Yeast Chromosome IX entries
YEAST10.TXT Yeast Chromosome X entries
YEAST11.TXT Yeast Chromosome XI entries
YEAST13.TXT Yeast Chromosome XIII entries [1]
YEAST14.TXT Yeast Chromosome XIV entries
Notes:
[1] New in release 35.
We have continued to include in some SWISS-PROT document files the
references of World-Wide Web sites relevant to the subject under
consideration. There are now 12 documents that include such links.
5. THE EXPASY WORLD-WIDE WEB SERVER
5.1 Background information
The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
server was made available to the public in September 1993; it is
reachable at the following address:
http://www.expasy.ch/
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, SubtiList/NRSub, OMIM, PDB, HSSP, ProDom, REBASE, SGD,
YEPD and Medline. ExPASy also offers many tools for the analysis of
protein sequences and 2D gels.
5.2 SWISS-SHOP
We provide, on ExPASy, a service called SWISS-SHOP. SWISS-SHOP allows
any user of SWISS-PROT to indicate which proteins he/she is interested
in. This can be done using various criteria that can be combined:
- By entering one or more words that should be present in the
description line;
- By entering one or more species name(s) or taxonomic division(s);
- By entering one or more keywords;
- By entering one or more author names;
- By entering the accession number (or entry name) of a PROSITE
pattern or a user-defined sequence pattern;
- By entering the accession number (or entry name) of an existing
SWISS-PROT entry or by entering a "private" sequence.
Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent by electronic mail.
5.3 What is new on ExPASy
ExPASy is constantly modified and improved. If you wish to be informed
of the changes made to the server, you can either:
- Read the document "History of changes, improvements and new
features" which is available at the address:
http://www.expasy.ch/www/history.html
- Subscribe to SWISS-Flash, a service that reports news of databases,
software and services developments. By subscribing to this service,
you will automatically get SWISS-Flash bulletins by electronic
mail. To subscribe use the address:
http://www.expasy.ch/www/swiss-flash.html
6. TREMBL - A SUPPLEMENT TO SWISS-PROT
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into SWISS-
PROT. Since we do not want to dilute the quality standards of SWISS-PROT
by incorporating sequences into the database without proper sequence
analysis and annotation, we cannot speed up the incorporation of new
incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except those
already included in SWISS-PROT.
We name this supplement TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT
release is supplemented by TrEMBL release 5. TrEMBL is split into two
main sections, SP-TrEMBL and REM-TrEMBL:
- SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (140'555 in release
5) which should be incorporated into SWISS-PROT. SWISS-PROT accession
numbers have been assigned for all SP-TrEMBL entries.
- REM-TrEMBL (REMaining TrEMBL) contains the entries (25'806 in release
5) that we do not want to include in SWISS-PROT for a variety of
reasons (synthetic sequences, pseudogenes, translations of incorrect
open reading frames, fragments with fewer than eight amino acids,
patent-derived sequences, immunoglobulins and T-cell receptors, etc.).
TrEMBL is available by FTP from the EBI server (ftp.ebi.ac.uk) in the
directory '/pub/databases/trembl'. It can be queried on WWW by the EBI
SRS server (http://www.ebi.ac.uk/). It is also available on the SWISS-
PROT CD-ROM and is searchable on the FASTA, BIC_SW and BLAST servers of
the EBI.
7. WEEKLY UPDATES OF SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:
new_seq.dat Contains all the new entries since the last full release;
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous FTP
servers:
Organization ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk (or 193.62.196.6)
Directory /pub/databases/swissprot/new
!!! Important notes !!!
- Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse
between two updates.
- Due to the current mechanism used to build a release, the entries that
are provided in these updates are not guaranteed to be error free.
8. ENZYME AND PROSITE
8.1 The ENZYME data bank
Release 22.0 of the ENZYME data bank is distributed with release 35 of
SWISS-PROT. ENZYME release 22.0 contains information relative to 3651
enzymes.
8.2 The PROSITE data bank
Release 14.0 of the PROSITE data bank is distributed with release 35 of
SWISS-PROT. This release of PROSITE contains 997 documentation entries
that describe 1'335 different patterns, rules and profiles/matrices.
Release 14.0 is the first completely new release of PROSITE since
November 1995. Since that date we have added 114 entries and modified
566 entries. The long time that elapsed between this release of PROSITE
and the last one is partially due to a complete rewriting of the
software tools that maintain the database and allow it to be
bidirectionally linked to SWISS-PROT. Thanks to those changes, we will now
be able to produce PROSITE releases at each release of SWISS-PROT and
also to offer on the ExPASy server frequent updates of the database.
9. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate it if
you notified us when sequences belonging to your field of expertise are
missing from the data bank. We would also like to be notified about
annotations to be updated if, for example, the function of a protein
has been clarified or if new post-translational information has become
available. To facilitate such feedback, we offer on the ExPASy WWW
server a form that allows the submission of updates and/or corrections
to SWISS-PROT:
http://www.expasy.ch/sprot/sp_update_form.html
It is also possible, from any entry in SWISS-PROT displayed by the
ExPASy server, to submit updates and/or corrections for that particular
entry. Finally, you can also send your comments by electronic mail to
the address:
swiss-prot@expasy.ch
========================================================================
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.57 Gln (Q) 4.00 Leu (L) 9.39 Ser (S) 7.15
Arg (R) 5.15 Glu (E) 6.34 Lys (K) 5.95 Thr (T) 5.70
Asn (N) 4.50 Gly (G) 6.83 Met (M) 2.35 Trp (W) 1.24
Asp (D) 5.29 His (H) 2.23 Phe (F) 4.08 Tyr (Y) 3.18
Cys (C) 1.68 Ile (I) 5.78 Pro (P) 4.91 Val (V) 6.55
Asx (B) 0.001 Glx (Z) 0.001 Xaa (X) 0.01
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
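The classification above follows directly from the percentages given in
A.1.1. As an illustration, the following Python sketch (variable names
are ours) sorts the twenty standard amino acids by decreasing frequency
and reproduces the same order.

```python
# Percentages copied from table A.1.1 (the twenty standard amino acids).
composition = {
    "Ala": 7.57, "Arg": 5.15, "Asn": 4.50, "Asp": 5.29, "Cys": 1.68,
    "Gln": 4.00, "Glu": 6.34, "Gly": 6.83, "His": 2.23, "Ile": 5.78,
    "Leu": 9.39, "Lys": 5.95, "Met": 2.35, "Phe": 4.08, "Pro": 4.91,
    "Ser": 7.15, "Thr": 5.70, "Trp": 1.24, "Tyr": 3.18, "Val": 6.55,
}

# Sort the three-letter codes from the most to the least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)

print(", ".join(ranking))
```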
A.2 Distribution of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 5713
The first twenty species represent 34020 sequences: 49.2 % of the total
number of entries.
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2609
2x: 891
3x: 480
4x: 321
5x: 225
6x: 209
7x: 148
8x: 94
9x: 113
10x: 58
11- 20x: 261
21- 50x: 165
51-100x: 64
>100x: 75
A.2.2 Table of the most represented species
Number Frequency Species
1 4750 Baker's yeast (Saccharomyces cerevisiae)
2 4644 Human
3 4098 Escherichia coli
4 2971 Mouse
5 2398 Rat
6 1882 Bacillus subtilis
7 1735 Caenorhabditis elegans
8 1687 Haemophilus influenzae
9 1064 Methanococcus jannaschii
10 1047 Bovine
11 1045 Fission yeast (Schizosaccharomyces pombe)
12 1002 Fruit fly (Drosophila melanogaster)
13 799 Chicken
14 786 Mycobacterium tuberculosis
15 680 Salmonella typhimurium
16 658 Arabidopsis thaliana (Mouse-ear cress)
17 648 African clawed frog (Xenopus laevis)
18 551 Pig
19 541 Rabbit
20 494 Synechocystis sp. (strain PCC 6803)
21 489 Mycoplasma pneumoniae
22 470 Mycoplasma genitalium
23 403 Rhizobium sp. (strain NGR234)
24 398 Maize
25 340 Pseudomonas aeruginosa
26 292 Rice
27 273 Bacteriophage T4
28 272 Slime mold (Dictyostelium discoideum)
29 257 Helicobacter pylori
30 256 Tobacco
31 253 Vaccinia virus (strain Copenhagen)
32 248 Dog
33 231 Pea
34 223 Sheep
35 219 Porphyra purpurea
36 209 Barley
37 203 Neurospora crassa
38 199 Wheat
199 Staphylococcus aureus
40 196 Mycobacterium leprae
41 193 Human cytomegalovirus (strain AD169)
42 192 Soybean
43 190 Klebsiella pneumoniae
44 184 Vaccinia virus (strain WR)
45 183 Rhodobacter capsulatus
183 Pseudomonas putida
47 180 Bacillus stearothermophilus
48 175 Potato
49 174 Tomato
50 167 Candida albicans
51 162 Agrobacterium tumefaciens
52 156 Spinach
53 154 Rhizobium meliloti
154 Autographa californica nuclear polyhedrosis virus
55 151 Chlamydomonas reinhardtii
56 150 Marchantia polymorpha (Liverwort)
57 149 Guinea pig
58 146 Variola virus
59 145 Cyanophora paradoxa
60 139 Odontella sinensis
61 138 Aspergillus nidulans
62 134 Orgyia pseudotsugata multicapsid polyhedrosis virus
63 132 Lactococcus lactis (subsp. lactis)
64 131 Streptomyces coelicolor
65 122 Thermus aquaticus (subsp. thermophilus)
66 120 Horse
67 116 Golden hamster
68 113 Trypanosoma brucei brucei
113 Anabaena sp. (strain PCC 7120)
113 Synechococcus sp. (strain PCC 7942)
71 108 Kluyveromyces lactis
72 107 Bombyx mori (Silk moth)
73 105 Bradyrhizobium japonicum
105 Alcaligenes eutrophus
75 102 Yersinia enterocolitica
A.3 Distribution of the sequences by size
From To Number From To Number
1- 50 2882 1001-1100 627
51- 100 5886 1101-1200 484
101- 150 8453 1201-1300 339
151- 200 6661 1301-1400 226
201- 250 6184 1401-1500 186
251- 300 5742 1501-1600 115
301- 350 5369 1601-1700 102
351- 400 5392 1701-1800 79
401- 450 4149 1801-1900 86
451- 500 3905 1901-2000 52
501- 550 2927 2001-2100 30
551- 600 2053 2101-2200 67
601- 650 1560 2201-2300 64
651- 700 1159 2301-2400 32
701- 750 1032 2401-2500 39
751- 800 831 >2500 203
801- 850 652
851- 900 685
901- 950 464
951-1000 396
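The size classes above can be checked against the total number of
entries in the release. The Python sketch below is an illustration only
(the variable names are ours); it adds up the counts of the table and
computes the share of sequences of at most 500 residues.

```python
# (size class, number of sequences), copied from the table above.
size_bins = [
    ("1-50", 2882), ("51-100", 5886), ("101-150", 8453), ("151-200", 6661),
    ("201-250", 6184), ("251-300", 5742), ("301-350", 5369), ("351-400", 5392),
    ("401-450", 4149), ("451-500", 3905), ("501-550", 2927), ("551-600", 2053),
    ("601-650", 1560), ("651-700", 1159), ("701-750", 1032), ("751-800", 831),
    ("801-850", 652), ("851-900", 685), ("901-950", 464), ("951-1000", 396),
    ("1001-1100", 627), ("1101-1200", 484), ("1201-1300", 339),
    ("1301-1400", 226), ("1401-1500", 186), ("1501-1600", 115),
    ("1601-1700", 102), ("1701-1800", 79), ("1801-1900", 86),
    ("1901-2000", 52), ("2001-2100", 30), ("2101-2200", 67),
    ("2201-2300", 64), ("2301-2400", 32), ("2401-2500", 39), (">2500", 203),
]

total = sum(count for _, count in size_bins)
# The first ten classes cover sequences of 1 to 500 residues.
up_to_500 = sum(count for _, count in size_bins[:10])

print("total sequences:", total)
print(f"share of sequences of at most 500 residues: {up_to_500 / total:.1%}")
```

The total equals the number of entries in release 35 (69'113).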
A.4 Longest sequences
The longest sequences (>=4000 residues) are listed here:
HTS1_COCCA 5217
MUC2_HUMAN 5179
FAT_DROME 5147
RYNR_RABIT 5037
RYNR_PIG 5035
RYNR_HUMAN 5032
RYNC_RABIT 4969
LRP_CAEEL 4753
DYHC_DICDI 4725
PLEC_RAT 4687
LRP2_RAT 4660
DYHC_RAT 4644
DYHC_DROME 4639
DYHC_CAEEL 4568
DYHB_CHLRE 4568
APB_HUMAN 4563
APOA_HUMAN 4548
LRP1_HUMAN 4544
LRP1_CHICK 4543
DYHC_PARTE 4540
RRPA_CVMJH 4488
DYHG_CHLRE 4485
DYHC_ANTCR 4466
DYHC_TRIGR 4466
GRSB_BACBR 4451
PKSK_BACSU 4447
PKSL_BACSU 4427
PGBM_HUMAN 4393
YP73_CAEEL 4385
DYHC_NEUCR 4367
DYHC_NECHA 4349
DYHC_EMENI 4344
PKD1_HUMAN 4303
DYHC_YEAST 4092
RRPA_CVH22 4085
A.5 Statistics for journal citations
Total number of journals cited in this release of SWISS-PROT: 861
A.5.1 Table of the frequency of journal citations
Journals cited 1x: 326
2x: 117
3x: 61
4x: 39
5x: 30
6x: 23
7x: 14
8x: 13
9x: 10
10x: 12
11- 20x: 66
21- 50x: 58
51-100x: 23
>100x: 69
A.5.2 List of the most cited journals in SWISS-PROT
Citations Journal abbreviation
--------- ----------------------------------
6038 J. BIOL. CHEM.
3672 PROC. NATL. ACAD. SCI. U.S.A.
3356 NUCLEIC ACIDS RES.
2604 J. BACTERIOL.
2352 GENE
1992 FEBS LETT.
1853 EUR. J. BIOCHEM.
1693 BIOCHEM. BIOPHYS. RES. COMMUN.
1651 EMBO J.
1596 BIOCHEMISTRY
1540 NATURE
1367 BIOCHIM. BIOPHYS. ACTA
1244 J. MOL. BIOL.
1177 CELL
1137 MOL. CELL. BIOL.
920 MOL. GEN. GENET.
899 PLANT MOL. BIOL.
850 BIOCHEM. J.
764 SCIENCE
750 VIROLOGY
748 GENOMICS
731 MOL. MICROBIOL.
661 J. BIOCHEM.
502 J. VIROL.
444 J. CELL BIOL.
439 YEAST
435 J. GEN. VIROL.
418 PLANT PHYSIOL.
381 GENES DEV.
333 HUM. MOL. GENET.
323 J. IMMUNOL.
313 CURR. GENET.
305 ARCH. BIOCHEM. BIOPHYS.
303 INFECT. IMMUN.
287 ONCOGENE
287 MOL. BIOCHEM. PARASITOL.
262 BIOL. CHEM. HOPPE-SEYLER
248 FEMS MICROBIOL. LETT.
230 MOL. ENDOCRINOL.
230 HUM. MUTAT.
220 J. CLIN. INVEST.
220 AM. J. HUM. GENET.
219 NAT. GENET.
219 DEVELOPMENT
216 J. GEN. MICROBIOL.
213 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
194 J. MOL. EVOL.
185 GENETICS
180 STRUCTURE
178 MICROBIOLOGY
177 BLOOD
172 HUM. GENET.
169 DNA CELL BIOL.
168 J. EXP. MED.
163 APPL. ENVIRON. MICROBIOL.
158 DEV. BIOL.
156 NEURON
152 DNA
136 IMMUNOGENETICS
124 ENDOCRINOLOGY
123 DNA SEQ.
122 PLANT CELL
115 NAT. STRUCT. BIOL.
109 HEMOGLOBIN
108 PROTEIN SCI.
108 BIOCHIMIE
106 AGRIC. BIOL. CHEM.
105 BIOORG. KHIM.
101 CANCER RES.
===========================================================================
APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
DATABASES
The current status of the relationships (cross-references) between
SWISS-PROT and some biomolecular databases is shown in the following
schematic:
[Schematic: SWISS-PROT Protein Sequence Data Bank at the centre, linked
to the EMBL Nucleotide Sequence Database [EBI] and to the databases
listed below; HSSP is additionally linked to PDB. The layout of the
original character-based diagram could not be recovered.]

   EMBL Nucleotide Sequence Database [EBI]
   FlyBase
   MGD [Mouse]
   SubtiList [B.subtilis]
   GCRDb [7TM recep.]
   EcoGene [E.coli]
   Mendel [Plant]
   SGD [Yeast]
   MaizeDb [Zea mays]
   DictyDB [D.disco.]
   WormPep [C.elegans]
   ENZYME [Nomencl.]
   REBASE [Restriction enzymes]
   OMIM [Human]
   ECO2DBASE [2D]
   StyGene [S.typhimurium]
   Maize-2DPAGE [2D]
   Transfac
   SWISS-2DPAGE [2D]
   Harefield [2D]
   Aarhus/Ghent [2D]
   PROSITE [Patterns and profiles]
   YEPD [Yeast] [2D]
   PDB [3D structures]
   HSSP [3D similar.]
===End=of=SWISS-PROT=release=35=notes=====================================