SWISS-PROT RELEASE 30.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 30.0 of SWISS-PROT contains 40292 sequence entries, comprising
14'147'368 amino acids abstracted from 37887 references. This represents
an increase of 5.1% over release 29. The recent growth of the data bank
is summarized below.
Release Date Number of entries Nb of amino acids
2.0 09/86 3939 900 163
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
26.0 07/93 31808 10 875 091
27.0 10/93 33329 11 484 420
28.0 02/94 36000 12 496 420
29.0 06/94 38303 13 464 008
30.0 10/94 40292 14 147 368
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 29
2.1 Sequences and annotations
About 2005 sequences have been added since release 29, the sequence data
of 296 existing entries has been updated and the annotations of 5260
entries have been revised.
<PAGE>
We are continuing the process to 'clean-up' the various representations
of domains in the feature lines (especially the usage of the feature
keys "DOMAIN", "REPEAT", "DNA_BIND", and "SITE"). We also have undertook
an overall revision of the CC topics "SUBCELLULAR LOCATION", "SUBUNIT
and "CAUTION".
2.2 What's happening with the model organisms
As we announced in the last three releases we have selected a number of
organisms that are the target of genome sequencing and/or mapping
projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates.
- Provide a high level of annotations.
- Cross-references to specialized database(s) that contain, among other
data, some genetic information about the genes that code for these
proteins.
- Provide specific indices or documents.
What was done since the last release or in preparation for the next
release:
- We have added Bacillus subtilis as a model organism. We will soon
link SWISS-PROT to the SubtiList database currently being designed by
Ivan Moszer of the Pasteur Institute in Paris.
- The next release of LISTA will include accession numbers for each
gene entry; we will therefore be able to cross-reference SWISS-PROT
to LISTA.
Here is the current status of the model organisms:
Organism Database Index file Number of
cross-referenced sequences
-------------- -------------------------- -------------- ---------
B.subtilis SubtiList (in preparation) In preparation 777
C.elegans WormPep CELEGANS.TXT 688
D.discoideum DictyDB DICTY.TXT 198
D.melanogaster FlyBase In preparation 647
E.coli EcoGene ECOLI.TXT 2901
H.sapiens MIM MIMTOSP.TXT 2914
S.cerevisiae LISTA (in preparation) YEAST.TXT 2287
2.3 Changes in the CC line
We have introduced in this release a new 'topic' for the comments (CC)
line-type: DOMAIN. This topic is used to describe the general domain
structure of a protein. It is intended to be used for general statements
on the domain organization of specific protein and supplement the domain
information available in the feature table.
<PAGE>
Examples of its usage:
CC -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.
CC -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).
The topic "ALTERNATIVE SPLICING" has been renamed to "ALTERNATIVE
PRODUCTS" as it used for describing the existence of related protein
sequence(s) produced by either alternative splicing of the same gene(s)
or by the use of alternative initiation codons.
Examples of its usage:
CC -!- ALTERNATIVE PRODUCTS: SKELETAL MUSCLE AND FIBROBLAST
CC TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.
CC -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
CC THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
CC ISOZYMES: ALPHA, BETA AND BETA'.
2.4 Changes in the DR line
We have added cross-references from SWISS-PROT to the Yeast
Electrophoresis Protein Database (YEPD) from the Quest Protein Database
Center of Cold Spring Harbor Laboratory. These cross-references are
present in the DR lines:
Data bank identifier: YEPD
Primary identifier : Protein spot alphanumeric designation.
Secondary identifier: None; a dash '-' is stored in that field
Example : DR YEPD; 4270; -.
2.5 Status of the documentation files
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. The following table list all the
documents that are currently available or that will be added in the next
few months.
USERMAN .TXT User manual
RELNOTES.TXT Release notes
SHORTDES.TXT Short description of entries in SWISS-PROT
JOURLIST.TXT List of abbreviations for journals cited
KEYWLIST.TXT List of keywords in use
SPECLIST.TXT List of organism identification codes
EXPERTS .TXT List of on-line experts for PROSITE and SWISS-PROT
<PAGE>
ACINDEX .TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
7TMRLIST.TXT List of 7-transmembrane G-linked receptors entries
CDLIST .TXT CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep cross-
DICTY .TXT Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB cross-
references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database entries
referenced in SWISS-PROT
ECOLI .TXT Index of Escherichia coli K12 chromosomal entries and
their corresponding EcoGene cross-reference [4]
EMBLTOSP.TXT Index of EMBL Database entries referenced in SWISS-PROT
EXTRADOM.TXT Nomenclature of extracellular domains [1]
GLYCOSYL.TXT Index of glycosyl hydrolases classified by families on
the basis of sequence similarities [2]
HOXLIST .TXT Vertebrate homeobox proteins: nomenclature and index
MIMTOSP .TXT Index of MIM entries referenced in SWISS-PROT
NOMLIST .TXT List of nomenclature related references for proteins
PDBTOSP .TXT Index of Brookhaven PDB entries referenced in SWISS-PROT
PLASTID .TXT List of chloroplast and cyanelle encoded proteins
RESTRIC .TXT List of restriction enzymes and methylases entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families on the
basis of sequence similarities [2]
YEAST .TXT Index of Saccharomyces cerevisiae entries and their
corresponding gene designations
YEAST2 .TXT Yeast Chromosome II entries [1]
YEAST3 .TXT Yeast Chromosome III entries [1]
YEAST8 .TXT Yeast Chromosome VIII entries [2]
YEAST11 .TXT Yeast Chromosome XI entries
Notes:
[1] New in release 30.
[2] Will be available starting with release 31 in February 1995.
2.6 The Expasy World-Wide Web server
The World-Wide Web (WWW), which originated at CERN, is a powerful global
information system merging networked information retrieval and
hypertext. It gives access, using hypertext links, to the documents and
information contained in all the existing WWW servers around the world,
as well as to the data obtainable through other information retrieval
systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
run on a local computer a client program (a WWW browser), which displays
hypertext documents. The user can then either request a keyword search
or jump to another document by following a hypertext link. WWW has the
<PAGE>
outstanding advantage of extending the hypertext model to the whole
world (by allowing hypertext jumps to documents anywhere on the internet
network) and by being device and user-interface independent (browsers
exist for a variety of computers and user-interfaces, including Unix
workstations running XWindows, MacIntoshes and PCs with Microsoft
Windows).
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, SWISS-2DPAGE and SWISS-3DIMAGE
databases and, through any SWISS-PROT protein sequence entry, to other
databases such as EMBL, PROSITE, REBASE, FlyBase, GCRDb, MaizeDB, OMIM,
PDB and Medline. Using a browser which is able to display images one can
also remotely access 2D gels image data from SWISS-2DPAGE.
A WWW server can be accessed on the internet through its Uniform
Resource Locator (URL), the addressing system defined by the WWW model.
The URL for the ExPASy WWW server is:
http://expasy.hcuge.ch/
or
http://129.195.254.61/
To access a WWW server, you need to run a browser (or client) program on
your local computer. Browsers exist for a variety of machines and may be
obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
we recommend NCSA Mosaic. It is a very flexible and powerful browser
with a graphical user interface; available for Unix boxes using
X11/Motif; for Apple McIntoshes and for Microsoft Windows. You can get
it from the FTP site: ftp.ncsa.uiuc.edu.
To access all the data available from SWISS-2DPAGE, the user's local
computer needs to run an image viewing program. For most browsers on
Unix workstations the default program is xv, a shareware application.
Similar Windows or Apple shareware or public domain applications are
also available.
For more information on the ExPASy WWW server, you can read the
following article:
Appel R.D., Bairoch A., Hochstrasser D.F.
A new generation of information retrieval tools for biologists: the
example of the ExPASy WWW server.
Trends Biochem. Sci. 19:258-260(1994).
Or you can contact Dr. Ron Appel:
Email: appel@cih.hcuge.ch
Fax: +41-22-372 61 98
<PAGE>
2.7 Weekly updates of SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:
new_seq.dat Contains all the new entries since the last full release.
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release.
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous ftp
servers:
Organization ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organization National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/swiss-prot/updates
Organization EBI ftp server
Address ftp.ebi.ac.uk (193.62.196.6)
Directory /pub/databases/swissprot/new
!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse in-
between two updates.
Due to the current mechanism used to build a release the entries that
are provided in these updates are not guaranteed to be error free. Also,
for the same reason, new entries do not contain an OC (Organism
Classification) line.
3. ENZYME AND PROSITE
3.1 The ENZYME data bank
Release 17.0 of the ENZYME data bank is distributed with release 30 of
SWISS-PROT. ENZYME release 17.0 contains information relative to 3546
enzymes.
3.2 The PROSITE data bank
Release 12.1 of the PROSITE data bank is distributed with release 30 of
SWISS-PROT. Release 12.1 contains 785 documentation chapters that
describes 1029 different patterns, rules and profiles/matrices. Release
12.1 does not really represent a new release; the only changes between
<PAGE>
releases 12.0 and 12.1 are updating of the pointers to the SWISS-PROT
entries whose name have been modified between releases 29 and 30. The
next release of PROSITE (13.0) will be distributed with release 31 of
SWISS-PROT.
WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.60 Gln (Q) 4.03 Leu (L) 9.22 Ser (S) 7.15
Arg (R) 5.23 Glu (E) 6.28 Lys (K) 5.84 Thr (T) 5.81
Asn (N) 4.48 Gly (G) 6.95 Met (M) 2.36 Trp (W) 1.28
Asp (D) 5.29 His (H) 2.24 Phe (F) 4.01 Tyr (Y) 3.21
Cys (C) 1.76 Ile (I) 5.61 Pro (P) 5.00 Val (V) 6.52
Asx (B) 0.005 Glx (Z) 0.005 Xaa (X) 0.02
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 4550
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2045
2x: 740
3x: 414
4x: 255
5x: 187
6x: 192
7x: 115
8x: 80
9x: 95
10x: 44
11- 20x: 175
21- 50x: 122
51-100x: 42
>100x: 44
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 2914 Human
2 2901 Escherichia coli
3 2287 Baker's yeast (Saccharomyces cerevisiae)
4 1748 Mouse
5 1610 Rat
6 777 Bacillus subtilis
7 723 Bovine
8 688 Caenorhabditis elegans
9 647 Fruit fly (Drosophila melanogaster)
10 551 Chicken
11 493 Salmonella typhimurium
12 432 African clawed frog (Xenopus laevis)
13 393 Rabbit
14 347 Pig
15 251 Arabidopsis thaliana (Mouse-ear cress)
251 Vaccinia virus (strain Copenhagen)
17 246 Maize
18 224 Fission yeast (Schizosaccharomyces pombe)
19 210 Rice
20 200 Bacteriophage T4
21 199 Pseudomonas aeruginosa
22 198 Slime mold (Dictyostelium discoideum)
23 193 Human cytomegalovirus (strain AD169)
24 184 Tobacco
25 183 Vaccinia virus (strain WR)
26 176 Pea
27 170 Wheat
28 163 Barley
29 151 Marchantia polymorpha (Liverwort)
30 147 Dog
31 146 Variola virus
32 141 Staphylococcus aureus
141 Sheep
34 139 Soybean
35 137 Pseudomonas putida
137 Rhodobacter capsulatus
37 133 Spinach
38 130 Neurospora crassa
39 128 Klebsiella pneumoniae
40 117 Bacillus stearothermophilus
117 Tomato
42 113 Agrobacterium tumefaciens
43 108 Potato
44 103 Rhizobium meliloti
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 2250 1001-1100 378
51- 100 3939 1101-1200 267
101- 150 5433 1201-1300 198
151- 200 3898 1301-1400 120
201- 250 3409 1401-1500 115
251- 300 3036 1501-1600 58
301- 350 2862 1601-1700 55
351- 400 2914 1701-1800 48
401- 450 2171 1801-1900 57
451- 500 2291 1901-2000 37
501- 550 1656 2001-2100 20
551- 600 1149 2101-2200 48
601- 650 813 2201-2300 54
651- 700 617 2301-2400 21
701- 750 586 2401-2500 26
751- 800 445 >2500 122
801- 850 346
851- 900 373
901- 950 246
951-1000 234
A.4 Longest sequences
The longest sequences (>=4000 residues) are listed here:
HTS1_COCCA 5217
FAT_DROME 5147
RYNR_RABIT 5037
RYNR_HUMAN 5032
RYNC_RABIT 4969
DYHC_DICDI 4725
DYHC_DROME 4639
APB_HUMAN 4563
APOA_HUMAN 4548
RRPA_CVMJH 4488
DYHC_TRIGR 4466
GRSB_BACBR 4451
PLEC_RAT 4140
DYHC_YEAST 4092
RRPA_CVH22 4085
<PAGE>
A.5 List of the most cited journals in SWISS-PROT
Citations Journal abbreviation
4345 J. BIOL. CHEM.
3050 NUCLEIC ACIDS RES.
2777 PROC. NATL. ACAD. SCI. U.S.A.
1755 J. BACTERIOL.
1528 FEBS LETT.
1452 GENE
1407 EUR. J. BIOCHEM.
1243 EMBO J.
1227 NATURE
1171 BIOCHEM. BIOPHYS. RES. COMMUN.
1134 BIOCHEMISTRY
896 J. MOL. BIOL.
879 BIOCHIM. BIOPHYS. ACTA
869 CELL
800 MOL. CELL. BIOL.
715 MOL. GEN. GENET.
683 VIROLOGY
641 BIOCHEM. J.
620 PLANT MOL. BIOL.
535 SCIENCE
521 J. BIOCHEM.
475 MOL. MICROBIOL.
436 J. VIROL.
382 J. GEN. VIROL.
268 J. CELL BIOL.
247 BIOL. CHEM. HOPPE-SEYLER
242 GENOMICS
232 GENES DEV.
213 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
212 ARCH. BIOCHEM. BIOPHYS.
210 CURR. GENET.
201 J. IMMUNOL.
174 MOL. BIOCHEM. PARASITOL.
171 MOL. ENDOCRINOL.
169 YEAST
164 J. GEN. MICROBIOL.
154 INFECT. IMMUN.
150 J. CLIN. INVEST.
142 DNA
138 ONCOGENE
123 PLANT PHYSIOL.
122 J. EXP. MED.
122 FEMS MICROBIOL. LETT.
114 J. MOL. EVOL.
111 AM. J. HUM. GENET.
101 HUM. MOL. GENET.
<PAGE>
APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
**********************
*********************** * EPD [Euk. Promot.] *
* EMBL Nucleotide * <----> **********************
* Sequence Data *
****************** * Library * **********************
* FLYBASE * <--> *********************** <----- * ECD [E. coli map] *
* [Drosophila * ^ ^ ^ **********************
* genomic d.b.] * <---------+ | | |
****************** | | | +------------ **********************
| | | * TFD [Trans. fact.] *
| | | **********************
****************** | | |
* MaizeDb * <------+ | | | **********************
****************** | | | +--------------> * GCRDb [7TM recep.] *
| | | | **********************
****************** | | | |
* WormPep * | | | | **********************
* [C.elegans] * <----+ | | | | +------> * DictyDB [D.disco.] *
****************** | | | | | | **********************
| | | | | |
****************** | v v v v v **********************
* REBASE * *********************** * ENZYME [Nomencl.] *
* [Restriction * <--- * SWISS-PROT * <----- **********************
* enzymes] * * Protein Sequence * |
****************** * Data Bank * v
*********************** **********************
****************** ^ ^ ^ | | ^ ^ | * OMIM [Diseases] *
* EcoGene/EcoSeq * | | | | | | | +--------> **********************
* [E. coli] * <-----+ | | | | | |
****************** | | | | | +-----------> **********************
| | | | | * ECO2DBASE [2D] *
| | | | | **********************
****************** | | | | |
* PROSITE * <--------+ | | | +-------------> **********************
* [Patterns] * | | | * SWISS-2DPAGE [2D] *
****************** | | | **********************
| | | |
| | | +----------------> **********************
| | | * Aarhus/Ghent [2D] *
| | +---------------+ **********************
| v |
| *********************** | **********************
+--------> * PDB [3D structures] * +--> * YEPD (yeast) [2D] *
*********************** **********************
<PAGE>