Coordination between projects

We hope to coordinate between XML projects by suggesting a number of common element sets for DTDs. For example, we suggest that when people want to provide raw sequence information, they use the following DTD snippet. This page is in development; email Ewan Birney if you want to help.

<!ELEMENT seq (dbxref*, residues?) >
  <!ATTLIST seq
    id           ID        #REQUIRED
    length       CDATA     #IMPLIED >
<!ELEMENT residues (#PCDATA)>
  <!ATTLIST residues
    type      (dna | rna | aa)        #REQUIRED >

Each project can mix any number of these "snippets" together and add its own extensions, either as additional elements or as attributes. For example, the GAME DTD adds maturity as an attribute (along with others) to this DTD. At the moment, the proposed common elements come mainly from pieces of the GAME project and the InterPro XML that overlapped and looked common enough to be worth generalising. We welcome any comments on these elements.

Proposed common elements

Date
```
<!ELEMENT date (#PCDATA)>
<date>1999-09-12</date>
```
The date tag indicates a date; its content is a string formatted according to the ISO date standard (ISO 8601). Note that the DTD parser therefore cannot enforce valid dates: they must be checked programmatically later.
We are aware of the proposed date type in the new XMI standard, and this element should work well with that system (we should check!).
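Because the DTD cannot validate the content of a date element, a consumer has to do it. The following is a minimal sketch of such a check in Python, assuming the zero-padded YYYY-MM-DD form of the ISO standard; the function name `parse_date_element` is hypothetical and not part of any project mentioned here.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: strict YYYY-MM-DD, with zero-padded month and day.
ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date_element(xml_text):
    """Return a datetime.date for a <date> element, or raise ValueError."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "date":
        raise ValueError(f"expected <date>, got <{elem.tag}>")
    match = ISO_DATE.fullmatch((elem.text or "").strip())
    if match is None:
        raise ValueError(f"not in YYYY-MM-DD form: {elem.text!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself rejects impossible calendar dates such as 1999-02-30.
    return date(year, month, day)
```

A caller would typically catch `ValueError` and report the offending element rather than abort the whole document.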
Database Cross Reference
```
<!ELEMENT dbxref (database, unique_id, version?)>
<!ELEMENT database (#PCDATA)>
<!ELEMENT unique_id (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<dbxref>
  <database>swissprot</database>
  <unique_id>P09651</unique_id>
</dbxref>
```
dbxref indicates a single object in another database. The database tag names the database (for example, swissprot or embl). These values are elements rather than attributes to allow better use of XSL and other automatic XML translation tools for easy viewing. The unique_id element gives the key for the object in the database, which should be the natural key for that object. For the protein and DNA databases, these are the accession numbers of the sequences. The version element is optional, and provides a placeholder for the additional version information proposed for a number of database keys (such as DNA accession numbers and pids).
We believe that this dbxref tag works well with the proposed Identifier string syntax used in the LSR CORBA submission on Sequence Analysis. We hope to define the precise mapping between dbxref and the Identifier string in the future.
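As an illustration of how a program might consume a dbxref, here is a small Python sketch that reads the three child elements into a record. The `DbXref` type and `parse_dbxref` function are hypothetical names introduced for this example; they are not the Identifier string mapping mentioned above, which is not yet defined.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class DbXref(NamedTuple):
    """Hypothetical in-memory form of a <dbxref> element."""
    database: str
    unique_id: str
    version: Optional[str] = None


def parse_dbxref(elem):
    """Build a DbXref from a <dbxref> element; version is optional."""
    database = elem.findtext("database")
    unique_id = elem.findtext("unique_id")
    if database is None or unique_id is None:
        raise ValueError("dbxref needs both <database> and <unique_id>")
    version = elem.findtext("version")
    return DbXref(
        database.strip(),
        unique_id.strip(),
        version.strip() if version is not None else None,
    )


xref = parse_dbxref(ET.fromstring(
    "<dbxref><database>swissprot</database>"
    "<unique_id>P09651</unique_id></dbxref>"
))
print(xref.database, xref.unique_id, xref.version)
```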
Sequence Data
```
<!ELEMENT seq (dbxref*, residues?) >
  <!ATTLIST seq
    id           ID        #REQUIRED
    name         CDATA     #IMPLIED
    length       CDATA     #IMPLIED >
<!ELEMENT residues (#PCDATA)>
  <!ATTLIST residues
    type      (dna | rna | aa)        #REQUIRED >
<seq id="seq_dtd_id_12" name="common_name">
  <dbxref>
     <database>swissprot</database>
     <unique_id>P09651</unique_id>
  </dbxref>
<residues type="aa">
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT
YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL
RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL
SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG
DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG
SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
SSSSYGSGRRF
</residues>
</seq>
```
The seq tag indicates a sequence and carries the minimal information needed to transmit a sequence. Note that a sequence can be defined with or without residues (sometimes one wants a placeholder sequence definition without the residues themselves: for example, when referring to genomic DNA sequence). The length attribute provides a way of indicating the length of the sequence when no residues are provided. If residues are provided, then the length of the residues is the actual length: if the length attribute disagrees with this, a program is at liberty to flag an error and, if it wishes to continue, should take the length of the residues as the length.
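The length rule above can be sketched in Python as follows. This is only an illustration: the function name `sequence_length` is hypothetical, and it assumes that whitespace inside residues (such as line breaks) does not count towards the length.

```python
import xml.etree.ElementTree as ET


def sequence_length(seq_elem):
    """Return (length, mismatch) for a <seq> element.

    length is taken from the residues when they are present, otherwise
    from the length attribute (None if neither is given). mismatch is
    True when both are present and disagree.
    """
    declared = seq_elem.get("length")
    declared = int(declared) if declared is not None else None
    residues = seq_elem.find("residues")
    if residues is None:
        return declared, False
    # Assumption: whitespace inside <residues> is layout, not sequence.
    actual = len("".join((residues.text or "").split()))
    mismatch = declared is not None and declared != actual
    return actual, mismatch
```

Returning the mismatch flag instead of raising leaves the caller free either to report an error or to continue with the length of the residues.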
The name attribute indicates a common name (or human readable name) of the sequence if desired.
The id attribute is an XML ID and so must be unique within the XML document. It is an important attribute because it allows additional information about the sequence to refer to the sequence using an ID/IDREF system (for example, sequence features in GAME use this). Relationships to sequences are used so heavily that we feel it is important to standardise this.
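A validating parser enforces ID/IDREF consistency against the DTD, but a non-validating one does not. As a rough sketch, the check can be approximated in Python with the standard library; it assumes that any element carrying a seq attribute (as computational_analysis and seq_relationship do) refers to a seq id, and the function name `check_seq_references` is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_seq_references(root):
    """Return a list of problems with seq ids and seq references."""
    problems = []
    seen = set()
    for seq in root.iter("seq"):
        seq_id = seq.get("id")
        if seq_id is None:
            problems.append("seq without id")
        elif seq_id in seen:
            problems.append(f"duplicate id: {seq_id}")
        else:
            seen.add(seq_id)
    # Assumption: a seq attribute on any element is an IDREF to a <seq>.
    for elem in root.iter():
        ref = elem.get("seq")
        if ref is not None and ref not in seen:
            problems.append(f"dangling reference: {ref}")
    return problems
```

This only covers seq ids; a real XML ID must be unique across all elements in the document, which is a reason to prefer a validating parser where one is available.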
Any number of dbxrefs can be associated with a sequence definition. In general these dbxrefs are meant to be placeholders that indicate, if desired, where the sequence came from, but they can also be used to indicate other cross-references (for example, from a protein sequence to a DNA sequence).

Computational Analysis

<!ELEMENT computational_analysis (date?, program, version?, parameter*, 
                                  database?, result_set+)>
  <!ATTLIST computational_analysis  
    seq       IDREF   #REQUIRED
  >
<!ELEMENT program (#PCDATA)>
<!ELEMENT result_set (score?, output*, result*)>
<!ELEMENT result (score, type, subtype?, seq_relationship+, output*)>
  <!ATTLIST result
    id         ID        #IMPLIED
  >
<!ELEMENT seq_relationship (location, alignment?)>
  <!ATTLIST seq_relationship
    seq IDREF #REQUIRED
    type (query | subject | peer ) #REQUIRED
  >
<!ELEMENT alignment (#PCDATA)>
<!ELEMENT parameter (type, value)>
<!ELEMENT output (type, value)>
<!ELEMENT database (name, date?, version?)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT score (#PCDATA)>
<computational_analysis>
  <program>blast</program>
  <version>2.0.1</version>
  <parameter>
    <type>B</type>
    <value>10000</value>
  </parameter>
  <database>swissprot</database>
  <result_set>
    <score>1344</score>
    <output>
      <type>evalue</type>
      <value>1e-9</value>
    </output>
    <result>
      <seq_relationship seq="seq_id_132">
        <location> 
          <start>120</start>
          <end>160</end>
        </location>
      </seq_relationship>
    </result>
  </result_set>
</computational_analysis>

computational_analysis is the output from one execution of a program. date is an optional element, but is recommended for record keeping to know when the program was run. The program is