| A sequence feature is no more than a classification of an interval of a sequence. So the sentence, "Sequence a's exon one is bases 10 through 39." is a sequence feature description. The sequence is "sequence a", the interval is "bases 10 through 39" and the classification is "exon one".
The FeatureDtd has tags for type (the classification), description (a human-readable note), and seq_relationship (a flexible way of specifying an interval on one or more sequences). You need to understand seq_relationship to follow what's going on here.
A sequence is often annotated based on homologies to other sequences. The seq_relationship tag allows you to specify this homology as a span on both the query sequence and on any 'hits'. An important outcome of this is that you can see the ordering of these spans on two different sequences. Let's say you're comparing a gene in mouse and human. You are trying to determine the exon structure. You have four regions of high similarity. If these homologous regions are in the same order in both human and mouse, it suggests that the highly homologous regions are four exons. However, if the regions are ordered 1, 2, 3, 4 in human, but 1, 3, 2, 4 in mouse, or if they are in the same order in both but one or more are inverted, it suggests that some of your homologies may be repeats or other problematic regions - certainly they are not exons 1 through 4. These types of relationships can't be seen without the knowledge of both intervals.
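The ordering check described above can be sketched in a few lines of Python. This is only an illustration, not part of the DTD or any BioXML tool: the function name `compare_homologies`, the tuple layout, and the coordinates are made up, and the sketch assumes that a span written with start greater than end lies on the reverse strand, which the DTD itself does not state.

```python
def compare_homologies(regions):
    """Compare the order and orientation of homologous regions.

    regions: list of (query_span, subject_span) pairs, each span a
    (start, end) tuple. Assumption: start > end means reverse strand.
    Returns (same_order, inverted_indices).
    """
    indices = range(len(regions))
    # Rank the regions by their leftmost coordinate on each sequence.
    by_query = sorted(indices, key=lambda i: min(regions[i][0]))
    by_subject = sorted(indices, key=lambda i: min(regions[i][1]))
    same_order = by_query == by_subject
    # A region is inverted if its two spans run in opposite directions.
    inverted = [
        i for i, (q, s) in enumerate(regions)
        if (q[0] <= q[1]) != (s[0] <= s[1])
    ]
    return same_order, inverted


# Hypothetical coordinates: order 1, 2, 3, 4 in human, 1, 3, 2, 4 in mouse.
shuffled = [
    ((100, 200), (50, 150)),
    ((300, 400), (500, 600)),
    ((500, 600), (300, 400)),
    ((700, 800), (700, 800)),
]
print(compare_homologies(shuffled))
```

Neither result can be computed from the query spans alone, which is why seq_relationship carries a span for each sequence involved.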
The other interesting thing to note is the parent IDREF. It denotes that a feature is the child of another feature. An example of this is a BLAST hit composed of multiple HSPs: the hit would be the parent feature and each HSP would point to the parent's ID. The coordinates given for the child features DO NOT become relative to the parent feature. They stay relative to the original sequence. So you could work out that one feature lies within another, even without the explicit reference.
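Because child coordinates stay on the original sequence, containment is a plain interval comparison. The sketch below uses made-up feature records (the dictionary keys, IDs, and coordinates are hypothetical) and assumes every span has start no greater than end and that all spans refer to the same sequence.

```python
def lies_within(child_span, parent_span):
    """Return True if child_span falls entirely inside parent_span.

    Both spans are (start, end) tuples on the same sequence,
    with start <= end (an assumption of this sketch).
    """
    c_start, c_end = child_span
    p_start, p_end = parent_span
    return p_start <= c_start and c_end <= p_end


# Hypothetical BLAST hit with two HSPs pointing back to it.
hit = {"id": "hit1", "parent": None, "span": (100, 500)}
hsps = [
    {"id": "hsp1", "parent": "hit1", "span": (120, 200)},
    {"id": "hsp2", "parent": "hit1", "span": (350, 480)},
]

for hsp in hsps:
    # No coordinate translation is needed before comparing.
    print(hsp["id"], lies_within(hsp["span"], hit["span"]))
```

If the child coordinates were relative to the parent, each HSP span would first have to be shifted by the hit's start before any comparison against other features on the sequence.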
The one thing a feature does not have is a seq reference. This is mostly because this DTD isn't really designed to stand on its own. Currently it's used in the ComputationDtd and AnnotationDtd, both of which link the features to their parent sequences. Eventually there may be an extended, stand-alone DasGff-like feature which will have to link to a parent seq. This DTD's URL is http://www.bioxml.org/dtds/current/feature.dtd.
<!ELEMENT bx-feature:feature (
bx-feature:type,
bx-feature:description?,
bx-feature:seq_relationship*
)>
<!ATTLIST bx-feature:feature
xmlns:bx-feature CDATA #FIXED "http://www.bioxml.org/feature/v0_1"
bx-feature:id ID #REQUIRED
bx-feature:parent IDREF #IMPLIED
>
<!ELEMENT bx-feature:type (#PCDATA)>
<!ELEMENT bx-feature:description (#PCDATA)>
<!ELEMENT bx-feature:seq_relationship (
bx-feature:span,
bx-feature:alignment?
)>
<!ATTLIST bx-feature:seq_relationship
bx-feature:seq IDREF #IMPLIED
bx-feature:type (query | subject | peer | subseq) #IMPLIED
>
<!ELEMENT bx-feature:span (
bx-feature:start,
bx-feature:end
)>
<!ATTLIST bx-feature:span
bx-feature:between (TRUE) #IMPLIED
bx-feature:either_dir (TRUE) #IMPLIED
>
<!ELEMENT bx-feature:start (#PCDATA)>
<!ELEMENT bx-feature:end (#PCDATA)>
<!ELEMENT bx-feature:alignment (#PCDATA)>
Here is a sample XML file. You can validate this file with the Java Xerces parser (assuming you're running Linux/Unix) with the command:
- java sax.SAXCount -Nwv http://www.bioxml.org/samples/feature.xml
The -v flag means validate, -w means warm up the parser before timing, and -N means turn off namespaces so that fully qualified names don't give weird errors. This program doesn't do anything other than validate, count the tags, and time the parsing of the document.
<?xml version="1.0"?>
<!DOCTYPE bx-feature:feature SYSTEM "/home/brad/tmp/dtds/feature.dtd">
<bx-feature:feature bx-feature:id='b46'>
<bx-feature:type>exon1</bx-feature:type>
<bx-feature:seq_relationship>
<bx-feature:span>
<bx-feature:start>1</bx-feature:start>
<bx-feature:end>100</bx-feature:end>
</bx-feature:span>
</bx-feature:seq_relationship>
</bx-feature:feature>
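To pull the values out of a feature like this one programmatically, something along the following lines should work with Python's standard xml.etree.ElementTree. It is a sketch under a few assumptions: the function name `read_feature` is made up, the DOCTYPE line is dropped, and the xmlns:bx-feature declaration is written out explicitly, because ElementTree does not read the external DTD that would otherwise supply it as a #FIXED default.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the #FIXED attribute in the DTD.
NS = "http://www.bioxml.org/feature/v0_1"

# The sample feature, with the namespace declared explicitly
# (an adjustment for this sketch) and no DOCTYPE.
SAMPLE = """<?xml version="1.0"?>
<bx-feature:feature xmlns:bx-feature="http://www.bioxml.org/feature/v0_1"
                    bx-feature:id="b46">
  <bx-feature:type>exon1</bx-feature:type>
  <bx-feature:seq_relationship>
    <bx-feature:span>
      <bx-feature:start>1</bx-feature:start>
      <bx-feature:end>100</bx-feature:end>
    </bx-feature:span>
  </bx-feature:seq_relationship>
</bx-feature:feature>
"""


def qname(name):
    """Build the '{namespace}name' form that ElementTree uses."""
    return "{%s}%s" % (NS, name)


def read_feature(xml_text):
    """Return the id, type, and list of (start, end) spans of a feature."""
    root = ET.fromstring(xml_text)
    spans = []
    for rel in root.findall(qname("seq_relationship")):
        span = rel.find(qname("span"))
        start = int(span.findtext(qname("start")))
        end = int(span.findtext(qname("end")))
        spans.append((start, end))
    return {
        "id": root.get(qname("id")),
        "type": root.findtext(qname("type")),
        "spans": spans,
    }


feature = read_feature(SAMPLE)
print(feature)
```

Note that this only parses the document; it does not validate it against the DTD the way the Xerces command does.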
And that's it for features. Now we move on to something useful, the ComputationTutorial.
|