Annotation format

PubAnnotation uses JSON as its default format to store annotations. This document describes how annotations are represented in JSON in PubAnnotation.

The PubAnnotation JSON annotation format supports three types of information:

  • denotation,
  • relation, and
  • modification.

Denotations

A denotation connects a span of text to a conceptual object. In the following example, there are two denotation annotations:

{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"}
]
}

The following is a visualization of the above annotation, generated by TextAE:

denotation example

Note that in the visualization, labels are truncated at the end when there is insufficient space.

The example states that there are two denotations, T1 and T2.

  • The first one connects span 0-5 (the text between character offsets 0 and 5, i.e., IRF-4) to Protein,
  • while the second connects span 42-47 to Protein.

The semantic interpretation may vary. However, the default interpretation of T1 is as follows:

  • the text span between character offsets 0 and 5
    • "span":{"begin":0, "end":5}
  • denotes an entity T1
    • "id":"T1"
  • of which the type is Protein.
    • "obj":"Protein"
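The default interpretation can be sketched in a few lines of Python. This is only an illustration, not part of PubAnnotation: the helper name covered_text is hypothetical, and the sketch assumes that begin and end are character offsets into text with end exclusive, which is consistent with the example above.

```python
import json

# The denotation example from above, parsed from its JSON form.
doc = json.loads("""
{
  "text": "IRF-4 expression in CML may be induced by IFN-α therapy",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
    {"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"}
  ]
}
""")


def covered_text(doc, denotation):
    """Return the text covered by a denotation's span (hypothetical helper).

    Assumption: begin/end are character offsets and end is exclusive.
    """
    span = denotation["span"]
    return doc["text"][span["begin"]:span["end"]]


# Print each entity id, its type, and the text it denotes.
for d in doc["denotations"]:
    print(d["id"], d["obj"], repr(covered_text(doc, d)))
```

Under these assumptions, T1 resolves to the text IRF-4 and T2 to IFN-α.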

Relations

A relation connects two entities.

{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"}
],
"relations": [
{"id": "R1", "subj": "T1", "pred": "interactWith", "obj": "T2"}
]
}

relation example

The example above states that the two entities, T1 and T2, which are introduced by the two denotations, are related to each other by the predicate interactWith. Note that the two entities are specified by two different keys, subj and obj, so the relation is directional. This design is motivated by better compatibility with RDF.
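Because a relation has a subject, a predicate, and an object, it can be read as a triple. The following Python sketch resolves each relation of the example into such a triple, using the covered text as a label. The function name to_triples is hypothetical, and the sketch assumes that both ends of a relation are denotation ids and that end offsets are exclusive.

```python
# The relation example from above, written as a Python dict.
doc = {
    "text": "IRF-4 expression in CML may be induced by IFN-α therapy",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
        {"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
    ],
    "relations": [
        {"id": "R1", "subj": "T1", "pred": "interactWith", "obj": "T2"},
    ],
}


def to_triples(doc):
    """Resolve relations into (subject text, predicate, object text) triples.

    Assumption: subj and obj both refer to ids of denotations in the same doc.
    """
    spans = {d["id"]: d["span"] for d in doc.get("denotations", [])}

    def label(entity_id):
        # Use the text covered by the entity's span as a readable label.
        span = spans[entity_id]
        return doc["text"][span["begin"]:span["end"]]

    return [
        (label(r["subj"]), r["pred"], label(r["obj"]))
        for r in doc.get("relations", [])
    ]


print(to_triples(doc))
```

Swapping subj and obj yields a different triple, which is what makes the relation directional.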

Note that PubAnnotation does not enforce any specific annotation scheme, e.g., the labels for obj in denotations and those for pred in relations; how to design the scheme is entirely up to the producer of the annotation. For example, while the style of annotation in the above example may be familiar to a community that seeks information on protein-protein interaction, another community, e.g., BioNLP Shared Task, may be more familiar with a finer-grained annotation:

{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
{"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
{"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"}
],
"relations": [
{"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"},
{"id": "R2", "subj": "E1", "pred": "themeOf", "obj": "E2"},
{"id": "R3", "subj": "T2", "pred": "causeOf", "obj": "E2"}
]
}

relation example 2

Modifications

A modification annotation modifies the meaning of denotations and relations, specifically in terms of negation and speculation.

{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"}
],
"relations": [
{"id": "R1", "subj": "T1", "pred": "interactWith", "obj": "T2"}
],
"modifications": [
{"id": "M1", "pred": "Speculation", "obj": "R1"}
]
}

modification example

In the above example, the modification annotation M1 states that the relation R1 is speculative rather than declarative. The annotation may be motivated by the word "may" in the sentence. However, again, PubAnnotation does not enforce any specific annotation scheme, and an actual annotation may be made in a completely different way:

{
"text": "IRF-4 expression in CML may be induced by IFN-α therapy",
"denotations": [
{"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
{"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
{"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
{"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"}
],
"relations": [
{"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"},
{"id": "R2", "subj": "E1", "pred": "themeOf", "obj": "E2"},
{"id": "R3", "subj": "T2", "pred": "causeOf", "obj": "E2"}
],
"modifications": [
{"id": "M1", "pred": "Speculation", "obj": "E2"}
]
}

modification example 2

In the above example, the modification annotation M1 marks the entity E2 (a regulation event), or more precisely its existence, as speculative, instead of marking a relation as speculative.

Note that the syntax of modification annotations is experimental and subject to change.

Which labels to use for the pred of a modification is up to the designer of the annotation. However, the visualization in TextAE currently supports only Speculation and Negation.
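Since the obj of a modification may refer to either a denotation or a relation, a reader has to look the id up in both lists. The following Python sketch shows one way to do that. The function name modification_targets is hypothetical, and the sketch assumes that ids are unique across denotations and relations, as they are in the examples above.

```python
# The second modification example from above, written as a Python dict.
doc = {
    "text": "IRF-4 expression in CML may be induced by IFN-α therapy",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 5}, "obj": "Protein"},
        {"id": "T2", "span": {"begin": 42, "end": 47}, "obj": "Protein"},
        {"id": "E1", "span": {"begin": 6, "end": 16}, "obj": "Expression"},
        {"id": "E2", "span": {"begin": 31, "end": 38}, "obj": "Regulation"},
    ],
    "relations": [
        {"id": "R1", "subj": "T1", "pred": "themeOf", "obj": "E1"},
        {"id": "R2", "subj": "E1", "pred": "themeOf", "obj": "E2"},
        {"id": "R3", "subj": "T2", "pred": "causeOf", "obj": "E2"},
    ],
    "modifications": [
        {"id": "M1", "pred": "Speculation", "obj": "E2"},
    ],
}


def modification_targets(doc):
    """Map each modification id to (pred, kind, target annotation).

    kind is "denotation" or "relation".
    Assumption: ids are unique across denotations and relations.
    """
    index = {}
    for d in doc.get("denotations", []):
        index[d["id"]] = ("denotation", d)
    for r in doc.get("relations", []):
        index[r["id"]] = ("relation", r)

    targets = {}
    for m in doc.get("modifications", []):
        kind, target = index[m["obj"]]
        targets[m["id"]] = (m["pred"], kind, target)
    return targets


pred, kind, target = modification_targets(doc)["M1"]
print(pred, kind, target["id"], target["obj"])
```

For this example, M1 resolves to a Speculation on the denotation E2; with the first modification example it would resolve to the relation R1 instead.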

Multi-layer annotations

Multi-layer annotations (annotations made by multiple projects on the same text) can be represented as multiple tracks.

Usually, you will access annotations within a project, e.g.,

  • http://pubannotation.org/projects/GO-BP/docs/sourcedb/PubMed/sourceid/10704529/spans/0-119/annotations.json

In that case, you will get the annotations without tracks:

{
"target":"http://pubannotation.org/docs/sourcedb/PubMed/sourceid/10704529",
"sourcedb":"PubMed",
"sourceid":"10704529",
"text":"Ultrastructural localization of sulfated and unsulfated keratan sulfate in normal and macular corneal dystrophy type I.",
"project":"GO-BP",
"denotations":[
{"id":"T1","span":{"begin":16,"end":28},"obj":"http://purl.obolibrary.org/obo/GO_0051179"},
{"id":"T5","span":{"begin":32,"end":40},"obj":"http://purl.obolibrary.org/obo/GO_0051923"},
{"id":"T8","span":{"begin":64,"end":71},"obj":"http://purl.obolibrary.org/obo/GO_0051923"}
]
}

However, if you access annotations without specifying a project (or if you specify multiple projects), e.g.,

  • http://pubannotation.org/docs/sourcedb/PubMed/sourceid/10704529/spans/0-119/annotations.json

then you will get the annotations in multiple tracks:

{
"target":"http://pubannotation.org/docs/sourcedb/PubMed/sourceid/10704529",
"sourcedb":"PubMed",
"sourceid":"10704529",
"text":"Ultrastructural localization of sulfated and unsulfated keratan sulfate in normal and macular corneal dystrophy type I.",
"tracks":[
{
"project":"GO-BP",
"denotations":[
{"id":"T1","span":{"begin":16,"end":28},"obj":"http://purl.obolibrary.org/obo/GO_0051179"},
{"id":"T5","span":{"begin":32,"end":40},"obj":"http://purl.obolibrary.org/obo/GO_0051923"},
{"id":"T8","span":{"begin":64,"end":71},"obj":"http://purl.obolibrary.org/obo/GO_0051923"}
]},
{
"project":"GlycoBiology-GDGDB",
"denotations":[
{"id":"_T1","span":{"begin":86,"end":116},"obj":"http://acgg.asia/db/diseases/gdgdb?con_ui=CON00391"},
{"id":"_T2","span":{"begin":86,"end":118},"obj":"http://acgg.asia/db/diseases/gdgdb?con_ui=CON00391"}
]
}
]
}

Note that the difference depends on whether a project is specified in the URL.
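A client that may receive either shape can normalize both into the same structure. The following Python sketch groups denotations by project for a response with or without tracks. The function name denotations_by_project is hypothetical, the sample data is trimmed from the two responses above, and the sketch assumes that each track names a distinct project.

```python
TEXT = ("Ultrastructural localization of sulfated and unsulfated keratan "
        "sulfate in normal and macular corneal dystrophy type I.")

# Trimmed versions of the two responses shown above.
single = {
    "text": TEXT,
    "project": "GO-BP",
    "denotations": [
        {"id": "T1", "span": {"begin": 16, "end": 28},
         "obj": "http://purl.obolibrary.org/obo/GO_0051179"},
    ],
}
multi = {
    "text": TEXT,
    "tracks": [
        {"project": "GO-BP", "denotations": [
            {"id": "T1", "span": {"begin": 16, "end": 28},
             "obj": "http://purl.obolibrary.org/obo/GO_0051179"},
        ]},
        {"project": "GlycoBiology-GDGDB", "denotations": [
            {"id": "_T1", "span": {"begin": 86, "end": 116},
             "obj": "http://acgg.asia/db/diseases/gdgdb?con_ui=CON00391"},
        ]},
    ],
}


def denotations_by_project(doc):
    """Group denotations by project for either response shape.

    Assumption: each track names a distinct project.
    """
    # A multi-track response nests each project's annotations under "tracks";
    # a single-project response carries them at the top level.
    tracks = doc["tracks"] if "tracks" in doc else [doc]
    return {t.get("project"): t.get("denotations", []) for t in tracks}


for response in (single, multi):
    grouped = denotations_by_project(response)
    print({project: [d["id"] for d in ds] for project, ds in grouped.items()})
```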

Discontinuous spans

Sometimes you may want a denotation to cover multiple discontinuous spans. For example, suppose you want to annotate left lung in the text left and right lung with the ontology id UBERON:0002168. Because the two words are not adjacent to each other, it is not straightforward to specify the span of the denotation.

To represent discontinuous spans as the span of a denotation, PubAnnotation supports two models: (1) the bagging model, and (2) the chaining model.

Bagging model

In the bagging model, the span of a denotation may be specified as an array of begin and end offset pairs, e.g.,

{
"text":"left and right lung",
"denotations":[
{"id":"T2","span":[{"begin":0,"end":4},{"begin":15,"end":19}],"obj":"UBERON:0002168"}
]
}

The bagging model may be intuitively easy to understand, particularly in the JSON representation. However, it is a kind of syntactic sugar that goes beyond the normal representation of PubAnnotation. Internally, it is converted to the chaining model.

Note that in the bagging model, a span may be specified either by a single pair of begin and end offsets or by an array of such pairs. Therefore, software that reads a JSON representation of annotations must check the type of the span value at run time (dynamic type checking, a.k.a. duck typing).
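The run-time check can be kept in one place by normalizing every span to a list. The following Python sketch illustrates the idea; the helper names as_span_list and covered_fragments are hypothetical, and end offsets are assumed to be exclusive.

```python
# The bagging-model example from above, written as a Python dict.
doc = {
    "text": "left and right lung",
    "denotations": [
        {"id": "T2",
         "span": [{"begin": 0, "end": 4}, {"begin": 15, "end": 19}],
         "obj": "UBERON:0002168"},
    ],
}


def as_span_list(span):
    """Return a span as a list of begin/end pairs, whichever form it has."""
    if isinstance(span, dict):
        return [span]      # a single pair of offsets
    return list(span)      # already an array of pairs


def covered_fragments(doc, denotation):
    """Return the text of each fragment covered by a denotation.

    Assumption: end offsets are exclusive.
    """
    return [doc["text"][s["begin"]:s["end"]]
            for s in as_span_list(denotation["span"])]


print(covered_fragments(doc, doc["denotations"][0]))
```

For the example, the two fragments are left and lung. Code downstream of as_span_list no longer needs to care which form the producer used.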

Chaining model (default)

The chaining model uses the normal syntax of the PubAnnotation JSON format, but it uses a special vocabulary to represent the involvement of multiple discontinuous spans in a denotation. For example, the bagging-model example above is internally converted to the chaining model as follows:

{
"text":"left and right lung",
"denotations":[
{"id":"T1","span":{"begin":0,"end":4},"obj":"_FRAGMENT"},
{"id":"T2","span":{"begin":15,"end":19},"obj":"UBERON:0002168"}
],
"relations":[
{"id":"R1","pred":"_lexicallyChainedTo","subj":"T2","obj":"T1"}
]
}

It will be rendered in TextAE as below:

chaining discontinuous spans example
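A client that prefers the bagging form can fold a chained annotation back into it. The following Python sketch is one possible reading of the vocabulary shown above, not PubAnnotation's own conversion code: the function name chain_to_bag is hypothetical, and it assumes that a chain starts at the denotation carrying the real obj, that each denotation has at most one outgoing _lexicallyChainedTo relation, and that longer chains are linked transitively.

```python
FRAGMENT = "_FRAGMENT"
CHAINED_TO = "_lexicallyChainedTo"

# The chaining-model example from above, written as a Python dict.
chained = {
    "text": "left and right lung",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 4}, "obj": "_FRAGMENT"},
        {"id": "T2", "span": {"begin": 15, "end": 19}, "obj": "UBERON:0002168"},
    ],
    "relations": [
        {"id": "R1", "pred": "_lexicallyChainedTo", "subj": "T2", "obj": "T1"},
    ],
}


def chain_to_bag(doc):
    """Fold _FRAGMENT denotations into the denotation that chains to them."""
    by_id = {d["id"]: d for d in doc.get("denotations", [])}
    # subj -> obj for each chaining relation
    # (assumption: at most one outgoing link per denotation).
    next_of = {r["subj"]: r["obj"] for r in doc.get("relations", [])
               if r["pred"] == CHAINED_TO}

    denotations = []
    for d in doc.get("denotations", []):
        if d["obj"] == FRAGMENT:
            continue  # fragments are absorbed into their head denotation
        spans, seen, cur = [d["span"]], {d["id"]}, d["id"]
        # Follow the chain, guarding against accidental cycles.
        while cur in next_of and next_of[cur] not in seen:
            cur = next_of[cur]
            seen.add(cur)
            spans.append(by_id[cur]["span"])
        spans.sort(key=lambda s: s["begin"])
        span = spans if len(spans) > 1 else spans[0]
        denotations.append({"id": d["id"], "span": span, "obj": d["obj"]})

    # Keep every other key as is, and drop the chaining relations.
    out = {k: v for k, v in doc.items() if k not in ("denotations", "relations")}
    out["denotations"] = denotations
    relations = [r for r in doc.get("relations", []) if r["pred"] != CHAINED_TO]
    if relations:
        out["relations"] = relations
    return out


print(chain_to_bag(chained))
```

Applied to the example, the sketch reproduces the bagging-model representation shown earlier: a single denotation T2 whose span is an array of the two offset pairs.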

PubAnnotation uses the chaining model by default. The JSON representation in the bagging model can be accessed by setting the parameter discontinuous_span to the value bag, e.g.,

  • http://pubannotation.org/projects/example/docs/sourcedb/@Jin-Dong%20Kim/sourceid/2/annotations.json?discontinuous_span=bag