TTML Files
A TTML file is written in XML format and typically has the following structure:
The <tt> Element
The <head> Element
The <body> Element
The <br> and <span> Elements
Time Formats
The outermost element is the Timed Text (<tt>) element. All other elements are nested between the <tt> and </tt> tags, which mark the beginning and end of the <tt> element. The <head> element is optional. It contains information about styles, layouts, and document metadata. The <body> element contains the actual subtitles/captions.
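As a rough illustration of this nesting, the following Python sketch parses a minimal TTML document with the standard library and lists the children of the root element. The sample document and the helper name local_name are made up for this example; only the element names and the TTML namespace come from the format itself.

```python
import xml.etree.ElementTree as ET

# Minimal sample document (invented for illustration).
SKELETON_DOC = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <head>
    <!-- optional: styles, layout (regions), metadata -->
  </head>
  <body>
    <div>
      <p begin="00:00:01.00" end="00:00:03.50">Hello, world.</p>
    </div>
  </body>
</tt>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return element.tag.split("}", 1)[-1]


skeleton_root = ET.fromstring(SKELETON_DOC)
top_level = [local_name(child) for child in skeleton_root]

print(local_name(skeleton_root))  # name of the root element
print(top_level)                  # names of its direct children
```

The root element is <tt>, and its direct children are <head> followed by <body>.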
In addition to being the root element (i.e., the overall container for the document), the <tt> element is also used to specify document-level metadata. This information may include a document title, description, language, namespaces, and copyright information. Arbitrary metadata drawn from other namespaces may also be specified.
The <head> element specifies styles, regions, and metadata. Styles are used to indicate the desired look and feel of subtitles/captions. Regions define the size and location of the caption box. Metadata provides information about the document that might be used by editing, processing, or rendering tools.
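A <head> section with one style, one region, and a title might look like the sample embedded in the sketch below, which then reads the style and region definitions back with Python's standard library. The ids "defaultStyle" and "bottom", the attribute values, and the title are invented for the example; a real file will use its own.

```python
import xml.etree.ElementTree as ET

NS = {"tt": "http://www.w3.org/ns/ttml"}
TTS = "{http://www.w3.org/ns/ttml#styling}"            # styling attributes (tts:...)
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"    # the xml:id attribute

# Sample document (invented for illustration).
HEAD_DOC = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling"
    xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xml:lang="en">
  <head>
    <metadata>
      <ttm:title>Sample captions</ttm:title>
    </metadata>
    <styling>
      <style xml:id="defaultStyle" tts:color="white"
             tts:fontFamily="sansSerif" tts:fontSize="100%"/>
    </styling>
    <layout>
      <region xml:id="bottom" tts:origin="10% 80%" tts:extent="80% 15%"/>
    </layout>
  </head>
  <body/>
</tt>
"""

head_root = ET.fromstring(HEAD_DOC)

# Map each style id to its text color.
styles = {
    style.get(XML_ID): style.get(TTS + "color")
    for style in head_root.findall("tt:head/tt:styling/tt:style", NS)
}

# Map each region id to its (origin, extent), i.e. position and size.
regions = {
    region.get(XML_ID): (region.get(TTS + "origin"), region.get(TTS + "extent"))
    for region in head_root.findall("tt:head/tt:layout/tt:region", NS)
}

print(styles)
print(regions)
```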
The <body> element contains the actual subtitles/captions. Each subtitle is wrapped in a <p> element, which contains the text to be shown. Each <p> element has “begin” and “end” attributes specifying the times at which the subtitle appears on and disappears from the screen. Other attributes may be specified, the most common being style and region. The <div> element can be used to group <p> elements that share a common attribute, such as language, region, or font style.
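To make this concrete, here is a small Python sketch that collects the begin time, end time, and text of every <p> element in a sample body. The captions and the "bottom" and "defaultStyle" ids are placeholders invented for the example; in a complete file they would refer to a region and a style defined in <head>.

```python
import xml.etree.ElementTree as ET

TT = "{http://www.w3.org/ns/ttml}"  # ElementTree's notation for the TTML namespace

# Sample document (invented for illustration). The <div> groups two captions
# that share the same region.
BODY_DOC = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div region="bottom">
      <p begin="00:00:01.00" end="00:00:03.50">First caption.</p>
      <p begin="00:00:04.00" end="00:00:06.25" style="defaultStyle">Second caption.</p>
    </div>
  </body>
</tt>
"""

body_root = ET.fromstring(BODY_DOC)

# One (begin, end, text) tuple per <p> element, in document order.
cues = [
    (p.get("begin"), p.get("end"), p.text)
    for p in body_root.iter(TT + "p")
]

for begin, end, text in cues:
    print(begin, end, text)
```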
Markup can also be used within the caption text itself. For example, the <br/> element is used to force a line break. The <span> element is used to change the font style of only a portion of the text.
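The sketch below shows one possible way to turn such mixed content into plain text in Python: <br/> becomes a newline, and the text inside <span> is kept while its styling is ignored. The caption and the function name caption_text are invented for the example, and a real renderer would apply the span's style rather than discard it.

```python
import xml.etree.ElementTree as ET

TT = "{http://www.w3.org/ns/ttml}"

# Sample caption (invented for illustration) with a styled word and a line break.
INLINE_DOC = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.00" end="00:00:03.50">This is <span tts:fontStyle="italic">very</span> important.<br/>Second line.</p>
    </div>
  </body>
</tt>
"""


def caption_text(element):
    """Flatten an element to plain text, mapping <br/> to a newline."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == TT + "br":
            parts.append("\n")
        else:
            # e.g. <span>: keep its text, ignore its styling attributes
            parts.append(caption_text(child))
        # Text that follows the child element, up to the next tag
        parts.append(child.tail or "")
    return "".join(parts)


inline_root = ET.fromstring(INLINE_DOC)
paragraph = next(inline_root.iter(TT + "p"))
flattened = caption_text(paragraph)
print(flattened)
```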
Times can be expressed either in clock-time format or offset-time format. In either case, they are offsets that are typically relative to the beginning of the video (time zero). Clock-time format can be expressed in one of the following ways:
hours:minutes:seconds.fraction (e.g., “00:07:15.25”)
hours:minutes:seconds:frames (e.g., “00:07:15:06”)
Note that each segment is zero-padded to two digits. In the first example, seconds are expressed as a decimal number with a fractional part. The example time of “00:07:15.25” represents 7 minutes, 15 seconds, and 250 milliseconds from the beginning of the video. In the second example, frames are used instead of fractional seconds. The example time of “00:07:15:06” represents the 6th frame after 7 minutes and 15 seconds have passed.
Offset-time format is expressed as a single decimal number, optionally with a fractional part, followed by a unit indicator (aka “metric”). The unit indicator can be one of the following: “h” (hours), “m” (minutes), “s” (seconds), “ms” (milliseconds), “f” (frames), “t” (ticks). The most common unit indicator is seconds. For example, the clock time of “00:07:15.25” expressed in offset-time would be “435.25s” (7 × 60 + 15.25 = 435.25).
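The two formats can be converted to plain seconds with a few lines of Python. This is a minimal sketch rather than a complete parser: the function names are made up, and the frame rate of 30 and tick rate of 1 are assumptions chosen for the example. In a real document those values would come from parameters such as ttp:frameRate and ttp:tickRate on the <tt> element.

```python
import re

FRAME_RATE = 30  # assumed frames per second (illustrative only)
TICK_RATE = 1    # assumed ticks per second (illustrative only)

UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
OFFSET_RE = re.compile(r"^(\d+(?:\.\d+)?)(h|ms|m|s|f|t)$")


def clock_time_to_seconds(value, frame_rate=FRAME_RATE):
    """Convert 'hh:mm:ss.fraction' or 'hh:mm:ss:frames' to seconds."""
    parts = value.split(":")
    if len(parts) == 3:
        hours, minutes, seconds = parts
        frames = "0"
    elif len(parts) == 4:
        hours, minutes, seconds, frames = parts
    else:
        raise ValueError(f"not a clock time: {value!r}")
    return (int(hours) * 3600 + int(minutes) * 60
            + float(seconds) + int(frames) / frame_rate)


def offset_time_to_seconds(value, frame_rate=FRAME_RATE, tick_rate=TICK_RATE):
    """Convert an offset time such as '435.25s' or '90f' to seconds."""
    match = OFFSET_RE.match(value)
    if match is None:
        raise ValueError(f"not an offset time: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    if unit == "f":
        return number / frame_rate
    if unit == "t":
        return number / tick_rate
    return number * UNIT_SECONDS[unit]


print(clock_time_to_seconds("00:07:15.25"))  # 7 * 60 + 15.25 seconds
print(offset_time_to_seconds("435.25s"))     # the same instant as an offset
```

Both calls describe the same instant, 435.25 seconds after time zero. A frame-based time such as “00:07:15:06” only maps to a definite number of seconds once the frame rate is known, which is why it is passed in as a parameter here.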
For a more detailed description of the TTML file format and the functions available, go to: https://www.speechpad.com/captions/ttml