Introduction

Timed Text Markup Language (TTML), previously referred to as Distribution Format Exchange Profile (DFXP), is an XML-based World Wide Web Consortium (W3C) standard for timed text in online media. It was designed for authoring, transcoding, and exchanging timed text information, which is used mostly for subtitling and captioning video content.

TTML2, the second major revision of the language, was finalised on November 8, 2018. TTML has been adopted widely in the television industry, including by the Society of Motion Picture and Television Engineers (SMPTE), the European Broadcasting Union (EBU), the Advanced Television Systems Committee (ATSC), Digital Video Broadcasting (DVB), Hybrid Broadcast Broadband TV (HbbTV) and the Moving Picture Experts Group (MPEG). Several profiles and extensions of the language have been developed since the standard was published.

TTML content may also be used directly as a distribution format and is widely supported by most media players. However, this does not include major web browsers, where Web Video Text Tracks (WebVTT), the second W3C standard for timed text in online media, has better built-in support in connection with the HTML5 “<track>” element; many organisations nevertheless use TTML content for web video using their own player code.

The TTML standard specifies a wide range of features, of which often only a subset is needed for a specific application. For this reason, the standard was enhanced by the development of profiles, which are subsets of required features from the full specification. TTML1 defines three standard profiles: DFXP Transformation, DFXP Presentation and DFXP Full. Many profiles of TTML have been developed by W3C and other organisations over the years as subsets or extensions of TTML. The Timed Text Working Group maintains a registry of these TTML profiles.
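As an illustration of how a document can signal which profile it targets, the sketch below reads the ttp:profile attribute from the root element and maps it to one of the three TTML1 profile names. The sample document and the helper name declared_profile are hypothetical; the designator URIs are given as understood from the TTML1 Recommendation and should be checked against the profile registry before use.

```python
import xml.etree.ElementTree as ET

# ttp:profile lives in the TTML parameter namespace.
TTP_PROFILE = "{http://www.w3.org/ns/ttml#parameter}profile"

# Designators of the three TTML1 standard profiles (assumed from the TTML1 spec).
TTML1_PROFILES = {
    "http://www.w3.org/ns/ttml/profile/dfxp-transformation": "DFXP Transformation",
    "http://www.w3.org/ns/ttml/profile/dfxp-presentation": "DFXP Presentation",
    "http://www.w3.org/ns/ttml/profile/dfxp-full": "DFXP Full",
}

# Hypothetical document that declares the DFXP Presentation profile.
PROFILE_EXAMPLE = (
    '<tt xmlns="http://www.w3.org/ns/ttml" '
    'xmlns:ttp="http://www.w3.org/ns/ttml#parameter" '
    'ttp:profile="http://www.w3.org/ns/ttml/profile/dfxp-presentation" '
    'xml:lang="en"><body/></tt>'
)


def declared_profile(ttml_text):
    """Return a friendly name for the declared profile, the raw URI if unknown,
    or None when no ttp:profile attribute is present."""
    root = ET.fromstring(ttml_text)
    designator = root.get(TTP_PROFILE)
    return TTML1_PROFILES.get(designator, designator)
```

Profiles defined by other organisations use their own designators, which this sketch simply returns unchanged.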

History

Work on adding timing information on the Web by extending HTML began in 2003 with the launch of W3C’s Timed Text Working Group (TTWG), which was chartered to develop a format to represent streamable text synchronized with timed media like audio or video. This project was built on work done by W3C on the Synchronized Multimedia Integration Language (SMIL). As XML is the de facto standard for data exchange on the web, TTWG produced an XML-based standard and an early draft was released in November 2004 as Timed Text (TT) Authoring Format 1.0 – Distribution Format Exchange Profile (DFXP). The first version of TTML, TTML1, was finalised in November 2010.

In 2010, after discussions about its adoption in HTML5, the Web Hypertext Application Technology Working Group (WHATWG) opted for a new, more lightweight standard based on the popular SubRip (SRT) format, now named WebVTT. In February 2012, the Federal Communications Commission (FCC) declared the SMPTE closed-captioning standard for online video content, a superset of TTML, as a “safe harbour interchange, delivery format”.

Work on TTML2, the second version of TTML, started in February 2015. It was finalised in November 2018, along with a new revision of TTML1.

Synchronized Multimedia Integration Language (SMIL) is a W3C recommended XML markup language to describe multimedia presentations. It defines markup for timing, layout, animations, visual transitions, and media embedding, among other things. SMIL allows for presenting media items such as text, images, video, audio, links to other SMIL presentations, and files from multiple web servers. SMIL markup is written in XML and has similarities to HTML. A SMIL document is similar in structure to an HTML document in that it is typically divided between an optional “<head>” section and a required “<body>” section. The “<head>” section contains layout and metadata information. The “<body>” section contains the timing information and is usually composed of combinations of three main tags: sequential (“<seq>”, simple playlists), parallel (“<par>”, multi-zone/multi-layer playback) and exclusive (“<excl>”, event-triggered interrupts).

Accessibility and media

TTML is typically used when there is a requirement to add subtitles or closed captions to video content such as movies, instructional films, etc. It is important to understand the differences between the two and the reasons for their use in enhancing content.

Subtitles are used when the viewer can hear the audio, but does not understand the language being used. They are just a translation of the dialogue being spoken. Closed Captions, on the other hand, are used as Subtitles for the Deaf and Hard-of-Hearing (SDH). They aid deaf and hard-of-hearing viewers by communicating all audio content, including sound effects, speaker IDs, and other non-speech elements.

For example, a subtitle may just say:

“Watson, the game’s afoot!”

An equivalent Closed Caption would say:

“[Violin plays]”

“Holmes (excitedly): Watson, the game’s afoot!”

Subtitles can simply be “burnt in” to the video, in which case they form part of the video stream: they cannot be turned off, and they cannot be moved when they obscure an important part of the image. A much better system is to transmit them as a separate file. Users can then usually select subtitles by clicking the same CC icon they would use to turn on captions. Captions can also be made to move around the screen so as not to block significant areas. Although subtitles and closed captions serve different purposes, both are always synchronised with the media, and viewers can normally toggle them on and off.

TTML was designed to incorporate all the features of existing caption formats, and as such, it includes a rich set of functions, including:

  • positioning
  • alignment
  • styling
  • animation
  • multiple languages
  • metadata
  • multiple captions on the screen simultaneously

In the US, the Americans with Disabilities Act (ADA) is a broad, anti-discrimination law for people with disabilities. Titles II and III of the ADA affect web accessibility and closed captioning. Title II prohibits disability discrimination by all public entities at the local and state level. Governmental organizations must ensure “effective communication” with citizens, including providing assistive technology or services as needed. Title III prohibits disability discrimination by “places of public accommodation.” A place of public accommodation covers shared or public entities like libraries, universities, hotels, museums, theatres, transportation services, etc., that are privately owned. Video displayed within or distributed by such places must be captioned.

In the UK, Ofcom has included a section in their code for broadcasters that mandates certain standards for the use of SDH. This code sets the specific requirements for subtitling, sign language, and audio description for licensed television broadcasters. Broadcasters need to reach certain accessibility milestones 5 years and 10 years after the ‘relevant date’: 60% of all programming must be subtitled by year 5, and 80% must be subtitled by year 10. Unless otherwise specified, the ‘relevant date’ is the first date of licensed broadcasting.

There are, of course, many other situations where subtitles/captions can be useful to the viewer. One example is an open-plan office, where an audible soundtrack would distract other workers close by and where it may not be practical to wear headphones, as they would interfere with essential conversations. The same may be true at home, where one person wants to watch an instructional video on YouTube while normal family life goes on around them.

XML and Regulation

XML is a very general markup language which can be configured to suit a wide variety of individual applications. Each application can have its own dialect of XML dedicated to the specific requirements of that application. Examples include:

  • iXBRL (Inline eXtensible Business Reporting Language): used for financial reporting
  • S1000D: used for Aerospace & Defence documentation
  • GML (Geography Markup Language): used in Geographic Information Systems
  • DITA (Darwin Information Typing Architecture): a document format used by publishers
  • DocBook: a markup language for technical documentation
  • JATS: a vocabulary used for the preparation and publication of scholarly articles
  • NISO/STS: a vocabulary used for writing international standards

There are hundreds of these dialects, but a crucial point is that many of them are mandated by the bodies responsible for regulating the production and exchange of data and documentation within those industries. A good example is iXBRL. Within the last ten years, the Securities and Exchange Commission (SEC), the United Kingdom’s HM Revenue and Customs (HMRC), and Singapore’s Accounting and Corporate Regulatory Authority (ACRA), have all begun to require companies to use it, and other national regulators are following suit. When an XML standard is mandatory, it is vital that files are accurate and complete, and the editing process must ensure that this is the case.

TTML is becoming the de facto standard for captioning across the broadcasting and streaming video industry, with the Advanced Television Systems Committee (ATSC), the European Broadcasting Union (EBU), Freeview and many other organisations now mandating it. Accuracy in editing and transmitting TTML files is therefore becoming essential.

Streaming Services

In 2015, Netflix, Home Box Office (HBO), Telestream, SMPTE, and W3C received a Technology & Engineering Emmy Award for the category “Standardization and Pioneering Development of Non-Live Broadband Captioning,” for their work on TTML. TTML is growing in popularity for use with web-based applications, including Adobe Flash, Premiere Pro, and Microsoft Silverlight. It is also widely supported among video hosting and streaming services, such as YouTube, Netflix, and Amazon Video. Most video platforms such as Brightcove, Ooyala, and Kaltura (MediaSpace) also support TTML.

The TTML file format is supported by most video players, streaming platforms, authoring tools, and editing software, including:

  • YouTube
  • Netflix
  • Amazon Video
  • Yahoo
  • AOL
  • Vimeo
  • Dailymotion
  • YouView
  • Metacafe
  • Brightcove
  • Ooyala
  • Kaltura (MediaSpace)
  • Limelight Networks
  • Adobe Media Server
  • Adobe Connect
  • Adobe TV
  • Adobe Premiere Pro
  • Adobe Flash
  • Open Source Media Framework (OSMF)
  • Adobe Presenter
  • Panopto
  • VLC
  • Flowplayer
  • JW Player
  • Subtitle Edit
  • Microsoft PowerPoint 2013 & Office 365 with Office Mix

TTML Files

A TTML file is written in XML format. Its main parts, each described in turn below, are:

  • The < tt > Element
  • The < head > Element
  • The < body > Element
  • The < br > and < span > Elements
  • Time Formats

The outermost element is the Timed Text, or < tt > element. The other elements are nested between the < tt > and < /tt > tags, which mark the beginning and end of the < tt > element. The < head > element is optional. It contains information about styles, layouts, and document metadata. The < body > element contains the actual subtitles/captions.
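The nesting described above can be seen in a minimal sketch. The sample document below is hypothetical (its title and caption text are invented for illustration), and the helper names local_name and describe are not part of any TTML tooling; the sketch only uses Python's standard xml.etree.ElementTree module to confirm that < head > and < body > sit directly inside < tt >.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal TTML document: <tt> wraps an optional <head> and a <body>.
TTML_SKELETON = """
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <head>
    <metadata xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
      <ttm:title>Sample captions</ttm:title>
    </metadata>
  </head>
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.500">Watson, the game's afoot!</p>
    </div>
  </body>
</tt>
"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe(ttml_text):
    """Return (root element name, document language, names of top-level children)."""
    root = ET.fromstring(ttml_text)
    children = [local_name(child.tag) for child in root]
    return local_name(root.tag), root.get(XML_LANG), children
```

Calling describe(TTML_SKELETON) reports the root element, the xml:lang value and the two top-level sections in document order.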

In addition to being the root element (i.e., the overall container for the document), the < tt > element is also used to specify document level metadata. This info may include a document title, description, language, namespaces, and copyright information. In addition, arbitrary metadata drawn from other namespaces may be specified.

The < head > element specifies styles, regions, and metadata. Styles are used to indicate the desired look and feel of subtitles/captions. Regions define the size and location of the caption box. Metadata provides information about the document that might be used by editing, processing, or rendering tools.
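As a rough illustration of styles and regions, the sketch below parses a hypothetical < head > and collects the styling attributes declared on each < style > and < region > element. The identifiers (speech, sound, bottom) and the attribute values are invented for the example, and collect is an illustrative helper rather than part of any TTML library.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head>: two named styles and one region near the bottom of the frame.
HEAD_EXAMPLE = """
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling" xml:lang="en">
  <head>
    <styling>
      <style xml:id="speech" tts:color="white" tts:fontStyle="normal"/>
      <style xml:id="sound" tts:color="yellow" tts:fontStyle="italic"/>
    </styling>
    <layout>
      <region xml:id="bottom" tts:origin="10% 80%" tts:extent="80% 15%"/>
    </layout>
  </head>
  <body/>
</tt>
"""

NS = {"tt": "http://www.w3.org/ns/ttml"}
TTS = "{http://www.w3.org/ns/ttml#styling}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def collect(root, path):
    """Map each element's xml:id to its tts:* attributes (namespace stripped)."""
    found = {}
    for element in root.findall(path, NS):
        found[element.get(XML_ID)] = {
            name[len(TTS):]: value
            for name, value in element.attrib.items()
            if name.startswith(TTS)
        }
    return found


root = ET.fromstring(HEAD_EXAMPLE)
styles = collect(root, "tt:head/tt:styling/tt:style")
regions = collect(root, "tt:head/tt:layout/tt:region")
```

Elements in the body can then refer to these definitions by their identifiers through the style and region attributes.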

The < body > element contains the actual subtitles/captions. Each subtitle is wrapped in a < p > element, which contains the text to be shown. Each < p > element has “begin” and “end” attributes specifying when the subtitle appears on and disappears from the screen. Other attributes may be specified, the most common being style and region. The < div > element can be used to group < p > elements that all share some common attribute, such as language, region, or font style.
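A short sketch of reading captions from a body follows. The sample captions, the style and region identifiers, and the helper name list_cues are hypothetical. The fallback from a < p > to its parent < div > for the region is a deliberate simplification of how grouping works, not a full implementation of TTML inheritance.

```python
import xml.etree.ElementTree as ET

# Hypothetical body: one <div> groups two captions that share a region.
BODY_EXAMPLE = """
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div region="bottom">
      <p begin="00:00:01.000" end="00:00:03.000" style="sound">[Violin plays]</p>
      <p begin="00:00:03.500" end="00:00:06.000" style="speech">Holmes (excitedly): Watson, the game's afoot!</p>
    </div>
  </body>
</tt>
"""

NS = {"tt": "http://www.w3.org/ns/ttml"}


def list_cues(ttml_text):
    """Return one dict per <p>, with its timing, style, region and text."""
    root = ET.fromstring(ttml_text)
    cues = []
    for div in root.findall("tt:body/tt:div", NS):
        for p in div.findall("tt:p", NS):
            cues.append({
                "begin": p.get("begin"),
                "end": p.get("end"),
                "style": p.get("style"),
                # Simplification: use the <div> region unless the <p> sets its own.
                "region": p.get("region", div.get("region")),
                "text": "".join(p.itertext()).strip(),
            })
    return cues
```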

Markup can also be used within the caption text itself. For example, the < br/ > element is used to force a line break. The < span > element is used to change the font style of only a portion of the text.
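The effect of inline markup can be sketched as follows. The caption below is hypothetical, and caption_lines is an illustrative helper: it starts a new line at each < br/ > and keeps the text inside a < span >, ignoring the styling the span carries.

```python
import xml.etree.ElementTree as ET

TT = "{http://www.w3.org/ns/ttml}"

# Hypothetical caption with a forced line break and one italicised word.
P_EXAMPLE = (
    '<p xmlns="http://www.w3.org/ns/ttml" '
    'xmlns:tts="http://www.w3.org/ns/ttml#styling" '
    'begin="1s" end="4s">'
    'Holmes: Watson,<br/>the game\'s <span tts:fontStyle="italic">afoot</span>!'
    '</p>'
)


def caption_lines(p):
    """Flatten a <p> element into lines of text, breaking at each <br/>."""
    lines = [""]

    def walk(element):
        if element.tag == TT + "br":
            lines.append("")          # <br/> starts a new display line
            return
        lines[-1] += element.text or ""
        for child in element:
            walk(child)
            lines[-1] += child.tail or ""  # text that follows the child element

    walk(p)
    return lines
```

A renderer would additionally apply the italic style to the span; this sketch only recovers the line structure.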

Times can be expressed either in clock-time format or offset-time format. In either case, they are offsets that are typically relative to the beginning of the video (time zero). Clock-time format can be expressed in one of the following ways:

  • hours:minutes:seconds.fraction (e.g., “00:07:15.25”)
  • hours:minutes:seconds:frames (e.g., “00:07:15:06”)

Note that each segment is zero-padded to two digits. In the first example, seconds are expressed as fractional decimal. The example time of “00:07:15.25” represents 7 minutes, 15 seconds, and 250 milliseconds from the beginning of the video. In the second example, frames are used instead of fractional seconds. The example time of “00:07:15:06” represents the 6th frame after 7 minutes and 15 seconds have passed.
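The two clock-time forms can be converted to seconds with a small sketch like the one below. The function name clock_time_to_seconds is hypothetical, and the frame_rate parameter is an assumption: the real frame rate depends on the document and the video, so the default of 30 used here should be treated as a placeholder. Sub-frame components are not handled.

```python
import re

# hours:minutes:seconds, then either ".fraction" or ":frames" (both optional).
CLOCK_TIME = re.compile(r"^(\d{2,}):(\d{2}):(\d{2})(?:\.(\d+)|:(\d{2,}))?$")


def clock_time_to_seconds(value, frame_rate=30):
    """Convert a clock-time string to seconds from time zero.

    frame_rate is an assumed value, used only for the hours:minutes:seconds:frames form.
    """
    match = CLOCK_TIME.match(value)
    if match is None:
        raise ValueError(f"not a clock-time value: {value!r}")
    hours, minutes, seconds, fraction, frames = match.groups()
    total = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    if fraction is not None:
        total += float("0." + fraction)    # e.g. ".25" adds 250 milliseconds
    elif frames is not None:
        total += int(frames) / frame_rate  # e.g. frame 6 of 30 adds 0.2 s
    return total
```

With these assumptions, “00:07:15.25” converts to 435.25 seconds, and “00:07:15:06” converts to 435 seconds plus six frames at the chosen frame rate.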

Offset-time format is expressed as a single decimal number, optionally with a fractional part, followed by a unit indicator (also called a “metric”). The unit indicator can be one of the following: “h” (hours), “m” (minutes), “s” (seconds), “ms” (milliseconds), “f” (frames), “t” (ticks). The most common unit indicator is seconds. For example, the clock time of “00:07:15.25” expressed in offset-time would be “435.25s”, since 7 × 60 + 15.25 = 435.25.
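Offset-time values can be converted in the same spirit. In the sketch below, offset_time_to_seconds is a hypothetical helper, and both frame_rate and tick_rate are assumptions that a real processor would take from the document's parameters rather than from these defaults.

```python
import re

# A decimal number followed by one of the unit indicators.
OFFSET_TIME = re.compile(r"^(\d+(?:\.\d+)?)(h|ms|m|s|f|t)$")

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def offset_time_to_seconds(value, frame_rate=30, tick_rate=1):
    """Convert an offset-time string such as '435.25s' to seconds.

    frame_rate and tick_rate are assumed values for the 'f' and 't' units.
    """
    match = OFFSET_TIME.match(value)
    if match is None:
        raise ValueError(f"not an offset-time value: {value!r}")
    amount = float(match.group(1))
    unit = match.group(2)
    if unit in SECONDS_PER_UNIT:
        return amount * SECONDS_PER_UNIT[unit]
    if unit == "ms":
        return amount / 1000
    if unit == "f":
        return amount / frame_rate
    return amount / tick_rate  # unit == "t"
```

For instance, 7 minutes and 15.25 seconds can be written as “435.25s” or as “7.25m” plus a quarter of a second; both forms reduce to a single number of seconds that is easy to compare.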

For a more detailed description of the TTML file format and functions available go to: https://www.speechpad.com/captions/ttml

File Comparison

During the production and editing of a video for broadcast or streaming, several versions may be produced. As captioning is becoming mandatory in many environments, this may also entail the creation of several versions of the associated TTML file. This file may be created by the video producers themselves or by a contracted captioning company. In either case, more than one person may work on the file and numerous revisions may be created. It is essential that changes between versions can be easily identified and accepted or rejected. Doing this manually is tedious, error-prone, and time-consuming. What is needed is a tool which can make these comparisons automatically and which is, crucially, aware of the XML (TTML) structure of the files in order to correctly identify the significant differences.
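To illustrate why awareness of XML structure matters, the sketch below compares three hypothetical versions of one caption. It uses canonicalisation from Python's standard library (xml.etree.ElementTree.canonicalize, available from Python 3.8) as a stand-in technique; this is not how any particular comparison product works, only a minimal demonstration that a plain text comparison reports differences that are not significant in XML terms.

```python
import xml.etree.ElementTree as ET

# Hypothetical revisions of one caption.
# A and B differ only in attribute order, quoting and spacing; C changes the end time.
VERSION_A = '<p xmlns="http://www.w3.org/ns/ttml" begin="1s" end="3s">Hello</p>'
VERSION_B = "<p end='3s'  begin='1s' xmlns='http://www.w3.org/ns/ttml'>Hello</p>"
VERSION_C = '<p xmlns="http://www.w3.org/ns/ttml" begin="1s" end="4s">Hello</p>'


def same_xml(first, second):
    """True when two XML strings are identical after canonicalisation.

    strip_text=True also ignores leading and trailing whitespace in text,
    which is a simplification that may not suit every caption file.
    """
    return (ET.canonicalize(first, strip_text=True)
            == ET.canonicalize(second, strip_text=True))


text_differs = VERSION_A != VERSION_B                 # a text diff flags a change
structure_matches = same_xml(VERSION_A, VERSION_B)    # no meaningful change
timing_changed = not same_xml(VERSION_A, VERSION_C)   # a real edit to the caption
```

Canonicalisation only answers whether two files are equivalent. Identifying which element changed, and letting an editor accept or reject that change, requires a structure-aware comparison of the kind described above.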

Comparison solutions like DeltaXML’s XML Compare toolkit can be employed at every stage of the subtitle/caption editing and production process. XML Compare finds all the meaningful changes between any two XML files, reducing the time needed to review each revision. These files may be created by the video production company, the captioning company, or a third party. Accurate, automated comparisons help ensure that the correct TTML file is always associated with the appropriate video content. With standards like TTML becoming more regulated, it’s important that comparison solutions are prioritised within the creation, maintenance and review of TTML data.

Regulatory Requirements

As stated above, TTML is becoming the de facto standard for captioning across the broadcasting and streaming video industry. Many national and international regulatory bodies now require TTML captioning to be added to broadcast and streaming video content for the purposes of accessibility to deaf and hard-of-hearing viewers. Our software product, XML Compare, can help you take some of the effort out of producing and editing TTML captioning files by providing fast, automated, accurate version comparison and change tracking.

About DeltaXML

At DeltaXML we help developers who are working with XML documents and datasets and need to identify differences, manage change and merge documents. XML ‘differencing’ can be particularly challenging to implement in software, so we offer XML change and merge tools that can be embedded in almost any product or system, using simple, well-documented APIs.

Our patented approach analyses the structure of the XML files and applies attributes to identify all the relevant differences. The outputs are well-structured XML which can be easily interpreted by automated systems or presented in documents and editing tools.

Our software is embedded into major XML tools and used by blue-chip organisations in aerospace, finance and healthcare.

To book a one-to-one demonstration please visit: https://www.deltaxml.com/bookademo/

Or, to try our software free for 14 days, please visit: https://www.deltaxml.com/trials/