Advanced Query Documentation

This page documents the full query language supported by the Stanford Cable TV News Analyzer. Before reading this documentation, we recommend that you read the getting started tutorial.

Basic Query Syntax

All queries compute the total time of video segments in the dataset that match the query's filters. Screen time queries are broken into several parts. A basic query consists of filters separated by "AND"s.

For example, the following query computes the screen time of Kamala Harris on CNN:

`name="Kamala Harris" AND channel="CNN"`

Likewise, "OR" is also supported.

The following query counts the total video time on CNN or MSNBC:

`channel="CNN" OR channel="MSNBC"`

Queries can combine "AND" and "OR", using parentheses to construct more complex queries. If no parentheses are specified, AND takes precedence over OR.
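
The effect of precedence can be illustrated outside the query language. The Python sketch below is only an analogy: Python's `and` also binds more tightly than `or`, so it groups conditions the same way. The segment records and helper names are hypothetical and are not part of the analyzer.

```python
# Hypothetical segment records, used only to illustrate operator grouping.
segments = [
    {"channel": "CNN", "name": "Kamala Harris"},
    {"channel": "MSNBC", "name": "Kamala Harris"},
    {"channel": "MSNBC", "name": "someone else"},
]


def is_harris(segment):
    return segment["name"] == "Kamala Harris"


def on_channel(segment, channel):
    return segment["channel"] == channel


# No parentheses: grouped as (name AND CNN) OR MSNBC,
# so every MSNBC segment matches regardless of who is on screen.
implicit = [
    s for s in segments
    if is_harris(s) and on_channel(s, "CNN") or on_channel(s, "MSNBC")
]

# Explicit parentheses: name AND (CNN OR MSNBC).
explicit = [
    s for s in segments
    if is_harris(s) and (on_channel(s, "CNN") or on_channel(s, "MSNBC"))
]

print(len(implicit), len(explicit))
```

The two groupings select different segments, which is why parentheses matter whenever AND and OR appear in the same query.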

Putting the examples above together, the following query computes the screen time of Kamala Harris on CNN or MSNBC:

`name="Kamala Harris" AND (channel="CNN" OR channel="MSNBC")`

If the query is left blank, then no filters are applied and all of the data is counted.
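
One way to reason about AND and OR is as operations on time intervals: AND keeps the time covered by both filters, OR keeps the time covered by either, and the result of the query is the total length of what remains. The following Python sketch illustrates this under simplifying assumptions: every filter yields a sorted list of non-overlapping (start, end) intervals in seconds on one shared timeline, and the function names and sample data are hypothetical. It is not the analyzer's implementation.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def union(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """OR: combine two interval lists, merging any overlaps."""
    merged: List[Interval] = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersection(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """AND: keep only the time covered by both (sorted, disjoint) lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def total_time(intervals: List[Interval]) -> float:
    """Total number of seconds covered by the intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical filter outputs, in seconds.
harris = [(10.0, 20.0), (50.0, 65.0)]
cnn = [(0.0, 30.0)]
msnbc = [(40.0, 60.0)]

# Shape of: name AND (channel OR channel)
result = intersection(harris, union(cnn, msnbc))
print(result, total_time(result))
```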

Supported Query Filters

Filters on Entire Videos
channel	description	name of the channel
	values	CNN, FOX, MSNBC
	default	`all`
	example	`channel="CNN"`
show	description	name of the show
	values	list of shows
	default	`all`
	example	`show="CNN Newsroom"`
hour	description	hour or range (inclusive) of hours in 24h format, in US Eastern Time (UTC-5:00 in standard time and UTC-4:00 during daylight saving time)
	values	0-23
	default	`0-23`
	example	`hour="10"` `hour="9-17"`
dayofweek	description	day of the week or range (inclusive) of days
	values	mon, tue, wed, thu, fri, sat, sun
	default	`mon-sun`
	example	`dayofweek="mon"` `dayofweek="sat-sun"`
Filters on Detected Faces
name	description	face of the person with the specified name is on screen
	values	name (See the people page for a complete list of people.)
	default	n/a
	example	`name="Kamala Harris"`
tag	description	face with the specified tag(s) is on screen. Multiple tags can be specified with commas.
	values	(non_)presenter (See the tags page for a complete list of tags.)
	default	n/a
	example	`tag="presenter"`
facecount	description	number of faces on screen
	values	1 or more
	default	n/a
	example	`facecount=2`
Filters on Closed Caption Transcripts
text	description	segments where the specified text pattern appears in the captions
	values	keywords or phrases. Use `|` for "or". (See Text Filter Syntax for more details on valid text patterns.)
	default	n/a
	example	`text="affordable care act"` `text="affordable care act | obamacare | obama care"`
textwindow	description	Specifies how much to dilate the time interval associated with each text filter match. If the text window is 0, the text filter selects exactly the video segment during which a word or phrase is being said. Increasing the window above 0 makes it possible to design queries in which segments matching another filter need only be within a certain amount of time of a segment matching the text filter. For example, if the word "obamacare" is said and the text window is 1 second, then each instance of "obamacare" is converted to a 1 second interval centered on the time when "obamacare" is said. For long windows, overlapping intervals are merged.
	values	window length in seconds (0 or more)
	default	`1`
	example	`textwindow=10` (treat each text match as 10 seconds)
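
The dilation and merging described for textwindow can be sketched in a few lines of Python. This is a minimal illustration that assumes each mention is a single timestamp in seconds (real mentions have a short duration) and that the window is centered on that timestamp. `dilate` is a hypothetical helper, not the analyzer's code.

```python
def dilate(mentions, window):
    """Turn mention timestamps into intervals of width `window`, merging overlaps.

    mentions: timestamps in seconds (assumed to be instantaneous)
    window:   total width of each interval in seconds, centered on the mention
    """
    half = window / 2
    intervals = sorted((t - half, t + half) for t in mentions)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of double counting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical mention times of one word, in seconds.
mentions = [5.0, 12.0, 100.0]

narrow = dilate(mentions, 1)   # three separate 1 second intervals
wide = dilate(mentions, 10)    # the first two intervals overlap and are merged
print(narrow)
print(wide)
```

With the wider window, the first two mentions contribute 17 seconds rather than 20, because the overlapping portion is counted once.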

Text Filter Syntax

Listing words and phrases (word A or word B)

Sometimes a topic can be defined with multiple related or synonymous words/phrases. For example, the "European Union" can also be referred to as the EU or E.U. in the captions. When this is the case, use the "|" character to delimit multiple words and phrases. For example, text="European Union | EU | E.U." will search for video segments where any of these three n-grams appears in the captions. This can be repeated for an arbitrary number of words and phrases.

Listing words and phrases (word A and/not word B)

You can also search for instances where words appear near each other using "&" (and). For example, to find instances of "United" near "Airlines", use text="United & Airlines". This can be chained; for example, text="United & Airlines & 737 MAX". Not ("\") works similarly; for instance, text="United \ States \ Kingdom" finds instances of "United" that are near neither "States" nor "Kingdom".

By default, the threshold for nearness is 15 seconds. This can be modified using the following ("::") syntax: text="United \ States :: 60", which sets the window for "\" to 60 seconds. Use "//" to change the window policy to tokens; for example, text="United \ States // 100" finds "United" with no instances of "States" within 100 tokens.
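
The nearness rule for "\" can be pictured as a filter on timestamps. The following Python sketch assumes each mention is a single timestamp in seconds and that "near" means within the threshold, inclusive. Both are simplifications, and `not_near` is a hypothetical helper rather than the analyzer's implementation.

```python
def not_near(targets, others, threshold=15.0):
    """Keep target mentions that have no `others` mention within `threshold` seconds."""
    return [
        t for t in targets
        if all(abs(t - o) > threshold for o in others)
    ]


# Hypothetical mention times, in seconds.
united = [10.0, 100.0, 300.0]
states = [12.0, 140.0]

default_window = not_near(united, states)                 # 15 second threshold
wider_window = not_near(united, states, threshold=60.0)   # analogous to ":: 60"
print(default_window, wider_window)
```

Widening the threshold can only remove matches: the mention at 100 seconds survives the 15 second threshold but is excluded at 60 seconds because "States" is said 40 seconds later.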

Query syntax and semantics

The text operators "&" and "|" behave differently from AND and OR. AND and OR combine the intervals returned by filters, while "&" and "|" work inside a single text filter to decide which intervals that filter returns. The query text="United & Airlines" finds the intervals of "United" and the intervals of "Airlines" that are near one another. These intervals are of duration "textwindow" (by default, 1 second). In contrast, text="United" AND text="Airlines" finds the intervals of "United" and of "Airlines" separately and returns only their exact time overlap.
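
The difference between the two forms can be made concrete with a small Python sketch. It assumes instantaneous mentions, a 15 second nearness threshold, and a 1 second text window, and it does not merge overlapping intervals. All function names are hypothetical, and the analyzer's actual behavior may differ in its details.

```python
def window_around(times, window=1.0):
    """One interval of width `window` centered on each mention time."""
    half = window / 2
    return [(t - half, t + half) for t in sorted(times)]


def text_and_operator(a_times, b_times, threshold=15.0, window=1.0):
    """Sketch of text="A & B": mentions of either word that are near the other."""
    a_keep = [t for t in a_times if any(abs(t - o) <= threshold for o in b_times)]
    b_keep = [t for t in b_times if any(abs(t - o) <= threshold for o in a_times)]
    return window_around(a_keep + b_keep, window)


def query_and(a_times, b_times, window=1.0):
    """Sketch of text="A" AND text="B": exact overlap of the two interval sets."""
    out = []
    for a_start, a_end in window_around(a_times, window):
        for b_start, b_end in window_around(b_times, window):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return out


# Hypothetical mention times: the two words are said 2 seconds apart.
united = [10.0]
airlines = [12.0]

nearby = text_and_operator(united, airlines)
overlap = query_and(united, airlines)
print(nearby, overlap)
```

In this example the "&" form returns two 1 second intervals, while the AND form returns nothing, because the two 1 second windows never overlap in time.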

The text grammar also supports basic composition of "&", "|", and "\". For example, text="United \ States \ Kingdom" is equivalent to text="United \ (States | Kingdom)", expressing instances of "United" that are not near either "States" or "Kingdom". Parentheses are necessary to separate clauses, and different operators may not be mixed within a clause. The full details of the text query grammar can be found here.

Inflections of words

Words can be used in many inflected forms. The simplest case is when words are singular or plural. To search for all inflected forms of a word without specifying them manually, surround the word with [...] brackets. For example, text="[truck]" will find instances of "truck", "trucks", and "trucking". If multiple words are surrounded by [...], then inflections will be found for any of the words in the brackets.

Time windowing around mentions

By default, text will precisely find the intervals of time during which a word or phrase is spoken. This means that each mention will likely contribute only a small fraction of a second of screen time to a query result. Sometimes it is useful for a caption-text query to match a wider region of time around the utterance of a word, for example if a query seeks examples where a person is on screen within a specific amount of time of a word being stated. The textwindow parameter defines how much the time interval around a caption-text match is dilated. See the "Supported Query Filters" section for details.

Normalizing Query Results

Instead of computing screen time estimates in absolute time units (e.g., in minutes or hours), it can be useful to present query results as a proportion of the screen time of another query. The query language supports normalization of one query's computed time by another using NORMALIZE:

For example, the following query computes the fraction of the overall dataset that is from CNN:

The following query computes the fraction of time on CNN that a news presenter is on screen:
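
The arithmetic behind normalization is a ratio of two screen-time totals. The Python sketch below uses hypothetical totals and a hypothetical `normalize` helper. How the analyzer handles an empty denominator is not stated here, so returning `None` in that case is an assumption.

```python
def normalize(numerator_seconds, denominator_seconds):
    """Express one query's total time as a fraction of another query's total time."""
    if denominator_seconds == 0:
        # Assumption: the ratio is treated as undefined when the
        # denominator query matches no video time.
        return None
    return numerator_seconds / denominator_seconds


# Hypothetical totals, in seconds.
presenter_on_cnn = 1800.0  # time a presenter is on screen on CNN
all_of_cnn = 3600.0        # all CNN video time

fraction = normalize(presenter_on_cnn, all_of_cnn)
print(fraction)
```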