Labeling Methodology

This page documents the methodology used for extracting metadata from the cable TV news video dataset. Additional detail on our methodology can be found in our technical reports; see the About Us page or the paper page.

Data Update
Face Detection
- Face Descriptors
- Face Identity Tags
Commercial Segment Detection
Caption Time Alignment

Data Update

The Stanford Cable TV News Analyzer accesses video data provided by the Internet Archive's TV News Archive. The Internet Archive provides data on a 24 hour delay. As a result of this delay, plus the latency of additional data processing, new results appear on the Stanford Cable TV News Analyzer approximately 24-36 hours after a program's original air time.

Face Detection

We detected faces in video frames using the MTCNN [1,2] face detector. Due to the high cost of performing face detection on all frames in the dataset, we performed face detection on a subset of frames. Data prior to Jan. 1, 2019 is sampled uniformly at one frame every three seconds of video. Data after Jan. 1, 2019 is sampled uniformly at one frame per second. This process yielded a total of 306 million face detections (from Jan. 2010 to Jul. 2019). The Stanford Cable TV News Analyzer tabulates face screen time at the granularity of this sampling. For example, a face detection of Anderson Cooper in a single video frame (after Jan. 1, 2019) contributes one second to the estimate of Mr. Cooper's screen time in the video.
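The tabulation rule above can be illustrated with a minimal sketch. It assumes that a detection in pre-2019 data contributes three seconds (inferred from "the granularity of this sampling", not stated explicitly) and that Jan. 1, 2019 itself falls under one-second sampling. The function names are hypothetical and are not part of the Analyzer's code.

```python
from datetime import date

# Cutover date taken from the text; treating Jan. 1, 2019 itself as
# one-second sampling is an assumption.
SAMPLING_CUTOVER = date(2019, 1, 1)


def seconds_per_detection(air_date: date) -> int:
    """Seconds of screen time credited to one sampled-frame detection."""
    return 1 if air_date >= SAMPLING_CUTOVER else 3


def estimate_screen_time(num_detections: int, air_date: date) -> int:
    """Estimated screen time (in seconds) for a face within one video."""
    return num_detections * seconds_per_detection(air_date)


# One detection in a 2019 video contributes one second.
one_detection_2019 = estimate_screen_time(1, date(2019, 6, 1))
# Ten detections in a 2015 video contribute thirty seconds under the
# three-second assumption.
ten_detections_2015 = estimate_screen_time(10, date(2015, 3, 2))
```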

The location of each detected face is represented by an axis-aligned bounding box. To remind users that face detection is not occurring on all frames, the Stanford Cable TV News Analyzer's video player renders face bounding boxes in the frame in which a detection occurred, then fades out the box until the next sampled frame.

Validation: To estimate the precision and recall of the face detector, we manually counted the number of faces present in 2,500 randomly selected frames of the dataset (250 frames from each year from 2010-2019). We find that precision is 98.5% and recall is 74.5%. Recall is lower because the "ground truth" human annotations include all faces in the frame, including difficult-to-detect faces (e.g., out-of-focus, partially occluded, very small faces). A large fraction of recall errors are in frames with crowds (such as a political rally) where background faces are small and often partially occluded.

For each detected face, we compute the following per-face “tags” and descriptors:

Face Descriptors

We compute the 128-element FaceNet descriptor [3,4] from the pixels contained inside a face’s bounding box.

Face Identity Tags

We use the Amazon Rekognition Celebrity Recognition API to identify the detected faces. The API provides a face identity estimate only when its identification confidence score is greater than 0.5. We use all predictions above this 0.5 threshold, and do no additional thresholding. As a result of this process, the API returns an identity prediction for 45.2% of the faces in the dataset. Identities in the dataset for videos airing prior to Jan. 1, 2019 result from Celebrity Recognition API queries performed in September 2019. Identities for subsequent videos result from API queries made within a few days of the video's original air date. Amazon does not disclose the full list of individuals that can be recognized by the Amazon Rekognition Celebrity Recognition API.

To increase the percentage of faces with identity tags, we also add identity tags to faces that were not identified by the Celebrity Recognition API, but have close visual similarity to identified faces. This is accomplished with a nearest neighbor classifier on a per-video basis. For each face that has no identity label from the Celebrity Recognition API, we find all of its neighbors in face embedding space within a given L2 distance, then assign the unlabeled face the majority vote of those neighbors' labels. In total, 55.5% of the faces in the dataset have an identity tag. However, the Stanford Cable TV News Analyzer limits use of face screen time filters to individuals that have at least ten hours of estimated screen time as of Aug. 1, 2020.
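The per-video label propagation step can be sketched as follows. This is a simplified illustration, not the project's implementation: the function name, the distance threshold, the tie-breaking behavior, and the choice to count only API-labeled faces as voters are all assumptions, and "majority vote" is interpreted here as the most common label among neighbors.

```python
import math
from collections import Counter


def propagate_identities(embeddings, labels, max_distance):
    """Assign labels to unlabeled faces within a single video.

    embeddings: list of equal-length numeric sequences (one per face).
    labels: list of identity strings, or None for unlabeled faces.
    max_distance: L2 radius for neighbors (hypothetical parameter).
    """
    # Only faces that already carry a label are allowed to vote.
    labeled = [(e, l) for e, l in zip(embeddings, labels) if l is not None]
    result = list(labels)
    for i, (emb, lab) in enumerate(zip(embeddings, labels)):
        if lab is not None:
            continue
        votes = Counter(
            n_lab for n_emb, n_lab in labeled
            if math.dist(emb, n_emb) <= max_distance
        )
        if votes:
            # Most common neighbor label; ties resolve to the label
            # seen first (an arbitrary choice for this sketch).
            result[i] = votes.most_common(1)[0][0]
    return result
```

A face with no labeled neighbors inside the radius stays unlabeled, which is consistent with only part of the dataset carrying identity tags after this step.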

Commercial Segment Detection

We detect commercial segments using an algorithm that scans videos for sequences of black frames (which typically indicate the start and end of commercials) and for video segments where caption text is either missing or lower case. (We observed that these caption features are indicative of commercials in most videos in our dataset.) Source code for this algorithm is available at: https://github.com/scanner-research/esper-tv/blob/master/app/esper/commercial_detection_rekall.py. The algorithm is written using Rekall, an API for complex event detection in video.
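The two signals described above can be illustrated with a pair of simple predicates. This is only a sketch of the individual features: it does not use Rekall, it does not reproduce how the linked source code combines the signals into segments, and the function names and the brightness threshold are hypothetical.

```python
def is_black_frame(gray_frame, max_mean_intensity=10.0):
    """Return True if a grayscale frame is (nearly) black.

    gray_frame: list of rows of pixel intensities in [0, 255].
    max_mean_intensity: assumed threshold, chosen for illustration.
    """
    pixels = [p for row in gray_frame for p in row]
    if not pixels:
        return False
    return sum(pixels) / len(pixels) <= max_mean_intensity


def caption_suggests_commercial(caption_text):
    """Return True if caption text is missing or entirely lower case."""
    if not caption_text.strip():
        return True
    # str.islower() is True only when the text contains at least one
    # cased character and no upper-case characters.
    return caption_text.islower()
```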

From Jan. 2010 to Jul. 2019, there are 70,559 hours of detected commercials in the dataset, leaving 182,896 hours of program content. By default, video segments lying within commercials are excluded from query results in the Stanford Cable TV News Analyzer.

Validation: We performed human annotation of the location of commercial segments in a 20-hour sample of the dataset. (6.3 hours of this time fell within commercial segments.) On this test set, the commercial detector achieves a precision of 93.0% and a recall of 96.8%. Precision and recall are computed based on per-frame commercial detector results. Specifically, the precision of commercial detection is computed as the number of frames that the classifier correctly classified as commercials divided by the total number of frames the classifier estimated were part of a commercial segment.
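The per-frame precision defined above, together with the standard definition of recall (correctly detected commercial frames divided by all annotated commercial frames), can be computed as in the following sketch. The function name and the boolean-list input format are assumptions made for illustration.

```python
def per_frame_precision_recall(predicted, annotated):
    """Compute per-frame precision and recall.

    predicted: list of bools, True where the detector marked a
        frame as commercial.
    annotated: list of bools, True where a human marked the same
        frame as commercial.
    """
    pairs = list(zip(predicted, annotated))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    # Report 0.0 when a denominator is empty to avoid dividing by zero.
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```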

Caption Time Alignment

We use the Gentle word aligner to perform sub-second alignment of words in a video's closed captions to the video's audio track. (The source captions are only coarsely aligned to the video.) To perform alignment, we partition the video's audio track into one-second chunks and use Gentle to search for word-level alignment with caption text within +/- 10 seconds of this audio segment.

The Stanford Cable TV News Analyzer tabulates the screen time of caption-text queries using the duration of words determined via caption time alignment. For example, an utterance of the word "politics" that begins at 10:35.10 in a video and ends at 10:35.90 contributes 0.8 seconds to the estimate of the word's total screen time in the video. Since the Stanford Cable TV News Analyzer tabulates the screen time of caption words matching a query, the screen time estimate for a longer word (e.g., "Mississippi") will likely be greater than that of a shorter word (e.g., "Iraq") even if the two words are spoken the same number of times.
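As an illustration of this tabulation, the sketch below sums the aligned durations of words matching a single-word query. The tuple format for aligned words, the function name, and the case-insensitive matching are assumptions; the Analyzer's actual query handling may differ.

```python
def word_screen_time(aligned_words, query):
    """Sum the durations (in seconds) of aligned words matching a query.

    aligned_words: iterable of (word, start_seconds, end_seconds).
    query: a single word; matching here is case-insensitive.
    """
    target = query.lower()
    return sum(
        end - start
        for word, start, end in aligned_words
        if word.lower() == target
    )


# The example from the text: "politics" spoken from 10:35.10 to
# 10:35.90, expressed as seconds from the start of the video.
words = [
    ("the", 634.80, 635.00),
    ("politics", 635.10, 635.90),
    ("of", 636.00, 636.15),
]
politics_time = word_screen_time(words, "politics")  # about 0.8 seconds
```

Because durations are summed rather than occurrences counted, two words spoken equally often can yield different totals, as the "Mississippi" and "Iraq" comparison above describes.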

References

[1] MTCNN Source Page (https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html)

[2] Zhang, Kaipeng, et al. "Joint face detection and alignment using multitask cascaded convolutional networks." IEEE Signal Processing Letters 23.10 (2016): 1499-1503.

[3] FaceNet TensorFlow Implementation (https://github.com/davidsandberg/facenet)

[4] F. Schroff, D. Kalenichenko and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 815-823. doi: 10.1109/CVPR.2015.7298682 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7298682&isnumber=7298593

[5] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: a nucleus for a web of open data. In Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference (ISWC'07/ASWC'07).