Labeling Methodology

This page documents the methodology used for extracting metadata from the cable TV news video dataset. Additional detail on our methodology can be found in our arXiv technical report.


Data Update

The Stanford Cable TV News Analyzer accesses video data provided by the Internet Archive's TV News Archive. The Internet Archive provides data on a 24 hour delay. As a result of this delay, plus the latency of additional data processing, new results appear on the Stanford Cable TV News Analyzer approximately 24-36 hours after a program's original air time.


Face Detection

We detected faces in video frames using the MTCNN [1,2] face detector. Due to the high cost of performing face detection on all frames in the dataset, we performed face detection on a subset of frames. Data prior to Jan. 1, 2019 is uniformly sampled every three seconds in a video. Data after Jan. 1, 2019 is sampled uniformly sampled at one frame per second. This process yielded a total of 306 million face detections (from Jan. 2010 to Jul. 2019). The Stanford Cable TV News Analyzer tabulates face screen time at the granularity of this sampling. For example, a face detection of Anderson Cooper in a single video frame (after Jan. 1, 2019) contributes one second to the estimate of Mr. Cooper's screen time in the video.

The location of each detected face is represented by an axis-aligned bounding box. To remind users that face detection is not occurring on all frames, the Stanford Cable TV News Analyzer's video player renders face bounding boxes in the frame in which a detection occurred, then fades out the box until the next sampled frame.

Validation: To estimate the precision and recall of the face detector, we manually counted the number of faces present in 2,500 randomly selected frames of the dataset (250 frames from each year from 2010-2019). We find that precision is 98.5% and precision is 74.5%. Recall is lower because the "ground truth" human annotations include all faces in the frame, including difficult to detect faces (e.g., out-of-focus, partially occluded, very small faces). A large fraction of recall errors are in frames with crowds (such as a political rally) where background faces are small and often partially occluded.

For each detected face, we compute the following per-face “tags” and descriptors:


Face Descriptors

We compute the 128-element FaceNet descriptor [3,4] from the pixels contained inside a face’s bounding box.


Face Gender Tags

We tag all faces with an estimate of their presented binary gender (male/female) using a binary classifier that operates on a face's FaceNet descriptor as input. We note that treating an individual’s gender as a binary quantity, as well as assessing gender solely from an individual's appearance, is a grossly simplified treatment of a complex topic. However, we believe that binary classification can still provide useful insights into the presentation of cable TV news. (Further discussion about our decision to include gender tags is available on our FAQ.)

To train the classifier, we manually annotated the presented binary gender of 12,669 faces selected at random from the dataset. Each face was annotated by a single human annotator. (No additional consensus protocol was used.) We used these ground-truth annotations to train a binary classifier based on the precomputed FaceNet descriptors. Our implementation uses a k-NN classifier with classification results determined by majority vote with k=7.

Validation: We computed the classifier's agreement with a single human annotator on a test set of 6,000 faces. Faces were sampled from the first ten years of the dataset (Jan. 1, 2010 to Dec. 31, 2019). We stratified samples by year and channel, selecting 200 faces at random from each of the 30 slices (10 years x 3 channels). The human annotator judged 4,109 of the 6,000 test set faces (68.5%) as male-presenting individuals and 1,891 (31.5%) as female-presenting individuals.

The confusion matrix for our gender estimation model is given below. Overall the model agrees with human annotations 97.2% of the time. When a face is judged to be male-presenting by a human, the model agrees with the human 98.8% of the time. When a face is judged to be female-presenting, the model agrees with the human 93.8% of the time.

The table below gives the accuracy of our presenting-gender estimation model for each (year, channel) slice. Accuracy is judged as agreement with results provided by a human annotator.

CNN Fox News MSNBC
Year Accuracy (overall/male/female) Accuracy (overall/male/female) Accuracy (overall/male/female)
2010 97.097.995.0 98.5100.095.1 99.0100.096.9
2011 97.099.392.2 98.599.396.6 98.0100.091.5
2012 96.599.291.3 95.597.192.2 97.597.398.1
2013 97.598.595.2 96.598.592.9 97.098.693.3
2014 98.098.397.5 98.0100.093.7 99.0100.097.5
2015 96.598.592.5 98.0100.093.7 95.096.690.4
2016 95.098.489.6 96.598.691.7 96.597.993.1
2017 97.598.595.4 98.097.4100.0 98.0100.093.4
2018 98.599.396.6 98.099.395.1 95.597.092.5
2019 96.0100.088.7 96.599.291.0 96.598.691.9


Face Identity Tags

We use the Amazon Rekognition Celebrity Recognition API to identify the detected faces. The API provides a face identity estimate only when its identification confidence score is greater than 0.5. We use all predictions above this 0.5 threshold, and do no additional thresholding. As a result of this process, the API returns an identity prediction for 45.2% of the faces in the dataset. Identities in the dataset for videos airing prior to Jan. 1, 2019 result from Celebrity Recognition API queries performed in September 2019. Identities for subsequent videos result from API queries made within a few days of the video's original air date. Amazon does not disclose the full list of individuals that can be recognized by the Amazon Rekognition Celebrity Recognition API.

To increase the percentage of faces with identity tags, we also add identity tags to faces that were not identified by the Celebrity Recognition API, but have close visual similarity to identified faces. This is accomplished with a nearest neighbor classifier on a per video basis; we take the faces that have no identity label provided by the Celebrity Recognition API, and find all of their neighbors in face embedding space within a given L2 distance, and then assign the unlabeled face the majority vote of those labels. In total, 55.5% of the faces in the dataset contain an identity tag. However, the Stanford Cable TV News Analyzer limits use of face screen time filters to individuals that have at least ten hours of estimated screen time as of Aug. 1, 2020.


Commercial Segment Detection

We detect commercial segments using an algorithm that scans videos for sequences of black frames (which typically indicate the start and end of commercials) and for video segments where caption text that is either missing or lower case. (We observed that these caption features are indicative of commercials in most videos in our dataset.) Source code for this algorithm is available at: https://github.com/scanner-research/esper-tv/blob/master/app/esper/commercial_detection_rekall.py. The algorithm is written using Rekall, an API for complex event detection in video.

From Jan. 2020 to Jul. 2019, there are 70,559 hours of detected commercials in the dataset, leaving 182,896 hours of program content. By default, video segments lying within commercials are excluded from query results in the Stanford Cable TV News Analyzer.

Validation: We performed human annotation of the location of commercial segments in a 20-hour sampling of the dataset. (6.3 hours of this time fell within commercial segments.) On this test set, the commercial detector achieves a precision of 93.0% and a recall of 96.8%. Precision and recall are computed based on per-frame commercial detector results. Specifically, the precision of commercial detection is computed as the faction of frames that the classifier correctly classified as commercials divided by the total number of frames the classifier estimated were part of a commercial segment.


Caption Time Alignment

We use the Gentle word aligner to perform sub-second alignment of words in a video's closed-caption caption to the video's audio track. (The source captions are only coarsely aligned to the video.) To perform alignment, we partition the video's audio track into one-second chunks and use Gentle to search for word-level alignment with caption text within +/- 10 seconds of this audio segment.

The Stanford Cable TV News Analyzer tabulates the screen time of caption-text queries using the duration of words determined via caption time alignment. For example, an utterance of the word "politics" that begins at 10:35.10 in a video and ends at 10:35.90 contributes 0.8 seconds to the estimate of the word's total screen time in the video. Since the Stanford Cable TV News Analyzer tabulates the screen time of caption words matching a query, the screen time estimate for a longer word (e.g., "Mississippi") will likely be greater than that of a shorter word (e.g., "Iraq") even if the two words are spoken the same number of times.


References

[1] MTCNN Source Page (https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html)

[2] Zhang, Kaipeng, et al. "Joint face detection and alignment using multitask cascaded convolutional networks." IEEE Signal Processing Letters 23.10 (2016): 1499-1503.

[3] FaceNet TensorFlow Implementation (https://github.com/davidsandberg/facenet)

[4] F. Schroff, D. Kalenichenko and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 815-823. doi: 10.1109/CVPR.2015.7298682 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7298682&isnumber=7298593

[5] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: a nucleus for a web of open data. In Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference (ISWC'07/ASWC'07).