COLUMBIA UNIVERSITY COMS W6998
SYSTEMS FOR HUMAN DATA INTERACTION

A4: Play with Wrangling/Extraction

Edits

In this assignment, you will try out interfaces for data and text extraction.

Wrangler: data prep

Trifacta Wrangler is an interface that employs a programming-by-example interaction modality.

Datasets:

2013films.txt: A dataset of all the movies in 2013 from January to March. It contains Movie name, Production house, Genre, Publisher and other details.

Access Trifacta for students from their website.

Some tips using Wrangler:

Your Task

Load the 2013films dataset into Wrangler, and then use the tool to generate output as follows, i.e., Movie name, Production/Distribution house, Director, Genre and publisher.

A dark truth, Magnolia Pictures, Damian Lee, Action,ComingSoon.net
Table No. 21, Eros International, Aditya Datt, Thriller, BoxOfficeMojo.com
[...more rows...]

For the purpose of explanation columns are separated by ||. You can choose any pattern to extract information.

  1. Movie name can be identified as first column in every line formatted as [[ <movie name> ]]
  2. Production/Distribution house is the following column that is formatted as [[< Production house>]]
  3. Director name can be identified with (director) tag that follows the name. Just extracting one name near the tag would be enough.
  4. Genre is present in the next column but make sure to extract only second part that is separated by the | operator. For eg. in [Action film|Action] relevant information is Action
  5. Publisher name can be identified in the last column with format publisher=<publisher name>
  6. It may help to skip first few lines that contains html code, so that you process actual records.

Questions you will submit in the google form

  1. Which production/distribution house produced the maximum movies?
  2. What is the most popular genre?
  3. Share your experience with Trifacta. Which part of the cleaning procedure does it make easier? Which part is hard? Any comments on Trifacta, pros or cons or further improvements.

Notes

Files you need to submit

DBT: Extract-Load-Transform using SQL

We will be using the Incarceration Trends dataset collected by the Vera Institute. We will be using the county-level data file. Open the Codebook. Starting from page 11, it describes the attributes in the dataset. Notice how a lot of attributes are actually encoded in the attribute names. For instance, aapi_pop_15to64 is a numeric value that stores the population count for Asian American/Pacific Islanders aged 15 to 64. This should clearly be three separate attributes population, ethnicity, age. The same is true for almost all of the measures.

You will use DBT to turn this dataset into a well-formed table.

Note:

This type of data, where lots of metadata is encoded in the attributes and described in a codebook, is incredibly common in public data repositories. It’s an open problem how to automatically/best transform these types of data into well-formed tables.

Questions you will answer in the google form:

  1. What does DBT do well in solving this problem? What limitations does it have?
  2. What would Wrangler do well in solving this problem? What limitations would it have?
  3. What might be a human data interface design to aid this task? Consider the language, the interface design, the interactions, the tasks.

Notes

Submission Instructions

Read BEFORE Submitting

Section-1: Wrangler
Section-2: DBT
Note:

Submit your responses here!