Percentages are of your total class grade.
- 2/12: extended initial prospectus deadline by one week to 2/19
The major portion of your grade is based on the research project. It should take about 3-4 weeks to complete.
Teams should consist of 1-3 people. In addition, if you have a project in mind, please indicate briefly (1–2 sentences) what you are thinking. We have included a list of possible projects at the end of this document although you are not required to choose from these.
Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research – e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.
Your ultimate research paper will describe the research problem, importance, hypothesis, related work, technical details and evaluation. The prospectus is a sketch to get you to think about these aspects. You will focus on describing a research problem and your hypothesis. You will also provide a first pass at related work, a short 2 paragraph description of how you plan to complete the project, and metrics to decide if it worked.
You should meet with Professor Wu prior to deciding your project.
Your prospectus should follow the example:
- Rename the filename of your prospectus to the following format, UNIs should be in alphabetical order.
- Click here to upload the file by 2/19 11:59PM EST
You will submit an updated version of your prospectus that contains a revised introduction (problem statement, hypothesis), and a substantially fleshed out related work section. It should clearly articulate the novelty of the problem with respect to state-of-the-art. You will need to find and review related literature, and look for software tools that may be related to your problem.
Some helpful tips:
- Few papers directly solve your problem, so you will need to be creative in finding related papers
- Papers can be relevant in many ways. They may
- help motivate your problem
- provide ideas that help you approach your problem
- directly solve your problem, but you have a better approach
- solve a similar technical problem, even though the application and use case are different
- solve a different problem, but their technique may be useful for your problem
- semanticscholar.org: for a given paper, it lists papers that cite it
- google scholar
- other papers’ related works sections
- Rename the filename to the following format, UNIs should be in alphabetical order.
- Click here to upload the file by 3/4 11:59PM EST
Prototype Check in
Your group will schedule 20 minutes to meet with Professor Wu to go over the project’s progress and receive feedback. Prepare a short 5 minute presentation with 4 slides (roughly 1 minute per slide):
- Problem and motivation
- Related work and challenges
- Progress so far
- Plan for rest of the project
Your team will prepare and present a project poster at the end-of-course poster session. This gives you an opportunity to present a short demo of your work and show what you have accomplished in the class!
- Attend and present at the poster session.
- Give a short 3 minute talk about your project
- 9 slides x 20 sec per slide
You will prepare a conference-style report on your project with maximum length of 12 pages (10 pt font or larger, one or two columns, 1 inch margins, single or double spaced – more is not better.) Your report should expand upon your prospectus and introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate.)
Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure you proofread and edit carefully!
- Rename the filename to the following format, UNIs should be in alphabetical order.
- Click here to upload file by 5/10 11:59PM EST
The following are examples of possible projects – they are by no means a complete list and you are free to select your own projects. In fact, a common source of ideas is to take your experience from another domain, and combine it with ideas from human data interaction. Another approach is to take concepts from the papers we read, and apply them to another domain. Projects often come in several flavors:
- Research project: model an unsolved problem, propose or extend an algorithmic solution, evaluate and report findings.
- Design: identify an underserved data problem for which a sound, composable interface doesn’t exist, propose an interface and interaction design, build and evaluate it.
- Fill a gap: think about something useful that should be easily doable, but is painful or impossible with current state of the art. Fill that gap.
New Querying Interfaces
Video databases are on the horizon. However, a major limitation is that the query interface is incredibly impoverished. How do you specify that you want to find red cars that move along a trajectory? Or to look for relationships between two objects over time? Certainly not by writing SQL-like text queries. The challenge is that video is fundamentally 3D, but query interfaces are 1D.
- Idea 1: the core abstraction in relational algebra is Joins. In video, it is likely also joins, but for the same image across video frames, or the relationship between objects across video frames. The nature of trajectories, positioning, and timing are all core aspects to relating concepts in video. Propose and implement a prototype to help users express video joins.
- Idea 2: VR can render videos as 3D objects. What does a query language look like if designed for VR? What types of joins, or filtering, make sense? You should have VR experience.
What We Talk About When We Talk About Data
How are data and analyses referred to and described in scientific work? When data is presented as figures or tables, how is it referred to? What are the verbs and nouns? Is there a universal set of ways that figures are described (e.g., in terms of comparisons? in relative terms?). This can serve as the evidence for a new data analysis language.
Analyze papers in arXiv or Viziometrics for their figures, captions, and surrounding text (arXiv provides LaTeX files).
A Task-oriented Language
Visualization tools and languages such as Polaris, Vega-Lite, and others focus on helping users specify the layout, visual encodings, and implicitly, the grouping and aggregations, of their data. However, choosing the appropriate aggregations, layouts, and visual encodings to answer a specific analysis task is quite challenging. For instance, suppose a dataset contains attributes A and B. If the task is “compare A and B”, then at first glance, a scatter plot makes sense. However, what if B only contains the two values “1990” and “2000”? Then, it makes more sense to compare the distributions of A for the years 1990 and 2000. Design a language that makes it easy for users to specify the task, and a compiler that generates the best visual presentation of the data for the task.
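To make the “compare A and B” example concrete, here is a minimal sketch of the kind of rule such a compiler might apply. The function name `choose_view`, the `max_groups` threshold, and the returned labels are all hypothetical and are not part of any existing tool; a real project would need a far richer task model.

```python
from typing import Hashable, Sequence


def choose_view(a: Sequence[float], b: Sequence[Hashable], max_groups: int = 5) -> str:
    """Pick a presentation for the task "compare A and B".

    Hypothetical rule: if B has only a few distinct values, treat it as a
    grouping attribute and compare the distribution of A per group;
    otherwise fall back to a scatter plot of A against B.
    """
    if len(a) != len(b):
        raise ValueError("A and B must have the same number of rows")
    distinct_b = set(b)
    if len(distinct_b) <= max_groups:
        return "grouped_distribution"
    return "scatter"


# B only contains the two values 1990 and 2000, so compare distributions of A.
print(choose_view([3.1, 4.2, 2.8, 5.0], [1990, 2000, 1990, 2000]))
```

The cardinality threshold is only one possible signal; attribute types, the task verb, and the data distribution would likely all matter in practice.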
Precision interfaces analyzes query logs and generates custom interaction components from the logs. The goal is to scalably generate dozens or hundreds of custom interactive analysis interfaces for any analysis found in a log.
- Precision interfaces is currently language agnostic and does not take into account the database nor the database contents. Adapting the system to make weak but general assumptions about the nature of query plans, data, and query results can potentially improve the usability and usefulness of the generated interfaces.
- Visualization design algorithms such as Draco propose ways to measure the “appropriateness” and “effectiveness” of a visualization. HCI research has studied UI complexity for software interfaces based on ideas such as GOMS and Fitts’s law. Given a candidate interactive visualization interface (views and interactions) as well as a “workload” of queries users want to express, devise an “interactive interface appropriateness” measure.
Core Data Processing for Viz
- Perceptual push-down: why compute what cannot be seen? Our prior perceptual studies have found interesting trade-offs between different approximation techniques. Build on our findings and prototype a system that intelligently picks between different approximation and optimization options.
- Request Probabilities: instead of the typical request-response model of interaction, what if the client constantly sends probability distributions of what the user might do? What if the server constantly sends data to the client at maximum bandwidth?
- How does a data processing system execute a probability distribution of queries?
- What data should the server send to the client?
- Run studies: What are humans able to perceive anyways? Run perception studies to build user perception models that could be used for perceptual push-down.
- Pick a class of visual analyses, and get it to scale to 10M+ points in the browser using technologies such as Arrow.js.
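As one possible starting point for the Request Probabilities idea above, the sketch below shows a server deciding what to push given a probability distribution over the user's next query and a bandwidth budget. The function `plan_prefetch`, the query names, and the greedy probability-per-byte rule are assumptions made for illustration, not an established design.

```python
from typing import Dict, List


def plan_prefetch(
    query_probs: Dict[str, float],
    result_sizes: Dict[str, int],
    budget_bytes: int,
) -> List[str]:
    """Greedily choose which query results to push to the client.

    Queries are ranked by probability per byte (sizes must be positive),
    and results are added while they still fit in the budget.
    """
    ranked = sorted(
        query_probs,
        key=lambda q: query_probs[q] / result_sizes[q],
        reverse=True,
    )
    plan: List[str] = []
    used = 0
    for query in ranked:
        size = result_sizes[query]
        if used + size <= budget_bytes:
            plan.append(query)
            used += size
    return plan


# Made-up distribution over the user's next interaction.
probs = {"zoom_in": 0.6, "pan_left": 0.3, "filter": 0.1}
sizes = {"zoom_in": 300, "pan_left": 100, "filter": 500}
print(plan_prefetch(probs, sizes, budget_bytes=450))
```

A greedy rule like this ignores overlap between query results and the option of sending approximate answers, both of which are natural directions for a project.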
PDFs + tables
- Public datasets (UCI ML data, government datasets) are often accompanied by description files that describe the attributes and the contents. Automatically identify the segments in the description files that relate to attributes in the dataset, and create a tool for users to make use of these metadata files to assist them as they learn about a new dataset. Make it work for some simple domains (datasets)
Query The Web
- Web pages are simply views over an underlying dataset (e.g., Amazon is a product database that renders product information) combined with query interactions (e.g., filter by clicking on a product category). However, the interactions are fixed by the developer. Identify the schema of a website’s data, let users write SQL queries over the schema, and make it work.
Which Optimization Makes Sense?
- It’s currently unclear whether sampling or data cubes or other optimizations are the most appropriate for any given visualization + interaction. Run studies to better understand the trade-offs.
- Bonus: use trade-offs to recommend optimizations for new interactive visualizations.
Run some perceptual studies:
- What are humans able to perceive anyways? Run perception studies to build user perception models that could be used for perceptual push-down.
- Related: pfunk, Section 3.2 of CIDR paper
Data file formats
- Given a random binary or text data file, it’s a huge amount of work to identify a schema and extract structured data from it. But there are plenty of binary and text data files to learn from! Train a deep learning/parsing model over a large variety of data files/formats and use the trained model to “parse” a new data file.
- Augment, not Replace: I suspect that analysts don’t want to perform NLP/voice-only data analysis, but would rather use voice to augment their programming-based analysis. For example, if the analyst asks “what’s that?” then it probably has to do with the part of the visualization that the cursor is pointing to. Survey existing human computer interaction literature on multi-modal approaches to data analysis (or run an informal user study) and build a prototype using Alex/NLP that augments a data scientist’s job. Some ideas of what to augment:
- While a user explores an interactive visualization, automatically zoom in, highlight data, generate new views, etc., based on the user’s comments.
- Data science analysis session
- Checklist Manifesto: Customer service representatives (and most chat bots) follow a fill-in-the-blank rubric when communicating with users/customers. The purpose is to extract the most information to solve your problem in the least amount of time. Given a collection of chat logs, can you infer an optimized rubric?
- Google Time: With Google Maps, people can browse the world on their laptop. The aim of Google Time is to do the same thing, but for time instead of space. The project is made of two parts: (1) extract as many dates as possible from public data sets, to obtain a huge database of dates, and (2) create a browsing system to explore this timeline in real time.
- Audification: While data visualization is well understood, its small cousin audification is still in its infancy [https://en.wikipedia.org/wiki/Audification]. The aim of this project is to answer the question: what is the grammar of audification? What would the “Tableau of audio” look (sound) like?
Explanation and Cleaning
- But, Why? Identify a context during data exploration (either in a visualization system, or via any other modality) where the user will naturally ask “why?” and expect an explanation. Formalize the context, formalize the problem and develop a prototype solution.
- Cleaning and Extraction Pushdown: Data collected from the web (e.g., from a form) is typically used by downstream applications for a variety of purposes such as training data for models, or to analyze using queries. However if the data is not collected and validated appropriately, then the analyst needs to perform expensive data cleaning to fix errors, or extract structured information from free-form text. Is there a way, given the downstream applications or the existing data cleaning steps, to augment the input form so that the submitted data is already clean and in the appropriate form?
- Will it Clean? Even automatic error detection is notoriously difficult due to the ambiguous notion of what “clean” means. However, in data science applications, the test data for the prediction model provides a crisp notion of “clean” and has been used in BoostClean to perform automatic error detection and cleaning. BoostClean only worked for simple static datasets: extend its ideas to streaming datasets where the errors may change over time.
- Excel Sucks: Many very important datasets are shared as a big collection of Excel files. For example, the Equality of Opportunity Project shares their data as 6 - 15 Excel files for each category, and you end up needing to figure out how to download all of them, identify the schema, and join them together to even start exploration. Your project is to fix this. Let me point to a website with links, get the Excel files, automatically infer the joins (they may be at different granularities such as county and state!) to produce a single set of database tables to query.
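To illustrate the different-granularity join mentioned in the last idea above, here is a minimal sketch that attaches state-level attributes to county-level rows. The column names (`county`, `state`, `income`, `mobility`) and the values are made up for illustration; the hard part of the project is inferring the join key automatically, which this sketch simply assumes is known.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_county_with_state(
    county_rows: List[Row], state_rows: List[Row], key: str = "state"
) -> List[Row]:
    """Attach coarser (state-level) attributes to finer (county-level) rows.

    State columns are prefixed with "state_" to avoid name clashes.
    County rows without a matching state are kept unchanged.
    """
    state_by_key = {row[key]: row for row in state_rows}
    joined: List[Row] = []
    for county in county_rows:
        merged = dict(county)
        match = state_by_key.get(county[key], {})
        for column, value in match.items():
            if column != key:
                merged[f"state_{column}"] = value
        joined.append(merged)
    return joined


# Made-up rows standing in for two spreadsheets at different granularities.
counties = [
    {"county": "Kings", "state": "NY", "income": 50},
    {"county": "Essex", "state": "NJ", "income": 40},
]
states = [{"state": "NY", "mobility": 0.1}]
for row in join_county_with_state(counties, states):
    print(row)
```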
Recommendations and Predictions
- Causality and viz: Recently, the ML community has made great progress in the field of causality detection, i.e., understanding which variable causes another variable in a data set (see this paper). Can these methods help recommend interesting visualization views?
- Text and viz: In many cases, datasets come with a text description of what they contain. For instance, UCI repo data often include a file that describes the columns. How can you mine this information to recommend interesting new visualizations? Can you make it better with external knowledge, e.g., a knowledge base or a Web crawl?
- Predicting crime: You work for the FBI, you lead a team of 30 agents, and you just discovered this dump of dark web marketplaces:
Where will you send your agents?