Percentages refer to your total class grade. Note: all deadlines are at 11:59pm EST.
The major portion of your grade is based on the research project. Students will organize into teams of 1-2 and work on a research project, which should take about 3-4 weeks to complete. Some possible ideas are described below.
Teams should consist of 1-2 people (we may adjust this depending on the size of the class). Once you have a project in mind, discuss it with Eugene in OH well ahead of the proposal deadline. We have also included a list of possible projects at the end of this document (you are not required to choose from these).
The primary requirement is that the project is related to something we have studied in this class and that it contains some element of research – e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.
Good class projects can vary dramatically in complexity, scope, and topic. The goal of the class is for you to experience the process of conducting research. Thus we care more about the “structural” parts of the research – is the related work section thorough? Is the project well-differentiated from prior work? Is it based on testable hypotheses with well-reasoned metrics? Does the evaluation actually answer the hypotheses? – and less about the “interestingness” of the project.
Choosing a research problem is very difficult, especially if you have not done so before. You may end up thinking of, and discarding, many possibilities before finding the project you ultimately work on. Have a fuzzy idea? Want some feedback or help brainstorming a project? Come to office hours. We are here to help!
Finally, if you are a graduate student or have an existing research project, “reusing” or “adapting” it to this course’s topics is encouraged.
Your research proposal will contain an overview of the research problem, your hypothesis, a first pass at related work, a description of how you plan to complete the project, and metrics to decide whether it worked.
We have set up a template for your proposal on Overleaf. Clone it into your team’s account to edit it. Make sure to change the title and author names, and include your team members’ UNIs.
The proposal will be evaluated based on the quality of the hypothesis – is it testable? Does it ask a question that is well motivated by user needs and/or related work? Is it achievable?
Submission
Your group will schedule 20 minutes to meet with Professor Wu to go over the project’s progress and receive feedback. The first 5 minutes will consist of a short presentation with 4 slides (roughly 1 minute per slide). The rest of the time will be open discussion and questions.
Slides should cover:
Submission
You will submit a draft of your paper that should be between 4 and 6 pages. Please use the Overleaf report template to get started. It already contains a scaffold of sections and suggestions for what to include in each section. Your report is not beholden to these sections, so take the template as a starting point.
You will notice that the structure is different from the proposal, as there is more emphasis on the technical details and related work. Since this is a draft, the submission is not expected to be fully complete. However, you should have a fleshed out introduction, related work, and technical overview.
You should have a complete related work section, and have identified how the main relevant papers are similar to and different from your project. Section 3 in the Battle and Scheidegger survey describes a way to systematically look for related work. A quick summary:
P = initial set of papers, found by:
    keyword searching on Google Scholar / Semantic Scholar, and
    asking people
for each paper p in P:
    look at p's related work and references; add relevant papers to P
    look at papers that cite p:
        on Google Scholar, search for p and click "Cited by"
        on Semantic Scholar, open p's page and read the Citations section
You should have a clear description of the technical details and anticipated potential issues, but you may not have started implementation. You do not need to have run experiments yet, but you should have a set of hypotheses as well as a potential experiment setup (which may change). If you have preliminary findings, that’s great! Please include those.
Tips:
Evaluation: your draft will primarily be evaluated by
Submission
Your team will prepare and present a project poster at the end-of-course showcase session. This gives you an opportunity to give a short presentation and demo of your work and show what you have accomplished in the class!
Your presentation should be polished. Since there is still time until the final report, you are encouraged to also discuss ideas or challenges you are still considering.
Since you are presenting to your peers as well, make sure you give your colleagues enough context to understand your ideas. In addition to what you did, help your colleagues understand why you made your specific choices, and provide examples. It’s better to make sure the audience learns a few specific ideas than to try to say everything. Come to office hours or contact the staff if you would like feedback.
Overall logistics
Your presentation should cover (in content, not necessarily one slide for each point)
Tips!
Submission Instructions
You will revise your draft and submit a conference-style report on your project, with a maximum length of 12 pages. Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week for the writing, and make sure you proofread and edit carefully!
Submission
The following are examples of possible projects – this is by no means a complete list, and you are free to select your own project. In fact, a common source of ideas is to take your experience from another domain and combine it with ideas from human data interaction. Another approach is to take concepts from the papers we read and apply them to another domain. Projects often come in several flavors:
Data interfaces are designed with specific tasks in mind. Yet the multi-level typology paper shows that even characterizing a task can be very challenging, and we are not at the stage where tasks can be specified precisely enough to help with automated interface design. One task that is clear is hypothesis evaluation. If we know the hypothesis that the user wants to test, can a system help guide the user towards answering the hypothesis using available data? This project will help Professor Wu and Professor Remco Chang design and evaluate an initial version of such a system.
Debugging SQL queries is very difficult [1][2]. A possible hypothesis is that there is a mismatch between the declarative SQL language that the user writes in and the underlying step-by-step execution plan. In fact, users often say that they “serialize” the query into steps and debug it one step at a time. The idea of this project is to aid SQL debugging by translating a SQL query into a sequence of data frame statements (perhaps one per cell in a Jupyter notebook) and maintaining a bidirectional mapping, as sketched below. In this way, the user can individually inspect each data frame statement, edit it to fix the bug, and automatically map those edits back to the SQL query. It would also be good to identify the types of queries and edits that can be correctly supported by this approach.
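To make this concrete, here is a minimal sketch of the kind of translation the tool might produce. The query, table, and column names are invented for illustration; the reverse mapping (data frame edits back to SQL) is the hard part left to the project.

import pandas as pd

# Hypothetical input query (all names invented for illustration):
#   SELECT dept, AVG(salary) AS avg_salary
#   FROM employees
#   WHERE hire_year >= 2015
#   GROUP BY dept
#   HAVING AVG(salary) > 80000
employees = pd.DataFrame({
    "dept": ["eng", "eng", "sales"],
    "salary": [95000, 90000, 60000],
    "hire_year": [2016, 2018, 2014],
})

# One data frame statement per SQL clause, so each intermediate
# result can be inspected (and edited) in its own notebook cell.
filtered = employees[employees["hire_year"] >= 2015]                 # WHERE
grouped = filtered.groupby("dept", as_index=False)["salary"].mean()  # GROUP BY + AVG
renamed = grouped.rename(columns={"salary": "avg_salary"})           # SELECT ... AS
result = renamed[renamed["avg_salary"] > 80000]                      # HAVING
print(result)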
Databases perform query optimization by leveraging the high-level semantics of relational algebra/SQL. However, much data analytics code is written imperatively: it contains user-defined functions, or is simply written as for-loops over arrays of objects. Recent work at Columbia has extended the Graal compiler with database-style optimizations – in this way you can write normal code and get database-like optimization benefits. This project will help Professor Wu and Professor Ken Ross design and develop an interactive visualization system on top of this compiler to showcase its performance benefits.
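As a toy illustration of the gap (an invented example, not the Graal work itself): the loop below and the SQL query compute the same result, but only the SQL version exposes semantics that a database optimizer can exploit.

# Imperative version: executed row by row, with no algebraic rewrites.
orders = [{"region": "us", "amount": 10.0},
          {"region": "eu", "amount": 7.5},
          {"region": "us", "amount": 3.0}]
totals = {}
for o in orders:
    if o["amount"] > 5.0:  # a filter, but only implicitly
        totals[o["region"]] = totals.get(o["region"], 0.0) + o["amount"]
print(totals)  # {'us': 10.0, 'eu': 7.5}

# Declarative version of the same computation, where the optimizer is
# free to reorder, parallelize, or push down the predicate:
#   SELECT region, SUM(amount) FROM orders
#   WHERE amount > 5.0 GROUP BY region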
Lineage tracks the correspondence between individual database records and the objects shown on the screen. There have been many papers in the past that use lineage for all sorts of interesting applications, or propose ways to use lineage. Unfortunately, actually capturing lineage has historically been very expensive.
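As a concrete, simplified illustration of what lineage means, the pandas sketch below captures backward lineage for a group-by aggregate by hand; the point of an engine-level approach is to get this mapping without the manual bookkeeping or overhead.

import pandas as pd

df = pd.DataFrame({"dept": ["eng", "eng", "sales"],
                   "salary": [95000, 90000, 60000]})

# The aggregate shown on screen: one output row per department.
out = df.groupby("dept")["salary"].mean()

# Backward lineage: which input rows produced each output row?
# groupby(...).indices maps each group key to input row positions.
lineage = df.groupby("dept").indices
print(lineage)  # {'eng': array([0, 1]), 'sales': array([2])}

# E.g., clicking the 'eng' bar in a chart can use lineage to fetch
# the underlying input records:
print(df.iloc[lineage["eng"]])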
Our group has built SmokedDuck, the first database engine that tracks lineage with negligible overhead. This makes it possible to use and operationalize lineage in interesting ways. A potential project is to read a “lineage application” paper, distill its ideas down, and reproduce/“modernize” them on top of SmokedDuck. This can mean making a previous work super fast and interactive, or creating a re-usable library/system. Some example papers include:
Another project is to explore the relationship between lineage and coordinated views more deeply. What could an easy-to-use language or library for defining view coordination look like? The recent Nebula paper attempts to taxonomize coordination, but is fairly complex. Could it be easier to separate the data flows (which lineage expresses) from the visual design?
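As one entirely hypothetical starting point, a coordination spec could name only the data flow (selection, lineage mapping, action) and leave the visual design to each view. All names below are invented for illustration.

import pandas as pd

df = pd.DataFrame({"dept": ["eng", "eng", "sales"],
                   "salary": [95000, 90000, 60000]})

class BarView:
    # Toy overview: one bar per dept; remembers backward lineage.
    def __init__(self, data):
        self.marks = data.groupby("dept").indices  # mark -> input rows
    def lineage(self, selected):
        return [i for m in selected for i in self.marks[m]]

class TableView:
    # Toy detail view: just shows the rows it is handed.
    def __init__(self, data):
        self.data = data
    def render(self, rows):
        print(self.data.iloc[rows])

# The spec names only the data flow; encodings live inside the views.
coordination = {"source": "overview", "event": "click",
                "map": "backward_lineage",
                "target": "detail", "action": "filter"}
views = {"overview": BarView(df), "detail": TableView(df)}

def on_event(spec, selected_marks):
    rows = views[spec["source"]].lineage(selected_marks)
    views[spec["target"]].render(rows)

on_event(coordination, ["eng"])  # clicking the 'eng' bar filters the table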
A third project is to port SmokedDuck to JavaScript via the WASM compilation toolchain, and develop a visualization library/system based on this functionality. See Professor Wu for pointers to DuckDB WASM compilation guides.
Comparison is considered one of the three low-level “Why” tasks in Munzner’s task typology. Despite many design guidelines for creating visualizations that aid comparison, it does not exist as an interaction (for instance, in Munzner’s “How” classification). Comparison is challenging because it is a function over the outputs of a data transformation and visualization process, and thus requires understanding the semantics of those outputs. Our lab has recently developed the first interactive comparison technique for Tableau-like visualizations, built on top of a novel language called View Composition Algebra (VCA). Potential projects are to expand its functionality beyond simple group-aggregation queries, to support different design considerations, or to explore how best to integrate the algebra into a visualization library. See Professor Wu for details and a copy of the VCA paper.
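The algebra itself is in the paper; purely as intuition, a comparison operator aligns the outputs of two views on their shared attributes and combines them, e.g.:

import pandas as pd

sales = pd.DataFrame({"year": [2020, 2020, 2021, 2021],
                      "region": ["us", "eu", "us", "eu"],
                      "amount": [10.0, 7.0, 13.0, 6.0]})

# Two "views": the same group-by aggregate over two subsets of the data.
a = sales[sales["year"] == 2020].groupby("region")["amount"].sum()
b = sales[sales["year"] == 2021].groupby("region")["amount"].sum()

# Align on the shared attribute (region) and combine by subtraction.
print(b.sub(a, fill_value=0))  # eu: -1.0, us: 3.0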
There are many ideas from the world of web design that we can bring to data interfaces. As one example, there are many tools to difference versions of a web design or an image; however, what would differencing versions of an interactive visualization look like? Willett’s paper on design hand-offs discusses the challenges due to potential changes in the data processing and/or visual design layers. Given two versions of a visualization (along with the code and data), explore the design space to help the user understand what changed.
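As a starting point, if both versions are described by declarative specs (e.g., JSON in the style of Vega-Lite), even a naive structural diff surfaces what changed in the visual design layer; the research questions lie in diffing the data processing layer and presenting the differences visually. A toy diff:

def spec_diff(old, new, path=""):
    # Naive recursive diff of two JSON-like visualization specs.
    if old == new:
        return []
    if isinstance(old, dict) and isinstance(new, dict):
        diffs = []
        for key in sorted(set(old) | set(new)):
            diffs += spec_diff(old.get(key), new.get(key), f"{path}.{key}")
        return diffs
    return [(path, old, new)]

v1 = {"mark": "bar", "encoding": {"x": "month", "y": "sales"}}
v2 = {"mark": "line", "encoding": {"x": "month", "y": "profit"}}
for path, old, new in spec_diff(v1, v2):
    print(f"{path}: {old!r} -> {new!r}")
# .encoding.y: 'sales' -> 'profit'
# .mark: 'bar' -> 'line'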
Datasets in the wild often come with a text or PDF file that describes the data. For instance, the UCI data repository, Kaggle, and government data often include a file that describes the schema, columns, and codes. Given a data file (say, in a standardized format) and a metadata file (say, in text), can you automatically attribute and annotate the data file with the relevant metadata, and design a dataset exploration interface that better surfaces this information?
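A minimal baseline (the column names and codebook lines below are invented) might fuzzily match column names against field names in the metadata file, leaving plenty of room for smarter extraction and interface design:

import difflib

# Invented example: columns from a data file, lines from a text codebook.
columns = ["hh_income", "edu_level", "age"]
codebook = ["household_income: total household income in USD",
            "education_level: highest education level, coded 1-5",
            "age: respondent age in years"]

# Baseline attribution: fuzzily match each column to a codebook field
# (the text before the ':').
fields = {line.split(":")[0]: line for line in codebook}
for col in columns:
    best = difflib.get_close_matches(col, list(fields), n=1)
    print(col, "->", fields[best[0]] if best else "(no match)")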
Precision Interfaces, or PI2, (video here) is a project to generate interactive visualizations from a small number of example SQL queries. Below are a number of promising extensions to the project. If interested, contact Yiru Chen directly.
The goal of this project is to integrate PI2 into Jupyter notebook, explore the interaction designs for using PI2 in Jupyter, and conduct a user study to understand the pros and cons of an automatic interface generation tool as part of data analysis.
PI2 uses a fixed cost model to generate interfaces. However, users may have preferences in terms of the interface design, visualization types, widgets, or layout. For instance, they may like some views and widgets, but want to re-generate the rest; or they may like the layout but not the chosen interactions; or they may feel that PI2 over-generalized certain parts of the interface. Explore the design space to help users specify these preferences, and extend the PI2 interface generation algorithm to support this customization.
PI2 currently assumes SQL queries as input. However, writing example SQL queries can be tedious if the user wants to quickly author a new interface. Perhaps there is a concise SQL-like templating language that can be used in lieu of the current input sequence of SQL queries; a hypothetical sketch is below. Alternatively, it would be good to support data frame programs, since they are already very similar to SQL.
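One possible shape for such a language (the syntax below is entirely hypothetical): a single SQL string with holes, where each hole lists the values a widget could take, and expansion enumerates the example queries PI2 already accepts.

from itertools import product

# Entirely hypothetical template syntax and schema.
template = ("SELECT {group}, AVG(price) FROM listings "
            "WHERE beds >= {beds} GROUP BY {group}")
holes = {"group": ["city", "room_type"], "beds": [1, 2, 3]}

# Expand the template into the cross product of hole values, yielding
# the concrete example queries that PI2 currently expects as input.
queries = [template.format(**dict(zip(holes, combo)))
           for combo in product(*holes.values())]
for q in queries:
    print(q)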
Benchmarking interactive visualization systems is very hard, and the current benchmark, IDEBench, is limited to 4 simple interactions and visualizations that it uses to generate a workload. In contrast, PI2 can generate an arbitrary number of visualization interfaces automatically! Explore how we can use PI2 and its generated interfaces to create a new benchmark.
TBA
How are data and analyses referred to and described in scientific work? When data is presented as figures or tables, how is it referred to? What are the verbs and nouns? Is there a universal set of ways that figures are described (e.g., in terms of comparisons? in relative terms?)? This can serve as the evidence for a new data analysis language.
arXiv releases dumps of the submitted papers. Identify the papers that include LaTeX source files, and parse the documents to find the charts, captions, and references to those charts in the text.
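A rough first pass over a single .tex file might look like the sketch below; regexes will miss nested braces and custom macros, so treat it as a baseline only.

import re

# Invented example source; a real pipeline would iterate over the .tex
# files in an arXiv source dump.
tex = r"""
\begin{figure}
  \includegraphics{latency.pdf}
  \caption{Query latency as data size grows.}
  \label{fig:latency}
\end{figure}
As Figure~\ref{fig:latency} shows, latency grows linearly.
"""

# Captions and labels (naive: assumes no nested braces inside them).
captions = re.findall(r"\\caption\{([^}]*)\}", tex)
labels = re.findall(r"\\label\{([^}]*)\}", tex)
# Lines in the body text that reference a figure.
refs = [line.strip() for line in tex.splitlines()
        if re.search(r"\\ref\{fig:", line)]

print(captions)  # ['Query latency as data size grows.']
print(labels)    # ['fig:latency']
print(refs)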
Related Works