Update logs: (This handout will be updated as we iron out the logistics for the final project.)
Jump to: Important Dates, Learning Goals, Project Types, Project Requirements, Stages, Rubric
The final project is a way for you to explore and gather insights from some dataset(s) of your choosing as an actual data scientist would.
In groups of four, you will extract, process, analyze and create visualizations of your data and present the insights from your work at a symposium at the end of the semester.
Along the way, you will deliver small reports on aspects of your project. Each group will be assigned a mentor TA who will evaluate and guide the team. Group proposals and final presentations will be given to Lorenzo and the HTAs.
The goal of the project is for each group to properly define a hypothesis or engineering project and explore
a dataset fully.
Generally speaking, projects fall into one of two categories: hypothesis-testing and prediction. We provide a general rubric meant to encompass both types of projects, but keep in mind that you will be evaluated on the appropriateness of the tools you use, given your stated goal. E.g. projects which aim to test a hypothesis should use different ML/stats techniques than projects which aim to make predictions. The expectation is that you present a coherent, motivated final product.
The goal of the project is to make a generalizable claim about the world. Your hypothesis should be based on some prior observation or informed intuition. The best projects in this category are motivated by existing theories, evidence, or models from other scientific disciplines (e.g. public health, political science, sociology). Before pitching the project, you should ask yourself: Why is this hypothesis interesting? Who would care about the conclusions of my study? How would these findings potentially affect our understanding of the world at large? Successful projects from last year which fall into this category are below:
The goal of the project is to make accurate predictions about future/unseen events. The best projects in this category are motivated by useful commercial or social applications. "If we could predict X, we'd be able to do Y better." Before pitching the project, you should put yourself in the role of a data science consultant and ask yourself: Who would be hiring me to do this project? Why would the stakeholders care about my ability to do this well? What is the realistic setting in which my predictions will be used, and can I make my evaluation match that setting? Successful projects from last year which fall into this category are below:
Groups which are doing capstone can do either a hypothesis-testing or a prediction project, but will be required to have a substantial engineering and UI/interactive component. The project pitch should define a clear system or software product spec. The end project should include something "demoable", e.g. a web application or local application that interacts with users in real time.
Note that every project will be required to meet a minimal threshold in each of three components: a data component, an ML/stats component, and a visualization component. But not every project will invest equally in all three sections. Projects which take on more ambitious data scraping/cleaning efforts will be forgiven for having skimpier visualizations. Projects with ambitious visualizations or UIs will be forgiven if their statistics are more basic. But the project still needs to be complete--you have to understand your data, and be able to answer intelligently when asked about the limitations of your methods and/or findings.
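For prediction projects in particular, one concrete way to make your evaluation match the realistic setting mentioned above is to split your data the way it would be split in deployment, e.g. chronologically when you are forecasting future events. Below is a minimal sketch of that idea; the file name, column names, and model choice are all hypothetical placeholders, not requirements.

```python
# A minimal sketch (not a required approach) of evaluating a prediction project
# in a way that mirrors its realistic setting: train on past events, test on
# future ones. The file name and all column names below are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("events.csv", parse_dates=["date"]).sort_values("date")

cutoff = df["date"].quantile(0.8)          # earliest 80% of events used for training
train, test = df[df["date"] <= cutoff], df[df["date"] > cutoff]

features = ["feature_1", "feature_2"]      # hypothetical predictor columns
model = LogisticRegression().fit(train[features], train["label"])
preds = model.predict(test[features])
print("held-out accuracy:", accuracy_score(test["label"], preds))
```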
Each group will meet with their assigned mentor TA to discuss their project proposal. Groups should be prepared to talk about their project ideas and discuss the questions outlined below.
The final project proposal will need to contain a summary of the following:
After you have had the mentor check-in to discuss the project proposal, you should have an idea of how you'll collect and clean your data. This will vary from project to project. If you are scraping, you should write and run your scraper for the data deliverable (exceptions apply if your project requires the scraper to run continuously, e.g. to get updated data throughout the semester). If you are using existing datasets, you should clean, join, or organize anything you need to. In either case, you should submit the data in a clean, accessible format such as a SQL database, a Pandas DataFrame, JSON, or a similar datastore, so that you are in a position to begin analysis and/or modeling.
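As one possible workflow, you might clean a scraped CSV with Pandas and store the result in a SQLite database so your analysis starts from a single tidy table. This is only a sketch; the file names, column names, and paths below are hypothetical.

```python
# A minimal sketch of one possible data-deliverable workflow: clean a scraped
# CSV with Pandas and store the result in SQLite. File names, column names,
# and paths are hypothetical placeholders.
import sqlite3
import pandas as pd

raw = pd.read_csv("data_deliverable/raw/scraped_listings.csv")

clean = (
    raw.drop_duplicates()
       .dropna(subset=["city", "price"])                               # require key fields
       .assign(price=lambda d: pd.to_numeric(d["price"], errors="coerce"))
       .dropna(subset=["price"])                                       # drop unparseable prices
)

# One tidy table that the analysis deliverable can query directly.
with sqlite3.connect("data_deliverable/data/project.db") as conn:
    clean.to_sql("listings", conn, if_exists="replace", index=False)
```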
Along with your data, you should submit a Socio-historical Context and Impact Report for your project. Instructions for this report are described in detail below.
Concretely, you should have the following on your submission:

- A tech report (a README.md file located at data_deliverable/reports/tech_report in your GitHub repo), answering the following questions:
- A README.md file for your Socio-historical Context and Impact Report, located in your GitHub repo. Instructions for this report are described in detail below. The goal of this report is to:
Your Socio-historical Context and Impact Report should include two sections: socio-historical context and ethical considerations. Each section should be about 1-2 paragraphs. Below, we've listed guiding questions that you should address in each section. Not all questions may be relevant to all projects.
This report will require you to research information beyond your data. Consult at least 3 outside sources. List your sources in a works cited section at the end of your report. Your sources and works cited section can be in any format.
Please reach out to the STAs (Lena and Gaurav) via Piazza if you’re struggling with finding relevant information for your project. We’ll release a guide soon about how to approach research and how we’ll grade these reports.
Research and describe how your project interacts with its socio-historical context. This section should address at least 3 out of the 4 bullet points below.
Discuss a few ethical and societal issues related to your data and analysis. This section should address at least 4 out of the 6 bullet points below. Focus on the questions that feel most relevant to your project.
Now that you have your data ready, the group will use statistics or machine learning techniques to answer one of their primary questions. Coupled with this, you are expected to create a visualization of this result. The visualization may be a table, graph, or some other means of conveying the result of the question. We expect that you will not be able to provide a complete solution for any prediction or hypothesis goal with a single ML or statistical model. Concretely, you should have the following on your project page:
- A visualization of your results, saved in the /analysis_deliverable/visualizations folder in your repo. Depending on your model/test/project, we would ideally like you to show us your full process so we can evaluate how you conducted the test!

You are required to complete:
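As a rough end-to-end illustration of what one analysis step plus its visualization might look like (not a template for the required items), here is a minimal sketch. The database, table, columns, and city names are hypothetical, and your project may call for entirely different techniques.

```python
# A minimal sketch of a single analysis step: one hypothesis test plus one
# visualization saved to the required folder. The database, table, columns,
# and city names are hypothetical, and your project may need other techniques.
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

with sqlite3.connect("data_deliverable/data/project.db") as conn:
    df = pd.read_sql("SELECT city, price FROM listings", conn)

# Example question: do listing prices differ between two cities?
a = df.loc[df["city"] == "Providence", "price"]
b = df.loc[df["city"] == "Boston", "price"]
t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

# Save the supporting visualization where the handout expects it.
fig, ax = plt.subplots()
ax.boxplot([a, b], labels=["Providence", "Boston"])
ax.set_ylabel("price")
fig.savefig("analysis_deliverable/visualizations/price_by_city.png", dpi=200)
```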
We want every group to write a summary of their work which could be sent out as a one-page standalone document. We provide two example formats below and encourage you to follow these templates exactly. The goals here are:
Your summary should be handed in as abstract.pdf.
Here are examples we have for hypothesis and prediction projects.
If you are completing the capstone requirements, you should hand in the code for your interactive tool with your poster. To reiterate, your interactive element can be anything that takes user input and updates/adjusts. It can be a command-line tool, web app, D3 visualization, or any other interface (a bare-bones command-line sketch is given below). You will be expected to demo this element as part of your presentation.
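The sketch below shows one way a bare-bones command-line interactive element could work: it takes user input, runs it through a previously trained model, and prints an updated prediction. The model path and features are hypothetical placeholders.

```python
# A bare-bones sketch of a command-line interactive element: it takes user
# input, runs it through a previously trained model, and prints an updated
# prediction. The model path and features are hypothetical placeholders.
import joblib  # assumes the trained scikit-learn model was saved with joblib.dump

model = joblib.load("analysis_deliverable/model.joblib")

print("Enter square footage and number of bedrooms (blank line to quit).")
while True:
    line = input("> ").strip()
    if not line:
        break
    try:
        sqft, bedrooms = map(float, line.split())
    except ValueError:
        print("Please enter two numbers, e.g. '850 2'.")
        continue
    prediction = model.predict([[sqft, bedrooms]])[0]
    print(f"Predicted price: ${prediction:,.0f}")
```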
Due to the logistics of the semester, we will not have live presentations of the projects this year. Instead, we ask you to submit a video recording of your presentation.
Score | Description |
---|---|
60 | Big problems, e.g. missing large portions of the deliverable. |
70 | The team's analysis suggests misunderstanding of the dataset and/or problem, usually in one of the following ways: 1) the methods chosen do not make sense given the stated goal of the problem (e.g. proposing to test a hypothesis but then focusing effort on training a classifier); 2) the techniques used are significantly misused (e.g. testing on training data, interpreting r^2 as evidence for/against a hypothesis); 3) presentation of the results raises many questions that go unanswered (see examples listed in the description of 75); 4) discussion of results/descriptions of methods are flatly incorrect; 5) other errors that similarly indicate a lack of understanding. |
75 | The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test the hypothesis/demonstrate model performance, but some discussion or presentation of the results raises many questions that go unanswered. E.g. charts show gaps/skews/outliers that are not discussed or addressed, metrics take extreme values that are not explained, results behave unintuitively (test accuracy > train accuracy) with no comment made or explanation provided, p-values/coefficients are misinterpreted, etc. Discussion overall is weak. |
80 | The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test the hypothesis/demonstrate model performance, but the rationale isn't entirely clear or convincing. Results are not clearly discussed in relation to the overall goal of the project, and/or the discussion feels "out of the box". The team has a plan for where to go next. |
85 | The team is familiar with the data and the problem, and knows how to apply applicable techniques from class to test the hypothesis/demonstrate model performance, with rationale as to why those techniques were chosen. Results are related to the overall goal of the project, but the discussion feels "out of the box" and/or lacks depth. The team has a good plan for where to go next. |
90 | The team understands the data and the problem, and knows how to run the first, most logical test needed to test the hypothesis/demonstrate model performance. Results are interpreted in relation to the overall goal of the project, but maybe not as deeply as would be ideal (e.g. some outliers/data quirks are not commented on). Interpretation is precise and scientific, and claims are evidence-based or hedged appropriately. The team has a good, informed plan for where to go next. |
95 | Clear demonstration that the team understands the data and the problem, and knows how to run not just the first, most logical test, but also a logical follow-up (including motivation for the follow-up analysis and refinement of the question when needed). Obvious weirdness in the data (e.g. outliers, skews) is noted and explained. Results are interpreted "one level up" from the literal output of the test, and discussed in relation to the overall goal of the project. Interpretation is precise and scientific, and claims are evidence-based or hedged appropriately. Good plan for next steps. |
Below is an approximate breakdown of the weight of the project deliverable components. However, your project will be graded holistically. If your final product is A-level, you can receive an A on the project overall, even if earlier components were weak. (The point of early feedback is to learn and grow, after all.) Focus on doing good work, and you will be fine.
Points | Component | Grader |
---|---|---|
5 | Project Proposal | Lorenzo + HTAs |
10 | Data Deliverable | Lorenzo + HTAs |
20 | Analysis Deliverable | Lorenzo + HTAs |
50 | Poster Presentation | Lorenzo + HTAs |
15 | Mentor TA Evaluation | Mentor TA |