Important Dates

  • Groups + Proposals Due: 11:59pm 02/09/2021 ET (Handin: Form)
  • TA Check-in (proposal): Feb. 12-14, 2021
  • Data Deliverable Due: 11:59pm 03/07/2021 ET
  • TA Feedback (data): Mar. 8-10, 2021
  • Analysis Deliverable Due: 11:59pm 04/10/2021 ET
  • TA Feedback (analysis): Apr. 11-13, 2021
  • Final Deliverable + Poster Due: 11:59pm 04/18/2021 ET


Learning Goals

The final project is a way for you to explore and gather insights from some dataset(s) of your choosing, as an actual data scientist would. In groups of four, you will extract, process, analyze, and visualize your data, and present the insights from your work at a symposium at the end of the semester. Along the way, you will deliver small reports on aspects of your project. Each group will be assigned a mentor TA who will evaluate and guide the team. Group proposals and final presentations will be given to Lorenzo and the HTAs.

The goal of the project is for each group to properly define a hypothesis or engineering project and explore a dataset fully.


Project Types

Generally speaking, projects fall into one of two categories: hypothesis-testing and prediction. We provide a general rubric meant to encompass both types of projects, but keep in mind that you will be evaluated on the appropriateness of the tools you use, given your stated goal. For example, projects that aim to test a hypothesis should use different ML/stats techniques than projects that aim to make predictions. The expectation is that you present a coherent, motivated final product.


Hypothesis-Testing Projects

The goal of the project is to make a generalizable claim about the world. Your hypothesis should be based on some prior observation or informed intuition. The best projects in this category are motivated by existing theories, evidence, or models from other scientific disciplines (e.g. public health, political science, sociology). Before pitching the project, you should ask yourself: Why is this hypothesis interesting? Who would care about the conclusions of my study? How would these findings potentially affect our understanding of the world at large? Successful projects from last year which fall into this category are below:


Prediction Projects

The goal of the project is to make accurate predictions about future/unseen events. The best projects in this category are motivated by useful commercial or social applications. "If we could predict X, we'd be able to do Y better." Before pitching the project, you should put yourself in the role of a data science consultant and ask yourself: Who would be hiring me to do this project? Why would the stakeholders care about my ability to do this well? What is the realistic setting in which my predictions will be used, and can I make my evaluation match that setting? Successful projects from last year which fall into this category are below:

  • RideShare Analysis tried to predict, for a new city with no bike share program, where the best locations for new bike hubs would be, and how demand would change as a function of e.g. weather or day.
  • Music Recommendation Systems (there were several) tried to generate playlists based on individuals' past music preferences and/or natural language descriptions of their tastes.

Capstone Projects

Groups which are doing capstone can do either a hypothesis-testing or a prediction project, but will be required to have a substantial engineering and UI/interactive component. The project pitch should define a clear system or software product spec. The end project should include something "demoable", e.g. a web application or local application that interacts with users in real time.


Project Requirements

Note that every project will be required to meet a minimal threshold in each of three components: a data component, an ML/stats component, and a visualization component. But not every project will invest equally in all three sections. Projects which take on more ambitious data scraping/cleaning efforts will be forgiven for having skimpier visualizations. Projects with ambitious visualizations or UIs will be forgiven if their statistics are more basic. But the project still needs to be complete--you have to understand your data, and be able to answer intelligently when asked about the limitations of your methods and/or findings.

  • Group of 4 students
  • Project must have a clear hypothesis or prediction goal, and have the ability to answer the questions laid out in the section above.
  • Each project must create a database, and perform some non-trivial data collection and cleaning in order to do so. You may either scrape your own data or use existing data/APIs. If you use existing data/APIs, you will be required to join at least two datasets to create your database (see the join sketch after this list).
  • Each project must include a minimum of 5 analysis aspects:
    • Use at least two machine learning or statistical analysis techniques to analyze your data, explain what you did, and talk about the inferences you uncovered.
    • Provide at least two distinct visualizations of your data or final results. This means two different techniques: if you use a bar chart to analyze one aspect of your data, you may use bar charts again, but the second use will not count as a distinct visualization.
    • The fifth aspect can be either an additional stats/ML technique or an additional visualization.
  • Every project will be hosted on the course website. Each deliverable will be posted as a link on the page and final posters will be shown there.
  • Each group will be paired with another group; the paired groups will ask each other questions and give feedback later in the semester.
  • Ethical Considerations:
    • Throughout the final project process, you will be expected to think critically and write about where your data is coming from, how you're analyzing and visualizing your data, and potential positive or negative consequences of your results.
  • Capstone Requirements:
    • All group members must agree to be held to that capstone standard. Even if not everyone in the group is taking the course as a capstone, the entire project will receive the capstone evaluation and be graded accordingly.
    • If you choose to use this course as a capstone, you will extend your project to have a full-fledged web application with an interactive component. For example, previous capstones have included web UIs for plotting roadtrips across the United States and restaurant recommendation apps.
  • You will be given a Github Classroom repository that you will use with your team for the entire project, which will be available shortly after the proposals are due. To submit things for the final project, you just need to push your changes before the deadline. In the repository, we will include stencil HTML files that you can use for your write ups for each deliverable.
  • Please see each section below for more details on each deliverable!
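For groups building their database from existing datasets, the join requirement above usually amounts to a merge on a shared key. Here is a minimal sketch, assuming two hypothetical CSV files (stations.csv and trips.csv) that share a station_id column; the file and column names are illustrative, not part of the assignment.

```python
import pandas as pd

# Two hypothetical source datasets (file and column names are placeholders).
stations = pd.read_csv("stations.csv")  # e.g. station_id, lat, lon, capacity
trips = pd.read_csv("trips.csv")        # e.g. trip_id, station_id, start_time

# Join on the shared key to form the project database. An inner join keeps
# only trips whose station appears in both files; count what gets dropped
# so the loss is documented rather than silent.
merged = trips.merge(stations, on="station_id", how="inner")
print(f"kept {len(merged)} of {len(trips)} trips after the join")

merged.to_csv("database.csv", index=False)  # persist for the data deliverable
```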

Stages

Proposal + Mentor TA Check-in

Final Project Proposal due: 11:59pm 02/02/2021 ET

TA Check-in: Feb. 5-7

Handin: Form

Each group will meet with their assigned mentor TA to discuss their project proposal. Groups should be prepared to talk about their project ideas and discuss the questions outlined below.

The final project proposal will need to contain a summary of the following:

  • What is the project goal? If a hypothesis-testing project: what is the hypothesis, what is the motivation behind it? If a prediction project: what are you trying to predict, who would be the stakeholders/who cares if you do this well, and how do you measure your success?
  • Where do you plan to get your data? It's okay if this is not a sure-thing, as long as you have an idea of where to start.
  • What are the major technical challenges you anticipate?
  • What ethical problems could you foresee arising (either in the course of doing the project or, if you succeed, in the existence of the technology/result itself)?
After the TA check-in, the group should make any necessary adjustments and re-submit the final project proposal in the handin form.

Data Deliverable + TA Feedback

Data due: 11:59pm 03/07/2021 ET

TA Feedback: Mar. 8-10

Handin: Github

After you have had the mentor check-in to discuss the project proposal, you should have an idea of how you'll collect and clean your data. This will vary from project to project. If you are scraping, you should write and run your scraper for the data deliverable (exceptions apply if your project requires the scraper to run continuously, e.g. to get updated data throughout the semester). If you are using existing datasets, you should clean, join, or organize them as needed. In either case, you should store your data in some cleanly accessible form, e.g. a SQL database, Pandas DataFrame, JSON file, or similar datastore, so that you are in a position to begin analysis and/or modeling.
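What a "clean access method" looks like will differ by project. As one possibility, here is a minimal sketch that loads a raw CSV, applies a few common cleaning steps, and writes the result into a SQLite database; the file, table, and column names are placeholders to adapt.

```python
import sqlite3

import pandas as pd

raw = pd.read_csv("raw_data.csv")  # placeholder for your scraped/downloaded data

# Typical cleaning steps -- keep whatever you do in sync with your tech report.
raw = raw.drop_duplicates()
raw = raw.dropna(subset=["id"])  # drop rows missing the key field
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")  # fix type issues

# Store the cleaned table in SQLite so analysis code can query it with SQL.
conn = sqlite3.connect("project.db")
raw.to_sql("observations", conn, if_exists="replace", index=False)
conn.close()
```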

Along with your data, you should submit a Socio-historical Context and Impact Report for your project. Instructions for this report are described in detail below.

Concretely, you should have the following on your submission:

  • A complete data spec describing your collected data. This can be in the form of a README/simple text file, but it should describe the full data format, including assumptions about data types, assumptions about keys and cross-references, whether fields are "required" or optional, etc. For example, this is a good README for data in CSV format (a description of the corresponding data is here), and this repo has a good example of a README for data in JSON format.
  • A link to your full data in downloadable form. Any distribution method is okay, pick what makes sense for your project. E.g. Google drive, DropBox, GitHub, or link from personal website are all fine.
  • A sample of your data (e.g. 10 - 100 rows) that we can easily open and view on our computers.
  • A concise tech report (written in the README.md file located at data_deliverable/reports/tech_report in your GitHub repo), answering the following questions:
    • Where is the data from?
      • How did you collect your data?
      • Is the source reputable?
      • How did you generate the sample? Is it comparably small or large? Is it representative or is it likely to exhibit some kind of sampling bias?
      • Are there any other considerations you took into account when collecting your data? This is open-ended based on your data; feel free to leave this blank. (Example: If it's user data, is it public/are they consenting to have their data used? Is the data potentially skewed in any direction?)
    • How clean is the data? Does this data contain what you need in order to complete the project you proposed to do? (Each team will have to go about answering this question differently, but use the following questions as a guide. Graphs and tables are highly encouraged if they allow you to answer these questions more succinctly. A short profiling sketch appears at the end of this deliverable's instructions.)
      • How many data points are there total? How many are there in each group you care about (e.g. if you are dividing your data into positive/negative examples, are they split evenly)? Do you think this is enough data to perform your analysis later on?
      • Are there missing values? Do these occur in fields that are important for your project's goals?
      • Are there duplicates? Do these occur in fields that are important for your project's goals?
      • How is the data distributed? Is it uniform or skewed? Are there outliers? What are the min/max values? (focus on the fields that are most relevant to your project goals)
      • Are there any data type issues (e.g. words in fields that were supposed to be numeric)? Where are these coming from? (E.g. a bug in your scraper? User input?) How will you fix them?
      • Do you need to throw any data away? What data? Why? Any reason this might affect the analyses you are able to run or the conclusions you are able to draw?
    • Summarize any challenges or observations you have made since collecting your data. Then, discuss your next steps and how your data collection has impacted the type of analysis you will perform. (approximately 3-5 sentences)
  • A Socio-historical Context and Impact Report, in the README.md file located in your Github repo. Instructions for this report are described in detail below. The goal of this report is to:
    • Put your data in context with the socio-historical factors surrounding your research questions.
    • Think about how this context should affect the framing of your question, analysis of your data, and interpretation of your results.
    • Critically analyze the potential impact of your project.
  • Socio-historical Context and Impact Report

    Task:

    Your Socio-historical Context and Impact Report should include two sections: socio-historical context and ethical considerations. Each section should be about 1-2 paragraphs. Below, we've listed guiding questions that you should address in each section. Not all questions may be relevant to all projects.

    This report will require you to research information beyond your data. Consult at least 3 outside sources. List your sources in a works cited section at the end of your report. Your sources and works cited section can be in any format.

    Please reach out to the STAs (Lena and Gaurav) via Piazza if you’re struggling with finding relevant information for your project. We’ll release a guide soon about how to approach research and how we’ll grade these reports.

    Socio-historical Context

    Research and describe how your project interacts with its socio-historical context. This section should address at least 3 out of the 4 bullet points below.

    • Research the socio-historical context of your project to identify a few societal factors that could affect your data, prediction goal, and/or hypothesis. These factors might include current or historical policies, events, social conditions, larger societal systems, and more. Describe a few of the broader societal issues and their relationship to your data, prediction goal, and/or hypothesis.
    • Who are the major stakeholders in this project? What is your relationship to these stakeholders? Stakeholders are those who may be affected by or have an effect on your project topic. Some examples of stakeholders are a particular demographic group, residents of a particular geographic area, and people experiencing or at risk for a particular problem. Consider the following questions to help identify stakeholders:
      • Who does this project topic currently affect?
      • Who might be harmed by your research findings?
      • Who might benefit from your research findings?
    • Summarize the most relevant technical or non-technical research that has already been conducted about your project topic. If relevant, what was the societal impact of existing research?
    • Discuss the impact of your socio-historical research findings on your project. Consider the following questions to help identify at least one impact:
      • How does this context affect how you should frame your question?
      • How does this context affect how you should analyze your data?
      • How does this context affect how you should interpret your findings?
      • How does this context affect how you should present your results?

    Ethical Considerations

    Discuss a few ethical and societal issues related to your data and analysis. This section should address at least 4 out of the 6 bullet points below. Focus on the questions that feel most relevant to your project.

    • What kind of underlying historical or societal biases might your data contain? How can this bias be mitigated?
      • Were the systems and processes used to collect the data biased against any groups?
    • What biases might exist in your interpretation of the data?
      • What assumptions might influence your decision-making across the data life cycle (choices about which data sources to use, how to deal with missing data, etc.)?
      • How does your identity, prior knowledge, and perspective inform your analysis?
    • How could an individual or particular community’s privacy be affected by the aggregation or analysis of your data?
    • Is data being used in a manner agreed to by the individuals who provided the data?
    • What are possible misinterpretations or misuses of your project results and what can be done to prevent them?
    • Add your own: if there is an ethical or societal issue about your project you would like to discuss or explain further, feel free to do so.
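Finally, as referenced above, many of the data-quality questions in the tech report (totals, missing values, duplicates, distributions, type issues) can be answered with a few lines of Pandas. Here is a minimal profiling sketch; it assumes your cleaned data is in database.csv with hypothetical label and price columns, so adapt the names to your own schema.

```python
import pandas as pd

df = pd.read_csv("database.csv")  # or load from your SQL/JSON datastore

# How many data points are there, total and per group?
print(len(df))
print(df["label"].value_counts())  # "label" is a hypothetical group column

# Are there missing values or duplicates, and in which fields?
print(df.isna().sum())
print(df.duplicated().sum())

# How is the data distributed? Outliers, min/max values?
print(df.describe())

# Any data type issues, e.g. words in a column that should be numeric?
numeric = pd.to_numeric(df["price"], errors="coerce")  # "price" is hypothetical
print(df[numeric.isna() & df["price"].notna()].head())  # rows with bad values
```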

Analysis + Feedback

Analysis due: 11:59pm 04/10/2021 ET

TA Feedback: Apr. 11-13

Handin: Github

Now that your data is ready, your group will use statistical or machine learning techniques to answer one of your primary questions. Coupled with this, you are expected to create a visualization of the result. The visualization may be a table, a graph, or some other means of conveying the result of the question. We do not expect that a single ML or statistical model can provide a complete solution for any prediction or hypothesis goal. Concretely, you should have the following on your project page (a minimal example pairing a test with a visualization appears after this list):

  • A defined hypothesis or prediction task, with clearly stated metrics for success.
  • Why did you use this statistical test or ML algorithm? Which other tests did you consider or evaluate? How did you measure success or failure? Why that metric/value? What challenges did you face evaluating the model? Did you have to clean or restructure your data?
  • What is your interpretation of the results? Do you accept or reject your hypothesis, or are you satisfied with your prediction accuracy? For prediction projects, we expect you to argue why you got the accuracy/success metric you have. Intuitively, how do you react to the results? Are you confident in the results?
  • For your visualization, why did you pick this graph? What alternative ways might you communicate the result? Were there any challenges visualizing the results, and if so, what were they? Will your visualization require text to provide context, or does it stand alone (either is fine, but it's important to recognize which type your visualization is)?
  • Full results + graphs (at least 1 stats/ml test and at least 1 visualization). You should push your visualizations to the /analysis_deliverable/visualizations folder in your repo. Depending on your model/test/project we would ideally like you to show us your full process so we can evaluate how you conducted the test!
  • If you did a statistics test, are there any confounding trends or variables you might be observing?
  • If you did a machine learning model, why did you choose this machine learning technique? Does your data have any sensitive/protected attributes that could affect your machine learning model?
  • A discussion of how you will visualize/explain your results on a poster, and a discussion of future directions.
NOTE: Please make sure to address every question listed in your report. Failing to do so may result in a low grade.
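As one concrete illustration of pairing a statistical test with a visualization (a sketch, not a prescription; your technique must fit your question), below is a two-sample Welch's t-test with overlaid histograms of the two groups. The CSV file and the group/outcome column names are hypothetical placeholders.

```python
import os

import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

df = pd.read_csv("database.csv")  # hypothetical cleaned dataset
group_a = df[df["group"] == "A"]["outcome"]
group_b = df[df["group"] == "B"]["outcome"]

# Welch's t-test: does the mean outcome differ between the two groups?
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# One visualization of the same result: overlaid histograms.
plt.hist(group_a, bins=30, alpha=0.5, label="group A")
plt.hist(group_b, bins=30, alpha=0.5, label="group B")
plt.xlabel("outcome")
plt.ylabel("count")
plt.legend()

# Save into the repo folder named above so graders can find it.
os.makedirs("analysis_deliverable/visualizations", exist_ok=True)
plt.savefig("analysis_deliverable/visualizations/outcome_by_group.png", dpi=200)
```

Whatever technique you choose, report the test statistic and p-value (or your model's metric) alongside the figure, and interpret them against the success criteria you stated up front.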

Final Presentation Materials

Presentation Video Due: 11:59pm 04/18/2021 ET

Poster Due: 11:59pm 04/18/2021 ET

Handin: Github

Overview

You are required to complete:

  1. PDF Poster (see details below)
  2. One Page Abstract (see details below)
  3. A Recorded Video Presentation (see details below)
  4. Interactive code (only if capstone)

Poster Requirements

  • 42 inches x 31.5 inches
  • PDF format
  • at least 300 dpi
Your poster presentation is a chance for you to share your findings and accomplishments with your peers! When preparing a poster, it is important to remember that although your group has spent a lot of time and effort becoming familiar with the domain knowledge necessary to understand your results, many others in the class do not necessarily share this background. It is important to design your posters to communicate your results in a clear, concise way.

Making a good academic poster that quickly and effectively delivers the key points about your project takes careful planning. Here are some links about making a good academic poster that we found helpful:
  • This link from NYU
  • This comprehensive list of do's and don'ts
  • Two nice examples from last year's final projects: Example 1, Example 2
Your poster should contain:
  • A name for your project
  • The names of your group members
  • Dataset information and collection details
  • Problem statement / hypothesis
  • Methodology
  • Results & visualization
  • Potential significance/ramifications on the field your data is coming from and/or relevant limitations of the analysis
It is important that your presentation tells a good story and focuses on the most interesting aspects of your project. We are evaluating how well you are able to articulate clear claims (positive or negative) and back up those claims with sound data scientific analysis.

Abstract Requirements

We want every group to write a summary of their work which could be sent out as a one-page standalone document. We provide two example formats below and encourage you to follow these templates exactly. The goals here are:

  • Speak to a 3rd party audience who has not seen your project yet
  • Motivate your project goal
  • Clearly define claims or questions
  • Present the answers you found for those claims/questions
This document can be written in LaTeX, Word, or any other medium. It should be titled abstract.pdf. Here are examples we have for hypothesis and prediction projects.
  • Hypothesis Example: here
  • Prediction Example: here
  • Hypothesis Stencil: here
  • Prediction Stencil: here

Capstone Requirements

If you are completing the capstone requirements, you should hand in the code for your interactive tool with your poster. To reiterate, your interactive element can be anything that takes user input and updates/adjusts. It can be a command-line tool, web app, D3 visualization, or any other interface. You will be expected to demo this element as part of your presentation.
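The bar for "interactive" is lower than it may sound. As a minimal sketch, a command-line loop around a trained model already qualifies; the model file, single numeric feature, and joblib usage here are illustrative assumptions, not requirements.

```python
import joblib  # assumes a scikit-learn model was saved with joblib.dump

model = joblib.load("model.joblib")  # placeholder path to your trained model

# Minimal command-line interactivity: read user input, predict, repeat.
while True:
    text = input("Enter a feature value (or 'quit'): ").strip()
    if text.lower() == "quit":
        break
    try:
        x = float(text)
    except ValueError:
        print("Please enter a number.")
        continue
    print(f"Predicted: {model.predict([[x]])[0]}")
```

A web app or D3 front end satisfies the same requirement; what matters is that user input visibly changes the output.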

Presentation

Due to the logistics of the semester, we will not have live presentations of the projects this year. Rather, we ask you to submit a video recording of your presentation.

  • For standard projects, the video should be 10 minutes long.
  • For capstone-level projects, the video can be up to 13 minutes long.
A good poster pitch takes only 4-5 minutes and should use the poster to underline your main points; do not make the judge just read the poster. A good poster presentation is like telling a story and engages the listener. Focus on motivating your project, then define your hypothesis or prediction, explain how you arrived at your metric of success or test, and present your conclusion. Throughout, you should note any outliers, controls, or interesting aspects that contributed to your final project. Be prepared to answer questions from the HTAs + Lorenzo regarding your results, process, and accomplishments.

For capstone-level projects, you will be expected to demo your interactive element as part of your presentation.

If you are unsure how to record the video, consider using the recording feature of Zoom/Google Meet alongside the screen-sharing function.

Rubric

Below is the rubric that will be used to evaluate your project during the final presentation:
  • 60: Big problems, e.g. missing large portions of a deliverable.
  • 70: The team's analysis suggests a misunderstanding of the dataset and/or problem, usually in one of the following ways: 1) the methods chosen do not make sense given the stated goal (e.g. the team proposed to test a hypothesis but then focused effort on training a classifier); 2) the techniques used are significantly misused (e.g. testing on training data, interpreting r^2 as evidence for/against a hypothesis); 3) the presentation of the results raises many questions that go unanswered (see examples listed in the description of 75); 4) discussions of results/descriptions of methods are flatly incorrect; 5) other errors that similarly indicate a lack of understanding.
  • 75: The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test hypotheses/demonstrate model performance, but some discussion or presentation of the results raises many questions that go unanswered. E.g. charts show gaps/skews/outliers that are not discussed or addressed, metrics take extreme values that are not explained, results behave unintuitively (test accuracy > train accuracy) with no comment or explanation, p-values/coefficients are misinterpreted, etc. Discussion overall is weak.
  • 80: The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test hypotheses/demonstrate model performance, but the rationale isn't entirely clear or convincing. Results are not clearly discussed in relation to the overall goal of the project, and/or the discussion feels "out of the box". The team has a plan for where to go next.
  • 85: The team is familiar with the data and the problem, and knows how to apply applicable techniques from class to test hypotheses/demonstrate model performance, with rationale as to why those techniques were chosen. Results are related to the overall goal of the project, but the discussion feels "out of the box" and/or lacks depth. The team has a good plan for where to go next.
  • 90: The team understands the data and the problem, and knows how to run the first most logical test needed to test the hypothesis/demonstrate model performance. Results are interpreted in relation to the overall goal of the project, though perhaps not as deeply as would be ideal (e.g. some outliers/data quirks go uncommented). Interpretation is precise and scientific; claims are evidence-based or hedged appropriately. The team has a good, informed plan for where to go next.
  • 95: Clear demonstration that the team understands the data and the problem, and knows how to run not just the first most logical test but also a logical follow-up (including motivation for the follow-up analysis and refinement of the question when needed). Obvious weirdness in the data (e.g. outliers, skews) is noted and explained. Results are interpreted "one level up" from the literal output of the test and discussed in relation to the overall goal of the project. Interpretation is precise and scientific; claims are evidence-based or hedged appropriately. Good plan for next steps.

Grade Breakdown

Below is an approximate breakdown of the weight of the project deliverable components. However, your project will be graded holistically: if your final product is A-level, you can receive an A on the project overall, even if earlier components were weak. (The point of early feedback is to learn and grow, after all.) Focus on doing good work, and you will be fine.

Points  Component                              Grader
5       Project Proposal                       Lorenzo + HTAs
10      Data Deliverable                       Lorenzo + HTAs
20      Analysis Deliverable                   Lorenzo + HTAs
50      Poster Presentation to Lorenzo + HTAs  Lorenzo + HTAs
15      Mentor TA Evaluation                   Mentor TA