Data Analysis Techniques: Notes, Data, and Other Materials
Here are the notes Prof. Widom used for the presentation on data analysis techniques: DataAnalysisTechniques.pdf.
Here are the data sets and other graphics used in class:
Basic database operations
- Temperatures table: Temps.csv
- Filtering: VeryCold.csv
- Sorting: SouthNorth.csv
- Aggregating: AvgTemp.csv, AvgByState.csv
- Joining: Regions.csv, JoinedTables.csv
- Composing operations: GrandFinale.csv
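The four basic operations listed above can be sketched in plain Python. This is only an illustration: the rows, column names (city, state, latitude, temp), and the 25-degree cutoff are made up, and the actual contents of Temps.csv and Regions.csv may differ.

```python
from collections import defaultdict

# Hypothetical rows; the real Temps.csv columns and values may differ.
temps = [
    {"city": "A", "state": "X", "latitude": 45.0, "temp": 20.0},
    {"city": "B", "state": "X", "latitude": 40.0, "temp": 30.0},
    {"city": "C", "state": "Y", "latitude": 30.0, "temp": 60.0},
]
# Hypothetical stand-in for Regions.csv.
regions = [
    {"state": "X", "region": "North"},
    {"state": "Y", "region": "South"},
]

# Filtering: keep only rows below an assumed "very cold" cutoff.
very_cold = [row for row in temps if row["temp"] < 25.0]

# Sorting: order rows from south to north by latitude.
south_north = sorted(temps, key=lambda row: row["latitude"])

# Aggregating: overall average, then average per state.
avg_temp = sum(row["temp"] for row in temps) / len(temps)
temps_by_state = defaultdict(list)
for row in temps:
    temps_by_state[row["state"]].append(row["temp"])
avg_by_state = {s: sum(v) / len(v) for s, v in temps_by_state.items()}

# Joining: attach each row's region by matching on the state column.
region_of = {r["state"]: r["region"] for r in regions}
joined = [
    {**row, "region": region_of[row["state"]]}
    for row in temps
    if row["state"] in region_of
]
```

Composing operations means feeding the result of one step into the next, for example filtering the joined rows and then averaging them by region.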
Regression
- Temp vs latitude: TempVsLat.pdf, TempVsLatRegression.png
- Temp vs longitude: TempVsLong.pdf, TempVsLongRegression.png
- Underfitting, overfitting, limitations: UnderOverFitting.png, AnscombesQuartet.svg
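A straight-line fit such as temperature against latitude can be illustrated with ordinary least squares. The sketch below is not taken from the lecture: the function name fit_line and the sample points are hypothetical, and the notes may use a different fitting method.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (latitude, temperature) points, not values from the class data.
latitudes = [30.0, 35.0, 40.0, 45.0]
temperatures = [40.0, 30.0, 20.0, 10.0]
slope, intercept = fit_line(latitudes, temperatures)

# Predict the temperature at a new latitude from the fitted line.
predicted = slope * 42.0 + intercept
```

A single line like this can underfit data that curves, while a very flexible curve can overfit by chasing noise; Anscombe's quartet shows that very different data sets can yield the same fitted line.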
Classification
- K nearest neighbors (KNN)
  - Temperature example: TempsCat.csv, LatLongScatter.png
  - Use for regression: LatLongScatterTemps.png
- Decision tree classifier
  - Temperature example - predict category from feature: CatNoTemps.csv, CatNoTempsSorted.csv
- Naive Bayes: probabilistic
- Underfitting and overfitting in classification
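As an illustration of K nearest neighbors, the sketch below labels a new point by majority vote among the k closest training points, using straight-line distance on (latitude, longitude). The function name knn_predict, the category labels, and the coordinates are hypothetical and do not come from TempsCat.csv.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training items by Euclidean distance to the query point.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up ((latitude, longitude), category) pairs for illustration only.
training = [
    ((45.0, -90.0), "cold"),
    ((44.0, -93.0), "cold"),
    ((47.0, -100.0), "cold"),
    ((30.0, -90.0), "warm"),
    ((29.0, -95.0), "warm"),
    ((33.0, -84.0), "warm"),
]

northern = knn_predict(training, (43.0, -92.0), k=3)
southern = knn_predict(training, (31.0, -91.0), k=3)
```

A very small k tends to overfit by following individual noisy points, while a very large k tends to underfit by averaging over most of the data. To use the same idea for regression, the vote can be replaced with the average temperature of the k neighbors.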
Clustering
- Temperature example - cluster cities into six groups based on latitude/longitude: Points.png, Clusters.png, ClusterMeans.png
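The file name ClusterMeans.png suggests a means-based method, but the page does not name the algorithm, so treating it as k-means is an assumption. Under that assumption, a minimal sketch looks like this, with a hypothetical function name kmeans, a naive choice of starting means, and made-up points in two groups rather than six.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; return (means, clusters)."""
    # Naive initialization: use the first k points as starting means.
    means = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to the average of its cluster.
        # An empty cluster keeps its previous mean.
        means = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
    return means, clusters


# Made-up 2-D points in two obvious groups, for illustration only.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
means, clusters = kmeans(points, k=2)
```

The result depends on the starting means, so practical implementations usually try several random starts and keep the best grouping.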
Here are some additional readings covering many of the techniques:
- Data Mining
- Regression
- Classification
- Clustering