[Photo: HTW Berlin. Fotopedia, CC-BY-NC, Andrea Kirkby, 2008]

HTW Berlin
Fachbereich 4
Internationaler Studiengang
Internationale Medieninformatik (Master)
Semantic Modeling
Summer Term 2016

Lab 10: Nearest Neighbor Learning

  1. Start Weka and load the glass dataset glass.arff. The data comes from the U.S. Forensic Science Service and covers six types of glass. Each glass sample is described by its refractive index and the chemical elements it contains; the aim is to classify the type of glass based on these features.
  2. How many attributes are there in the dataset? What are their names? What is the class attribute? Run the classification algorithm IBk. Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the Generic Object Editor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1; it sets the number of neighboring instances used when classifying.
  3. What is the accuracy of IBk (given in the Classifier Output box)? Run IBk again, but increase the number of neighboring instances to k=5 by entering this value in the KNN field. Continue to use cross-validation. What is the accuracy of IBk with five neighboring instances? (If you prefer to script these runs, see the first sketch after the table in item 6.)
  4. Why is it infeasible to perform an exhaustive search over all possible subsets of attributes?
  5. Apply the backward elimination procedure (described on page 311 in the book). First consider dropping each attribute individually from the full dataset, and run a cross-validation for each reduced version. Once you have determined the best eight-attribute dataset, repeat the procedure with this reduced dataset to find the best seven-attribute dataset, and so on. (One elimination round is sketched in the second code example after the table in item 6.)
  6. Record the best attribute set and the greatest accuracy obtained in each iteration in the table below. The best accuracy obtained in this process is quite a bit higher than the accuracy obtained on the full dataset.

    Subset Size            Attributes in          Classification
    (No. of Attributes)    "Best" Subset          Accuracy
    -------------------    ---------------------  --------------
    9
    8
    7
    6
    5
    4
    3
    2
    1
    0


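    If you would rather script the runs from items 2 and 3 than click through the Explorer, the following is a minimal sketch using Weka's Java API. It assumes weka.jar is on the classpath and glass.arff is in the working directory; the class name and the random seed are arbitrary choices.

      // Minimal sketch: 10-fold cross-validation of IBk on glass.arff.
      import java.util.Random;
      import weka.classifiers.Evaluation;
      import weka.classifiers.lazy.IBk;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      public class IBkGlass {
          public static void main(String[] args) throws Exception {
              Instances data = DataSource.read("glass.arff");
              data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute

              for (int k : new int[] {1, 5}) {                // the KNN field in the GUI
                  IBk ibk = new IBk();
                  ibk.setKNN(k);
                  Evaluation eval = new Evaluation(data);
                  eval.crossValidateModel(ibk, data, 10, new Random(1));
                  System.out.printf("k=%d: %.2f%% correct%n", k, eval.pctCorrect());
              }
          }
      }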
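    One round of the backward elimination from item 5 can be sketched the same way, under the same assumptions: drop each attribute in turn with the Remove filter (attribute numbers are 1-based, as in the Explorer), cross-validate IBk on each reduced dataset, and then repeat manually on the best subset found.

      // Minimal sketch: one backward elimination round on glass.arff.
      import java.util.Random;
      import weka.classifiers.Evaluation;
      import weka.classifiers.lazy.IBk;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;
      import weka.filters.unsupervised.attribute.Remove;

      public class BackwardElimination {
          public static void main(String[] args) throws Exception {
              Instances data = DataSource.read("glass.arff");
              data.setClassIndex(data.numAttributes() - 1);

              // Try removing each non-class attribute once (1-based indices, class is last).
              for (int i = 1; i < data.numAttributes(); i++) {
                  Remove remove = new Remove();
                  remove.setAttributeIndices(String.valueOf(i));
                  remove.setInputFormat(data);
                  Instances reduced = Filter.useFilter(data, remove);
                  reduced.setClassIndex(reduced.numAttributes() - 1);

                  Evaluation eval = new Evaluation(reduced);
                  eval.crossValidateModel(new IBk(1), reduced, 10, new Random(1));
                  System.out.printf("without %s: %.2f%% correct%n",
                          data.attribute(i - 1).name(), eval.pctCorrect());
              }
          }
      }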
  7. Is the best accuracy an unbiased estimate of accuracy on future data? Explain. (Hint: To obtain an unbiased estimate of accuracy on future data, we must not look at the test data at all when producing the classification model for which the estimate is being obtained.)
  8. Nearest neighbor learning is sensitive to noise in the training data. You can flip a certain percentage of class labels in the data to a randomly chosen other value using an unsupervised attribute filter called AddNoise. For this experiment it is important that the test data remains unaffected by class noise. Filtering the training data without filtering the test data is a common requirement, and is achieved using a metalearner called FilteredClassifier. This metalearner should be configured to use IBk as the classifier and AddNoise as the filter. FilteredClassifier applies the filter to the data before running the learning algorithm. This is done in two batches: first the training data and then the test data. The AddNoise filter only adds noise to the first batch of data it encounters, which means that the test data passes through unchanged. (A scripted version of this setup is sketched after the table below.)
    1. Record the cross-validated accuracy estimate for IBk for the percentages of class noise given in the table below and neighborhood sizes k=1, k=3, and k=5.
    2. What is the effect of increasing the amount of class noise?
    3. What is the effect of altering the value of k?

      Effect of Class Noise on IBk, for Different Neighborhood Sizes

      Percentage Noise    k=1    k=3    k=5
      ----------------    ---    ---    ---
      0%
      10%
      20%
      30%
      40%
      50%
      60%
      70%
      80%
      90%
      100%
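      A minimal sketch of this noise experiment, under the same assumptions as the sketches after item 6: FilteredClassifier is configured with AddNoise and IBk, so the noise is added to the training folds only.

      // Minimal sketch: class noise vs. accuracy for IBk with k = 1, 3, 5.
      import java.util.Random;
      import weka.classifiers.Evaluation;
      import weka.classifiers.lazy.IBk;
      import weka.classifiers.meta.FilteredClassifier;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.unsupervised.attribute.AddNoise;

      public class NoiseExperiment {
          public static void main(String[] args) throws Exception {
              Instances data = DataSource.read("glass.arff");
              data.setClassIndex(data.numAttributes() - 1);

              for (int k : new int[] {1, 3, 5}) {
                  for (int percent = 0; percent <= 100; percent += 10) {
                      AddNoise noise = new AddNoise();
                      noise.setPercent(percent);   // by default the filter targets the last attribute, i.e. the class

                      IBk ibk = new IBk();
                      ibk.setKNN(k);

                      FilteredClassifier fc = new FilteredClassifier();
                      fc.setFilter(noise);         // applied to the training folds only
                      fc.setClassifier(ibk);

                      Evaluation eval = new Evaluation(data);
                      eval.crossValidateModel(fc, data, 10, new Random(1));
                      System.out.printf("k=%d, %d%% noise: %.2f%% correct%n",
                              k, percent, eval.pctCorrect());
                  }
              }
          }
      }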

    4. Again using the glass data, use IBk and J48 to obtain learning curves with FilteredClassifier. Use it together with Resample, which extracts a specified percentage of a given dataset and returns the reduced dataset. Again, this is done only for the first batch to which the filter is applied, so the test data passes unmodified through the FilteredClassifier before it reaches the classifier. (A scripted version is sketched after question 3 below.)

      1. Record the data for learning curves for both the one-nearest neighbor classifier and J48.

      Effect of Training Set Size on IBk and J48

      Percentage of Training Set    IBk (k=1)    J48
      --------------------------    ---------    ---
      0%
      10%
      20%
      30%
      40%
      50%
      60%
      70%
      80%
      90%
      100%


      2. What is the effect of increasing the amount of training data?
      3. Is this effect more pronounced for IBk or J48?
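
      A minimal sketch of the learning-curve experiment, under the same assumptions as above: Resample shrinks only the training folds inside FilteredClassifier. The percentages start at 10% because a classifier cannot be trained on an empty sample.

      // Minimal sketch: learning curves for IBk (k = 1) and J48 on glass.arff.
      import java.util.Random;
      import weka.classifiers.Classifier;
      import weka.classifiers.Evaluation;
      import weka.classifiers.lazy.IBk;
      import weka.classifiers.meta.FilteredClassifier;
      import weka.classifiers.trees.J48;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.unsupervised.instance.Resample;

      public class LearningCurves {
          public static void main(String[] args) throws Exception {
              Instances data = DataSource.read("glass.arff");
              data.setClassIndex(data.numAttributes() - 1);

              Classifier[] learners = {new IBk(1), new J48()};
              for (Classifier learner : learners) {
                  for (int percent = 10; percent <= 100; percent += 10) {
                      Resample resample = new Resample();
                      resample.setSampleSizePercent(percent);
                      resample.setNoReplacement(true);   // plain subsample, no duplicates

                      FilteredClassifier fc = new FilteredClassifier();
                      fc.setFilter(resample);            // shrinks the training folds only
                      fc.setClassifier(learner);

                      Evaluation eval = new Evaluation(data);
                      eval.crossValidateModel(fc, data, 10, new Random(1));
                      System.out.printf("%s, %d%% of training data: %.2f%% correct%n",
                              learner.getClass().getSimpleName(), percent, eval.pctCorrect());
                  }
              }
          }
      }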

Prepare a report detailing what you did. You should work in groups of 2 or 3. Submit your written report (each group member should submit their own copy) to the Moodle area by 22.00 the evening before the session in which it is due.

This exercise is taken from Witten, Frank & Hall: Data Mining, 3rd Edition.


Some rights reserved. CC-BY-NC Prof. Dr. Debora Weber-Wulff
Questions or comments: <weberwu@htw-berlin.de>