HTW Berlin
Fachbereich 4
Internationaler Studiengang
Internationale Medieninformatik (Master)
Semantic Modeling
Summer Term 2016
Lab 11: Document Classification
This week we will be experimenting with document classification. The raw data is text, which must first be converted into a form suitable for learning: a dictionary of terms is built from all the documents in the training corpus, and a numeric attribute is created for each term using Weka's unsupervised attribute filter StringToWordVector. In addition there is the class attribute, which gives the document's label.
1. To perform document classification, first create an ARFF file with a string attribute that holds the document's text -- declared in the header of the ARFF file as @attribute document string, where document is the name of the attribute -- and a nominal attribute that holds the document's classification. Make an ARFF file from the labeled mini-documents in the following table (a sketch of such a file appears after the table) and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to 2. What attributes are generated now?
Training Documents

Document Text                                         | Classification
The price of crude oil has increased significantly    | yes
Demand for crude oil outstrips supply                 | yes
Some people do not like the flavor of olive oil       | no
The food was very oily                                | no
Crude oil is in short supply                          | yes
Use a bit of cooking oil in the frying pan.           | no
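For reference, here is a minimal sketch of what such an ARFF file could look like; the relation name training_docs and the class attribute name class are assumptions chosen only for illustration. Values containing spaces must be quoted.

@relation training_docs

@attribute document string
@attribute class {yes,no}

@data
'The price of crude oil has increased significantly',yes
'Demand for crude oil outstrips supply',yes
'Some people do not like the flavor of olive oil',no
'The food was very oily',no
'Crude oil is in short supply',yes
'Use a bit of cooking oil in the frying pan.',no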
2. Build a J48 decision tree from the last version of the data you generated.
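If you want to reproduce these two steps outside the Explorer, the following rough sketch uses the Weka Java API; the file name train.arff and the class name BuildTree are assumptions, not part of the exercise.

// Sketch only: apply StringToWordVector with minTermFreq = 2 (the last version of
// the data from exercise 1) and build a J48 tree on the resulting word vectors.
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("train.arff");     // assumed file name
        raw.setClassIndex(raw.numAttributes() - 1);        // the nominal class attribute

        StringToWordVector stwv = new StringToWordVector();
        stwv.setMinTermFreq(2);                            // discard terms occurring fewer than twice (per class, by default)
        stwv.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, stwv);   // one numeric attribute per term, plus the class

        System.out.println(vectors.numAttributes() + " attributes after filtering");

        J48 tree = new J48();                              // default options
        tree.buildClassifier(vectors);
        System.out.println(tree);                          // textual form of the decision tree
    }
}

In the Explorer, the equivalent is to apply the filter in the Preprocess panel and then run J48 from the Classify panel.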
3. Classify these new test documents based on the decision tree generated from the documents in exercise 1. Use FilteredClassifier to apply the same filter to both the training and the test documents, specifying the StringToWordVector filter and J48 as the base classifier. Create an ARFF file from the table below, using question marks for the missing class labels. Configure FilteredClassifier with default options for StringToWordVector and J48, and specify your new ARFF file as the test set. Make sure that you select Output predictions under More options in the Classify panel. Look at the model and the predictions it generates, and verify that they are consistent. What are the predictions? (A code sketch follows the table below.)
Test Documents

Document Text                                          | Classification
Oil platforms extract crude oil                        | unknown
Canola oil is supposed to be healthy                   | unknown
Iraq has significant oil reserves                      | unknown
There are different types of cooking oil              | unknown
He has quite oily skin                                 | unknown
The company was cooking the books on crude oil supply  | unknown
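Below is a rough Java sketch of the same FilteredClassifier setup; the file names train.arff and test.arff and the class name PredictDocs are assumptions. In the Explorer you would instead choose FilteredClassifier in the Classify panel, configure its filter and base classifier, and select your new ARFF file under Supplied test set.

// Sketch: FilteredClassifier wraps StringToWordVector around J48, so the filter
// learned from the training documents is also applied to the test documents.
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PredictDocs {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");    // labeled mini-documents
        Instances test = DataSource.read("test.arff");      // class values are '?'
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());              // default options
        fc.setClassifier(new J48());                         // default options
        fc.buildClassifier(train);                           // the filter is fitted on the training data only

        for (int i = 0; i < test.numInstances(); i++) {
            double pred = fc.classifyInstance(test.instance(i));
            System.out.println(test.instance(i).stringValue(0)
                    + " -> " + test.classAttribute().value((int) pred));
        }
    }
}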
4. A standard collection of newswire articles is widely used for evaluating document classifiers. ReutersCorn-train.arff and ReutersGrain-train.arff are training sets derived from this collection; ReutersCorn-test.arff and ReutersGrain-test.arff are the corresponding test sets. The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles concerning corn-related issues have a class value of 1 and the others a value of 0; the aim is to build a classifier that identifies "corny" articles. In the second, the labeling is performed with respect to grain-related issues.
Build classifiers for the two training sets by applying FilteredClassifier with StringToWordVector, using J48 and NaiveBayesMultinomial as base classifiers, and evaluate each on the corresponding test set. What percentage of correct classifications is obtained in the four scenarios? Based on the results, which classifier would you choose?
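One way the four evaluations could be scripted with the Weka API is sketched below; the ARFF file names match those given above, while the class name ReutersEval and the assumption that the files sit in the working directory are mine.

// Sketch: evaluate StringToWordVector + {J48, NaiveBayesMultinomial} on the
// Reuters corn and grain train/test pairs and print the percentage of correct
// classifications for each of the four combinations.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ReutersEval {
    public static void main(String[] args) throws Exception {
        for (String corpus : new String[] { "ReutersCorn", "ReutersGrain" }) {
            Instances train = DataSource.read(corpus + "-train.arff");
            Instances test = DataSource.read(corpus + "-test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            for (Classifier base : new Classifier[] { new J48(), new NaiveBayesMultinomial() }) {
                FilteredClassifier fc = new FilteredClassifier();
                fc.setFilter(new StringToWordVector());      // default options
                fc.setClassifier(base);
                fc.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(fc, test);
                System.out.printf("%s / %s: %.2f%% correct%n",
                        corpus, base.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }
}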
5. Other evaluation metrics besides the percentage of correct classifications are used for document classification. They are tabulated under Detailed Accuracy By Class in the Classifier Output area and are based on the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The statistics output by Weka are computed as specified in Table 5.7 from the book:
Table 5.7: Different Measures Used to Evaluate the False Positive versus False Negative Tradeoff

Lift chart
  Domain: Marketing
  Plot:   TP vs. subset size
  Axes:   TP = number of true positives
          subset size = (TP + FP) / (TP + FP + TN + FN) * 100%

ROC curve
  Domain: Communications
  Plot:   TP rate vs. FP rate
  Axes:   tp rate = TP / (TP + FN) * 100%
          fp rate = FP / (FP + TN) * 100%

Recall-precision curve
  Domain: Information Retrieval
  Plot:   Recall vs. precision
  Axes:   recall = TP / (TP + FN) * 100%  (same as tp rate above)
          precision = TP / (TP + FP) * 100%
Based on the formulas in Table 5.7, what are the best possible values for each of the output statistics? Describe when these values are attained.
6. The Classifier Output also gives the ROC area (also called the AUC), which is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier. The best outcome is that all positive examples are ranked above all negative examples, in which case the AUC is 1. In the worst case it is 0. If the ranking is essentially random, the AUC is 0.5, and if it is significantly less than this, the classifier has performed anti-learning!
Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
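The AUC can be read from the Classifier Output, or programmatically from the Evaluation object in the sketch after exercise 4; the following fragment assumes the eval and test variables from that sketch and looks up the index of class value 1 rather than hard-coding it.

// Inside the inner loop of the ReutersEval sketch, after eval.evaluateModel(fc, test):
int classOneIndex = test.classAttribute().indexOfValue("1");   // position of class value "1"
System.out.printf("AUC for class 1: %.3f%n", eval.areaUnderROC(classOneIndex));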
7. For the Reuters dataset that produced the most extreme difference in exercise 6, look at the ROC curves for class 1. Make a very rough estimate of the area under each curve, and explain it in complete sentences.
8. What does the ideal ROC curve corresponding to perfect performance look like?
Prepare a report detailing what you did. You should work in groups of 2 or 3. Submit your written report (each group member must submit their own copy) to the Moodle area by 22:00 on the evening before the session in which it is due.
This exercise is taken from Witten, Frank & Hall: Data Mining, 3rd Edition.
Some rights reserved. CC-BY-NC
Prof. Dr. Debora Weber-Wulff
Questions or comments:
<weberwu@htw-berlin.de>