[Title image: Fotopedia, CC BY-NC, Andrea Kirkby, 2008]

HTW Berlin
Fachbereich 4
Internationaler Studiengang
Internationale Medieninformatik (Master)
Semantic Modeling
Summer Term 2016

Lab 11: Document Classification

This week we will be experimenting with document classification. The raw data is text, which must first be converted into a form suitable for learning: a dictionary of terms is built from all the documents in the training corpus, and a numeric attribute is created for each term using Weka's unsupervised attribute filter StringToWordVector. In addition there is the class attribute, which gives the document's label.
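If you would like to see what this transformation does outside the Explorer, here is a minimal sketch using the Weka Java API (assuming a recent Weka 3.7/3.8 release; the dataset name, attribute names, and example sentences are just placeholders). It builds a tiny dataset with one string attribute and a nominal class attribute and prints the word-vector version produced by StringToWordVector:

    import java.util.ArrayList;

    import weka.core.Attribute;
    import weka.core.DenseInstance;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class WordVectorDemo {
        public static void main(String[] args) throws Exception {
            // One string attribute for the text, one nominal attribute for the label
            ArrayList<String> labels = new ArrayList<String>();
            labels.add("yes");
            labels.add("no");
            ArrayList<Attribute> atts = new ArrayList<Attribute>();
            atts.add(new Attribute("document", (ArrayList<String>) null)); // string attribute
            atts.add(new Attribute("class", labels));

            Instances data = new Instances("mini-corpus", atts, 2);
            data.setClassIndex(1);
            addDoc(data, "The price of crude oil has increased significantly", "yes");
            addDoc(data, "The food was very oily", "no");

            // Create one numeric attribute per term found in the documents
            StringToWordVector filter = new StringToWordVector();
            filter.setInputFormat(data);
            Instances vectors = Filter.useFilter(data, filter);
            System.out.println(vectors);
        }

        private static void addDoc(Instances data, String text, String label) {
            double[] vals = new double[data.numAttributes()];
            vals[0] = data.attribute(0).addStringValue(text);
            vals[1] = data.classAttribute().indexOfValue(label);
            data.add(new DenseInstance(1.0, vals));
        }
    }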

  1. To perform document classification, first create an ARFF file with a string attribute that holds the document's text -- declared in the header of the ARFF file as @attribute document string, where document is the name of the attribute. A nominal attribute is also needed to hold the document's classification. Make an ARFF file from the labeled mini-documents in the following table and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to 2. What attributes are generated now? (A sketch of how to perform the same filtering step programmatically follows the table.)

    Training Documents

    Document Text                                          Classification
    The price of crude oil has increased significantly     yes
    Demand for crude oil outstrips supply                  yes
    Some people do not like the flavor of olive oil        no
    The food was very oily                                 no
    Crude oil is in short supply                           yes
    Use a bit of cooking oil in the frying pan.            no
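
    If you want to double-check your attribute counts outside the Explorer, the following sketch loads the training file and applies the filter twice, once with default options and once with minTermFreq set to 2. The file name oil-train.arff is just a placeholder for whatever you called your ARFF file, and the class is assumed to be the last attribute:

        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.Filter;
        import weka.filters.unsupervised.attribute.StringToWordVector;

        public class CountAttributes {
            public static void main(String[] args) throws Exception {
                Instances data = new DataSource("oil-train.arff").getDataSet();
                data.setClassIndex(data.numAttributes() - 1);

                // Default options
                StringToWordVector defaults = new StringToWordVector();
                defaults.setInputFormat(data);
                System.out.println("default:       "
                        + Filter.useFilter(data, defaults).numAttributes() + " attributes");

                // minTermFreq = 2
                StringToWordVector minFreq2 = new StringToWordVector();
                minFreq2.setMinTermFreq(2);
                minFreq2.setInputFormat(data);
                System.out.println("minTermFreq 2: "
                        + Filter.useFilter(data, minFreq2).numAttributes() + " attributes");
            }
        }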

  2. Build a J48 decision tree from the last version of the data you generated.
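
    The same step outside the Explorer is sketched below; it uses the same placeholder file name oil-train.arff as above, re-applies StringToWordVector with minTermFreq set to 2, and prints the resulting tree:

        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.Filter;
        import weka.filters.unsupervised.attribute.StringToWordVector;

        public class BuildTree {
            public static void main(String[] args) throws Exception {
                Instances data = new DataSource("oil-train.arff").getDataSet();
                data.setClassIndex(data.numAttributes() - 1);

                // Reproduce the last version of the data from exercise 1 (minTermFreq = 2)
                StringToWordVector filter = new StringToWordVector();
                filter.setMinTermFreq(2);
                filter.setInputFormat(data);
                Instances vectors = Filter.useFilter(data, filter);

                // Build and print the decision tree
                J48 tree = new J48();
                tree.buildClassifier(vectors);
                System.out.println(tree);
            }
        }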
  3. Classify these new test documents based on the decision tree generated from the training documents in exercise 1. Use FilteredClassifier to apply the same filter to both the training and the test documents, specifying the StringToWordVector filter and J48 as the base classifier. Create an ARFF file from the table below, using question marks for the missing class labels. Configure FilteredClassifier using default options for StringToWordVector and J48, and specify your new ARFF file as the test set. Make sure that you select Output predictions under More options in the Classify panel. Look at the model and the predictions it generates, and verify that they are consistent. What are the predictions? (A programmatic sketch of the same setup follows the table.)

    Test Documents

    Document Text                                            Classification
    Oil platforms extract crude oil                          unknown
    Canola oil is supposed to be healthy                     unknown
    Iraq has significant oil reserves                        unknown
    There are different types of cooking oil                 unknown
    He has quite oily skin                                   unknown
    The company was cooking the books on crude oil supply    unknown
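
    A minimal programmatic version of this setup is sketched below. The file names oil-train.arff and oil-test.arff are placeholders for your two ARFF files, and the class is assumed to be the last attribute; the loop prints the predicted label for each test document:

        import weka.classifiers.meta.FilteredClassifier;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.unsupervised.attribute.StringToWordVector;

        public class ClassifyTestDocs {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("oil-train.arff").getDataSet();
                Instances test = new DataSource("oil-test.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                // FilteredClassifier applies the same filter to training and test data
                FilteredClassifier fc = new FilteredClassifier();
                fc.setFilter(new StringToWordVector());
                fc.setClassifier(new J48());
                fc.buildClassifier(train);

                for (int i = 0; i < test.numInstances(); i++) {
                    double pred = fc.classifyInstance(test.instance(i));
                    System.out.println(test.instance(i).stringValue(0) + " -> "
                            + test.classAttribute().value((int) pred));
                }
            }
        }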

  4. A standard collection of newswire articles is widely used for evaluating document classifiers. ReutersCorn-train.arff and ReutersGrain-train.arff are training sets derived from this collection; ReutersCorn-test.arff and ReutersGrain-test.arff are the corresponding test sets. The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles concerning corn-related issues have a class value of 1 and the others 0; the aim is to build a classifier that identifies "corny" articles. In the second, the labeling is performed with respect to grain-related issues.

    Build classifiers for the two training sets by applying FilteredClassifier with StringToWordVector, using J48 and NaiveBayesMultinomial in turn as the base classifier, and evaluate them on the corresponding test set in each case. What percentage of correct classifications is obtained in each of the four scenarios? Based on the results, which classifier would you choose?
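
    If you would rather script the four runs than click through the Explorer four times, the following sketch prints the percentage of correct classifications for each combination. It assumes the four Reuters ARFF files are in the working directory and that the class is the last attribute:

        import weka.classifiers.Classifier;
        import weka.classifiers.Evaluation;
        import weka.classifiers.bayes.NaiveBayesMultinomial;
        import weka.classifiers.meta.FilteredClassifier;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.unsupervised.attribute.StringToWordVector;

        public class ReutersRuns {
            public static void main(String[] args) throws Exception {
                String[] corpora = { "ReutersCorn", "ReutersGrain" };
                Classifier[] bases = { new J48(), new NaiveBayesMultinomial() };

                for (String corpus : corpora) {
                    Instances train = new DataSource(corpus + "-train.arff").getDataSet();
                    Instances test = new DataSource(corpus + "-test.arff").getDataSet();
                    train.setClassIndex(train.numAttributes() - 1);
                    test.setClassIndex(test.numAttributes() - 1);

                    for (Classifier base : bases) {
                        FilteredClassifier fc = new FilteredClassifier();
                        fc.setFilter(new StringToWordVector());
                        fc.setClassifier(base);
                        fc.buildClassifier(train);

                        Evaluation eval = new Evaluation(train);
                        eval.evaluateModel(fc, test);
                        System.out.printf("%s / %s: %.2f%% correct%n",
                                corpus, base.getClass().getSimpleName(), eval.pctCorrect());
                    }
                }
            }
        }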

  5. Other evaluation metrics are used for document classification besides the percentage of correct classifications. They are tabulated under Detailed Accuracy By Class in the Classifier Output area and are computed from the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The statistics output by Weka are computed as specified in the book in Table 5.7:

    Table 5.7: Different measures used to evaluate the false positive versus false negative tradeoff

    Lift chart (domain: marketing) -- plots TP vs. subset size
        TP          = number of true positives
        subset size = (TP + FP) / (TP + FP + TN + FN) * 100%

    ROC curve (domain: communications) -- plots TP rate vs. FP rate
        TP rate (tp) = TP / (TP + FN) * 100%
        FP rate (fp) = FP / (FP + TN) * 100%

    Recall-precision curve (domain: information retrieval) -- plots recall vs. precision
        recall    = TP / (TP + FN) * 100%   (same as TP rate above)
        precision = TP / (TP + FP) * 100%

    Based on the formulas in Table 5.7, what are the best possible values for each of the output statistics? Describe when these values are attained.
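
    To make the formulas concrete, here is a worked example with purely hypothetical counts (not taken from any dataset in this lab): with TP = 3, FP = 1, TN = 4 and FN = 2, the subset size is (3 + 1) / (3 + 1 + 4 + 2) * 100% = 40%, the TP rate (= recall) is 3 / (3 + 2) * 100% = 60%, the FP rate is 1 / (1 + 4) * 100% = 20%, and the precision is 3 / (3 + 1) * 100% = 75%.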

  6. The Classifier Output also gives the ROC area (also called the AUC), which is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier. The best outcome is that all positive examples are ranked above all negative examples, in which case the AUC is 1. In the worst case it is 0. In the case where the ranking is essentially random, the AUC is 0.5, and if it is significantly less than this the classifier has performed anti-learning!

    Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
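
    In the Explorer the AUC appears in the ROC Area column of Detailed Accuracy By Class. If you would rather extract it programmatically, a minimal sketch (again assuming the Reuters files are in the working directory, the class is the last attribute, and class value 1 marks the relevant articles) looks like this; swap in NaiveBayesMultinomial and the grain files to cover all four cases:

        import weka.classifiers.Evaluation;
        import weka.classifiers.meta.FilteredClassifier;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.unsupervised.attribute.StringToWordVector;

        public class ReutersAUC {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("ReutersCorn-train.arff").getDataSet();
                Instances test = new DataSource("ReutersCorn-test.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                FilteredClassifier fc = new FilteredClassifier();
                fc.setFilter(new StringToWordVector());
                fc.setClassifier(new J48());
                fc.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(fc, test);

                // AUC for class "1" (the relevant articles)
                int classOne = test.classAttribute().indexOfValue("1");
                System.out.println("AUC for class 1: " + eval.areaUnderROC(classOne));
            }
        }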

  7. For the Reuters dataset that produced the most extreme difference in exercise 6, look at the ROC curves for class 1. Make a very rough estimate of the area under each curve, and explain it in complete sentences.

  8. What does the ideal ROC curve corresponding to perfect performance look like?

Prepare a report detailing what you did. You should work in groups of 2 or 3. Submit your written report (each group member must submit their own copy) to the Moodle area by 22.00 on the evening before the session in which it is due.

This exercise is taken from Witten, Frank & Hall: Data Mining, 3rd Edition.


Some rights reserved. CC-BY-NC Prof. Dr. Debora Weber-Wulff
Questions or comments: <weberwu@htw-berlin.de>