Haskell text classification using Tf-Idf

Posted on September 21, 2016

This is part two in a two-part blog series about Haskell terminal applications. This post shows a simple text classification implementation using techniques from part one.

Text Classification with Tf-Idf

There are many ways to classify documents, ranging from the simple to the very complex. The algorithm I’m using here is called Tf-Idf, or “term frequency / inverse document frequency”. There are a number of sites that explain how it works in more detail than I will here. See for example

  1. Wikipedia
  2. What does tf-idf mean?

Basically Tf-Idf counts how often a term occurs in a category (term frequency) and combines that with a negative weighting based on how many of the categories the term occurs in (inverse document frequency). This means that common words that exist in multiple categories count less towards the final score.
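
One common formulation, treating each category as a “document” (the exact weighting scheme varies between implementations, including this one), is:

    \mathrm{tf}(t, c) = \frac{f_{t,c}}{\sum_{t'} f_{t',c}}
    \qquad
    \mathrm{idf}(t) = \log \frac{N}{\lvert \{\, c : t \in c \,\} \rvert}
    \qquad
    \mathrm{tfidf}(t, c) = \mathrm{tf}(t, c) \cdot \mathrm{idf}(t)

where f_{t,c} is the number of times term t occurs in category c and N is the total number of categories.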

There are also multiple ways that Tf-Idf itself can be implemented, e.g. with different weighting schemes for the Tf and the Idf, or using n-grams (where n > 1). I’m going with a pretty simple implementation, but even so I’ve seen pretty accurate results for the classifications I’m doing. I’m primarily using this for classifying short sentences of text, so it has only been tested for simple matching on relatively small documents.

The TextClassification application

You can get the source for TextClassify at https://github.com/andrevdm/TextClassify. The code is reasonably well commented IMO, so I won’t go into too much detail here on every line of code.

Below I’ll discuss some implementation details not covered by the code comments.

Using the application

  1. The user sets up a directory of text files, one file per category.
    • These files contain the text that each category should match against.
    • Since, in this implementation, I’m not using n-grams, each file is treated as a “bag of words” and newlines etc. are ignored.
  2. Given the set of categories (the training set) the user then provides an input file (or pipes the data via stdin) containing the text to be matched.
    • The data can be provided as plain text or as a CSV.
  3. The application will ‘clean’ the input data and classify it.
  4. The results are written to stdout and can be piped to a file if required.

Using sed, awk and column

There are a large number of existing terminal applications, so it often makes sense to use that existing functionality, and to write your own terminal applications so that they too can be reused.

Removing lines with awk

The CSV files I work with have a header that needs to be removed. Here is an awk script (removePrefix.awk) to do that.
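
A minimal version, which simply drops the first line (the header) and passes everything else through unchanged, might look like this:

    # removePrefix.awk (sketch): skip the header line, print everything else
    NR > 1 { print }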

This script can be used to pre-process the CSV file.

Cleaning text with sed

The higher the quality of the input data to the classification algorithm, the better the results will be. Some regular expressions can clean up the input text nicely. Here is a sed script that does this.
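
A simplified version of that script (GNU sed syntax) looks something like this:

    # cleanText.sed (sketch)
    # Drop some common, low-signal words (the month names)
    s/\b\(january\|february\|march\|april\|may\|june\|july\|august\|september\|october\|november\|december\)\b//gI
    # Replace anything that is not a letter, digit or space with a space
    s/[^a-zA-Z0-9 ]/ /g
    # Collapse runs of spaces into a single space
    s/  */ /g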

This sed script removes some common words (the months), strips special characters and collapses multiple spaces. You can customise this, or create one script per type of input, as required. The -u parameter is important as it disables buffering, which would otherwise interfere with line-by-line processing.

The TextClassification application will start sed and keep it running. A line of input data will be passed to it and the result read back a line at a time.

Displaying CSV results with column

column can be used to show CSV data as an aligned table in the terminal. I’ll use this later to show the results of the classification.
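
For example, given comma-separated output in a (hypothetical) results.csv, the following prints it as a neatly aligned table:

    cat results.csv | column -s , -t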

Command line arguments

See Args.hs

As part one showed, I’m using OptParse-generic to parse the command line arguments.
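
With OptParse-generic the arguments are just a record with one field per flag. A sketch of roughly what Args.hs declares (the field names below are inferred from the flags used later in this post, so treat it as illustrative rather than the actual code):

    {-# LANGUAGE DeriveGeneric     #-}
    {-# LANGUAGE OverloadedStrings #-}

    import Data.Text (Text)
    import GHC.Generics (Generic)
    import Options.Generic (ParseRecord, getRecord)

    -- Illustrative option record; each field becomes a --flag on the command line
    data Args = Args
      { train  :: FilePath       -- directory containing the training set
      , parser :: Text           -- which parser to use: csv, lines or detail
      , popts  :: Maybe Text     -- parser options, e.g. which CSV column to classify
      , clean  :: Maybe FilePath -- optional per-line cleaning script (e.g. a sed file)
      , input  :: Maybe FilePath -- input file; stdin is used when this is omitted
      } deriving (Generic, Show)

    instance ParseRecord Args

    getArgs :: IO Args
    getArgs = getRecord "txtcls"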

OptParse-generic then generates the usage / help text for these arguments automatically.

These arguments are then interpreted and stored in the Options type.
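
Again only as a sketch, the interpreted options might look along these lines (the names are illustrative):

    import Data.Text (Text)

    -- Which of the three parsers to run
    data ParserType = ParserCsv | ParserLines | ParserDetail

    -- Interpreted, validated options used by the rest of the application
    data Options = Options
      { optTrainingDir   :: FilePath
      , optParserType    :: ParserType
      , optParserOptions :: Maybe Text
      , optCleaner       :: Maybe FilePath
      , optInput         :: Maybe FilePath
      }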

Input handle

hin is set to the handle of the input stream: stdin if no --input parameter is present, otherwise a handle to the specified file.
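
That logic is only a couple of lines, something like:

    import System.IO (Handle, IOMode (ReadMode), openFile, stdin)

    -- Sketch: use stdin unless an --input file was given
    getInputHandle :: Maybe FilePath -> IO Handle
    getInputHandle Nothing     = pure stdin
    getInputHandle (Just path) = openFile path ReadMode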

Text cleaning with the cleaning script

Above I showed a sed script that could be used to clean the input text. However, because this application can take a CSV as input, it can’t simply apply the cleaning to the entire file, or even to an entire line; only the text being classified should be cleaned. To do this an instance of sed is started and fed the text to clean one line at a time. (Actually any app could be used, as long as it reads and writes one line at a time.) The name of the app / script to use is given by the --clean parameter.

The getCleaner function is passed the (optional) name of the cleaner script. If a script was specified then a process is started and a curried cleanText function is returned as the cleaning function. If no script was specified then the returned cleaning function simply performs a toLower on the text.

cleanText writes a line to the input handle for the process and then immediately reads the response line.
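
A sketch of the pair, using the long-running-process approach from part one (the real versions differ in the details, e.g. error handling):

    import           Data.Text (Text)
    import qualified Data.Text as Txt
    import qualified Data.Text.IO as Txt
    import           System.IO (BufferMode (LineBuffering), Handle, hSetBuffering)
    import           System.Process (StdStream (CreatePipe), createProcess, proc, std_in, std_out)

    -- Sketch: start the cleaning script once and return a per-line cleaning function.
    -- The script is assumed to be directly executable (e.g. it has a shebang line).
    getCleaner :: Maybe FilePath -> IO (Text -> IO Text)
    getCleaner Nothing = pure (pure . Txt.toLower)
    getCleaner (Just script) = do
      (Just hIn, Just hOut, _, _) <- createProcess (proc script []) { std_in  = CreatePipe
                                                                    , std_out = CreatePipe
                                                                    }
      hSetBuffering hIn LineBuffering
      pure (cleanText hIn hOut)

    -- Write one line to the cleaner's stdin, then read the cleaned line back
    cleanText :: Handle -> Handle -> Text -> IO Text
    cleanText hIn hOut line = do
      Txt.hPutStrLn hIn line
      Txt.hGetLine hOut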

Reading the input data

TextClassify has three parsers:

  1. CSV - one of the columns is the data column
  2. Lines - each line is the data
  3. Detail - same as Lines but additional information is printed for each input line

whileM_ is used to read one line of input at a time. Each line is then passed to the appropriate parser, i.e. CSV, Lines or Detail.
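
whileM_ (assumed here to come from the monad-loops package) keeps the loop small; roughly:

    import           Control.Monad.Loops (whileM_)
    import           Data.Text (Text)
    import qualified Data.Text.IO as Txt
    import           System.IO (Handle, hIsEOF)

    -- Sketch: read until EOF, handing each line to whichever parser was selected
    processLines :: Handle -> (Text -> IO ()) -> IO ()
    processLines hin parseLine =
      whileM_ (not <$> hIsEOF hin) $ do
        line <- Txt.hGetLine hin
        parseLine line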

Parsing the CSV Data

See ClassifyCsv.hs and Classify.hs

I’m using Cassava to read the CSV data as well as to create the output CSV. Since I’m not interpreting any of the data apart from the text to be classified, I’m simply reading each CSV record as a vector of Text.
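
A sketch of decoding a single input line into its fields with Cassava (the real code also takes care of producing the output CSV):

    import qualified Data.ByteString.Lazy as BL
    import           Data.Csv (HasHeader (NoHeader), decode)
    import           Data.Text (Text)
    import qualified Data.Text.Encoding as TE
    import qualified Data.Vector as V

    -- Sketch: decode one CSV line into a vector of Text fields
    parseCsvFields :: Text -> Either String (V.Vector Text)
    parseCsvFields line =
      case decode NoHeader (BL.fromStrict (TE.encodeUtf8 line)) of
        Left err -> Left err
        Right rows
          | V.null rows -> Left "no fields found"
          | otherwise   -> Right (V.head rows)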

Given a vector of Text it is simple to get the column containing the text to classify. The parseCsvLine function returns a ParsedLine a type which contains the text to be classified.

Remember that each line of data must be cleaned. Rather than having parseCsvLine live in IO, it returns a ParsedLine a. The code in Main then calls the cleaner and passes the resulting CleanedLine a to categoriseCsvLine. This limits the amount of code that needs to be in IO. It also makes the code easier to test (e.g. from the REPL) as the two functions can be tested independently.

Tf-Idf

Training set

See ClassifyIO.hs

The training set is a directory with a file per category. Each file contains the words for that category. To load the files the loadTrainingSet function is used.
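
A sketch of what that might look like (the real version is in ClassifyIO.hs; the types here are illustrative):

    import           Data.List (isSuffixOf)
    import           Data.Text (Text)
    import qualified Data.Text as Txt
    import qualified Data.Text.IO as Txt
    import           System.Directory (listDirectory)
    import           System.FilePath (takeBaseName, (</>))

    -- Sketch: one category per .txt file, named after the file
    loadTrainingSet :: FilePath -> IO [(Text, [Text])]
    loadTrainingSet dir = do
      files <- filter (".txt" `isSuffixOf`) <$> listDirectory dir
      traverse (loadCategory dir) files

    -- A category is the file's base name plus its contents as a bag of lower-cased words
    loadCategory :: FilePath -> FilePath -> IO (Text, [Text])
    loadCategory dir file = do
      contents <- Txt.readFile (dir </> file)
      pure (Txt.pack (takeBaseName file), Txt.words (Txt.toLower contents))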

All .txt files in the directory are loaded, and each one becomes a category with its bag of words.

Tf-Idf

See TfIdf.hs

To review the terminology: a “term” is a single word, a “document” corresponds to a category’s bag of words, and the training set is the collection of all the categories.

The train function takes a TrainingSet and creates a TrainedData.

Categorising text is handled by the categorise function. Given a collection of words it returns the best matching category, if one was found. classifyDetail returns all possible matches, sorted best match first. Both functions use categoriseWords to do the actual classification.

To calculate the Tf and the Idf values the following two functions are used:
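
They boil down to the standard definitions. A sketch, assuming (purely for illustration) that a trained category is represented as a map from term to count:

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Text (Text)

    -- Term frequency: how often the term occurs in the category,
    -- relative to the total number of terms in that category
    calcTf :: Map Text Int -> Text -> Double
    calcTf termCounts term =
      fromIntegral (Map.findWithDefault 0 term termCounts)
        / fromIntegral (max 1 (sum (Map.elems termCounts)))

    -- Inverse document frequency: log of the total number of categories
    -- divided by the number of categories that contain the term
    calcIdf :: [Map Text Int] -> Text -> Double
    calcIdf categories term =
      let total      = length categories
          containing = length (filter (Map.member term) categories)
      in  log (fromIntegral total / fromIntegral (max 1 containing))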

Notice that there is no need for IO at all in the TfIdf module. It is given a loaded training set and cleaned text to classify.

The classification is then just finding the category with the closest matching tf-idf value.
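
Building on the hypothetical calcTf and calcIdf above, that amounts to scoring every category against the (cleaned) input words and sorting, something like:

    import           Data.List (sortBy)
    import           Data.Map.Strict (Map)
    import           Data.Ord (Down (Down), comparing)
    import           Data.Text (Text)

    -- Sketch: score each trained category, best match first
    -- (categorise would then just take the head of this list)
    rankCategories :: [(Text, Map Text Int)] -> [Text] -> [(Text, Double)]
    rankCategories trained ws =
      let allCounts    = map snd trained
          score counts = sum [calcTf counts w * calcIdf allCounts w | w <- ws]
      in  sortBy (comparing (Down . snd)) [(name, score counts) | (name, counts) <- trained]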

Using txtcls

Building
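
The project uses Stack, so building should be as simple as:

    stack build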

Installing
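
And then to install the txtcls binary:

    stack install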

This will install txtcls into your local stack bin folder.

Usage Instructions

Usage examples

The examples folder contains scripts showing how txtcls can be used. The files are:

  1. cleanText.sed - sed script for cleaning the words
  2. skipLines.awk - awk script for skipping lines in the input CSV
  3. egLines.txt - example of data where each line is the data
  4. egCsv.csv - example of data in csv
  5. egCsvWithHeader.csv - example of data in CSV with a header text
  6. demoCsv.sh - run the example on egCsv.csv
  7. demoLines.sh - run the example on egLines.txt
  8. demoDetail.sh - run the example on egLines.txt using the detail output
  9. demoCsvWithHeader.sh - run the example on egCsvWithHeader.csv
  10. demoDetailInteractive.sh - run the detail parser interactively, read a line from stdin and write to stdout
  11. trainingData/cs.txt - training data for one of the example categories
  12. trainingData/hasekll.txt - training data for the other example category

Lines

Where

Detail

Where

CSV

Where

CSV with header text

skipLines.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t

Where:

  • skipLines.awk strips the header line from egCsvWithHeader.csv
  • --train ./trainingData points at the directory of category files (the training set)
  • --parser csv selects the CSV parser, and --popts passes it the column containing the text to classify
  • --clean ./cleanText.sed is the per-line cleaning script
  • column -s , -t displays the resulting CSV as an aligned table

Interactive

In conclusion

The source code for TextClassify is available and commented. You should hopefully be able to look at it and understand how it was implemented.

The important things to notice are how little of the code needs to live in IO (the TfIdf module is completely pure), and how the application composes with existing terminal tools like awk, sed and column.

Links

  1. Source code for the examples
  2. OptParse-generic
  3. Cassava
  4. Stack
  5. Protolude
  6. Haskell Programming from first principles.