Haskell text classification using Tf-Idf
This is part two in a two-part blog series about Haskell terminal applications. This post shows a simple text classification implementation using techniques from part one.
Text Classification with Tf-Idf
There are many ways to classify documents, ranging from simple to very complex. The algorithm I’m using here is called Tf-Idf, or “term frequency / inverse document frequency”. There are a number of sites that explain how it works in more detail than I will here.
Basically, Tf-Idf counts the number of times a term occurs in a document (term frequency) and combines that with a negative weighting for the number of categories in which the term occurs (inverse document frequency). This means that common words that appear in multiple categories count less towards the final score.
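Concretely, the variant implemented later in this post boils down to the following weights (a summary of the code shown further down, with N the number of documents, here categories, and n_t the number of documents containing term t):

\mathrm{tf}(t,d) = \frac{\mathrm{count}(t,d)}{|d|}
\qquad
\mathrm{idf}(t) = \log\frac{N + 1}{n_t + 1}
\qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)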
There are also multiple ways that Tf-Idf itself can be implemented, e.g. with different weighting schemes for the Tf vs the Idf, or using n-grams (where n > 1). I’m going with a pretty simple implementation, but even so I’ve seen quite accurate results for the classifications I’m doing. I’m primarily using this for classifying short sentences of text, so it has only been tested for simple matching on relatively small documents.
The TextClassification application
You can get the source for TextClassify at https://github.com/andrevdm/TextClassify. The code is reasonably well commented IMO, so I won’t go into too much detail here on every line of code.
Below I’ll discuss some implementation details not covered by the code comments.
Using the application
- The user sets up a directory of text files, one file per category.
- These files contain the text that each category should match against.
- Since, in this implementation, I’m not using n-grams, each file is treated as a “bag of words” and newlines etc. are ignored.
- Given the set of categories (the training set), the user then provides an input file (or pipes the text via stdin) containing the text to be matched.
- The data can be provided as plain text or as a CSV.
- The application will ‘clean’ the input data and classify it.
- The results are written to stdout and can be piped to a file if required.
Using sed, awk and column
There is a large ecosystem of existing terminal applications, so it often makes sense to reuse their functionality, and to write your own terminal applications so that they too can be reused.
Removing lines with awk
The CSV files I work with have a header that needs to be removed. Here is an awk script (removePrefix.awk) to do that.
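The script itself isn’t reproduced in this post; a minimal sketch that does the job, assuming (as the example later in the post does) that the header spans the first 4 lines, would be:

#!/usr/bin/awk -f
# Print every line after the first 4 header lines
NR > 4 { print }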
This script can be used to pre-process the CSV file
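For example (a sketch, reusing the example files that appear later in the post):

./removePrefix.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2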
Cleaning text with sed
The higher the quality of the input data to the classification algorithm, the better the results will be. A few regular expressions can clean up the input text nicely. Here is a sed script that does this:
#!/bin/sed -uf
# Remove 'c*' markers from the text
s/c\*/ /gi
# Remove month abbreviations (common words that carry no signal)
s/jan\|feb\|mar\|apr\|may\|jun\|jul\|aug\|sep\|oct\|nov\|dec/ /gi
# Strip trailing spaces
s/ \+$//gi
# Replace brackets, punctuation and other special characters with spaces
s/\[\(\)!\-\/*\\\]/ /g
s/[\.*\/\(\)_,\]/ /g
# Replace hyphens and tabs with spaces
s/-/ /g
s/\t/ /g
# Collapse runs of spaces, then strip trailing and leading spaces
s/ \+/ /g
s/ \+$//gi
s/^ \+//gi
This sed script removes some common words (the month abbreviations), removes special characters, and collapses multiple spaces. You can customise this, or create one script per type of input, as required. The -u parameter is important as it disables buffering, which would otherwise interfere with line-by-line processing.
The TextClassification application will start sed and keep it running. A line of input data will be passed to it and the result read back a line at a time.
Displaying CSV results with column
column can be used to show CSV data as an aligned table in the terminal. I’ll use this later to show the results of the classification.
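For example (assuming a results.csv produced by the classifier; -s sets the input separator, -t creates a table):

column -s , -t results.csv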
Command line arguments
See Args.hs
As part one showed, I’m using optparse-generic to parse the command line arguments.
data Arguments = Arguments {train :: Text <?> "Path to training data"
,input :: Maybe Text <?> "Input file to categorise. If missing stdin will be used"
,parser :: Maybe Text <?> "Parser type, defaults to lines. Options are lines/detail/csv"
,popts :: Maybe Text <?> "Parser options"
,clean :: Maybe Text <?> "Optional name of text cleaner - see docs"
} deriving (Generic, Show)
instance ParseRecord Arguments
This is the resulting help text from these arguments
Usage: txtcls --train TEXT [--input TEXT] [--parser TEXT] [--popts TEXT]
              [--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Optional name of text cleaner - see docs
These arguments are then interpreted and stored in the Options type
data Options = Options {trainingPath :: Text
,parserType :: Text
,parserOptions :: Maybe Text
,txtCleaner :: Text -> IO Text
,hin :: Handle
,hout :: Handle
}
Input handle
hin is set to the handle of the input stream: stdin if no --input parameter is present, else the handle for the file.
hin_ <- case unHelpful $ input args of
Just t ->
openFile (Txt.unpack t) ReadMode
Nothing ->
pure stdin
Text cleaning with the cleaning script
Above I showed a sed script that can be used to clean the input text. However, because this application can take a CSV as input, it can’t simply apply the cleaning to the entire file, or even an entire line; only the text being classified should be cleaned. To do this an instance of sed is started and fed the text to clean one line at a time. (Actually, any app could be used, as long as it reads and writes one line at a time.) The name of the app / script to use is given by the --clean parameter.
-- | Build a 'cleaner'
getCleaner :: Maybe Text -> IO (Text -> IO Text)
getCleaner mcmd =
case mcmd of
-- | The cleaner uses the external process to do the actual cleaning. One line is written to the process's stdin and then a value is read from its stdout
Just cmd -> do
(Just inp, Just outp, _, phandle) <- createProcess (proc (Txt.unpack cmd) []) { std_out = CreatePipe, std_in = CreatePipe }
hSetBuffering outp NoBuffering
hSetBuffering inp LineBuffering
pure $ cleanText inp outp
-- | No external cleaner. Just make the text lower case
Nothing ->
pure $ pure . Txt.toLower
-- | Used by getCleaner to build a curried cleaner function
cleanText :: Handle -> Handle -> Text -> IO Text
cleanText inp outp txt = do
hPutStrLn inp $ Txt.unpack (Txt.toLower txt)
pure . Txt.pack =<< hGetLine outp
The getCleaner function is passed the (optional) name of the cleaner script. If a script was specified then a process is started and a curried cleanText function is returned as the cleaning function. If no script was specified then the returned cleaning function simply performs a toLower on the text.
cleanText writes a line to the input handle for the process and then immediately reads the response line.
Reading the input data
TextClassifier has three parsers
- CSV - one of the columns is the data column
- Lines - each line is the data
- Detail - same as line but additional information is printed for each input line
whileM_ is used to read a line of input at a time. Each line is then passed to the appropriate parser, i.e. CSV, lines or detail.
-- | Read input a line at a time and pass it to the parser
whileM_ (not <$> IO.hIsEOF (Args.hin opts)) $ do
-- | line of data
origChars <- IO.hGetLine $ Args.hin opts
let origLine = Txt.pack origChars
-- | parse the line and get the results to display
parsed <- case parser of
---
Parsing the CSV Data
See ClassifyCsv.hs and Classify.hs
I’m using Cassava to read the CSV data, as well as to create the output CSV. Since I’m not interpreting any of the data apart from the text to be classified, I simply read each row as a vector of Text.
let contents = BL8.pack . Txt.unpack $ line in
let parsed = decode NoHeader contents :: Either [Char] (V.Vector [Text]) in
Given a vector of Text it is simple to get the column containing the text to classify. The parseCsvLine function returns a ParsedLine a, which contains the text to be classified.
newtype RawText = RawText Text deriving (Show)
data ParsedLine a = ParsedLine RawText a deriving (Show)
Remember that each line of data must be cleaned. Rather than having parseCsvLine live in IO, it returns a ParsedLine a. The code in Main then calls the cleaner and passes the resulting CleanedLine a to categoriseCsvLine. This limits the amount of code that needs to be in IO. It also makes the code easier to test (e.g. from the REPL) as the two functions can be tested independently.
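As a sketch, the shape of that flow is roughly the following (these signatures are assumptions for illustration, not copied from the repo; the Int is the column index from --popts and cleanLine is a hypothetical name):

-- | Pure: pull the text to classify out of a CSV row
parseCsvLine :: Int -> Text -> Either Text (ParsedLine [Text])
-- | IO only where unavoidable: run the external cleaner over the raw text
cleanLine :: (Text -> IO Text) -> ParsedLine a -> IO (CleanedLine a)
-- | Pure again: classify the cleaned text against the trained data
categoriseCsvLine :: TrainedData -> CleanedLine a -> Maybe (Category, Double)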
Tf-Idf
Training set
See ClassifyIO.hs
The training set is a directory with a file per category. Each file contains the words for that category. To load the files the loadTrainingSet function is used
newtype Category = Category Text deriving (Show, Eq, Ord)
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
loadTrainingSet :: Args.Options -> FilePath -> IO TrainingSet
All .txt files in the directory are loaded, each becoming one category with its words.
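A minimal sketch of what this amounts to (the real version is in ClassifyIO.hs, takes the Args.Options shown earlier, and also applies the configured cleaner; this simplified version is an assumption for illustration):

-- | Sketch: each .txt file becomes one (Category, words) pair
-- | (needs: listDirectory from System.Directory, (</>) and dropExtension from
-- | System.FilePath, isSuffixOf from Data.List, for from Data.Traversable,
-- | Txt.readFile from Data.Text.IO)
loadTrainingSet :: FilePath -> IO TrainingSet
loadTrainingSet dir = do
  files <- filter (".txt" `isSuffixOf`) <$> listDirectory dir
  cats <- for files $ \f -> do
    ws <- Txt.words . Txt.toLower <$> Txt.readFile (dir </> f)
    pure (Category . Txt.pack $ dropExtension f, ws)
  pure $ TrainingSet cats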
Tf-Idf
See TfIdf.hs
To review the terminology
- Term - a single word
- Category - category name
- Document - a document of terms mapped to their TfIdf value
The train function takes a TrainingSet and creates a TrainedData:
-- | A term is a single word
newtype Term = Term Text deriving (Show, Eq, Ord)
-- | A category name
newtype Category = Category Text deriving (Show, Eq, Ord)
-- | Term frequency value
newtype Tf = Tf Double deriving (Show)
-- | Inverse document frequency value
newtype Idf = Idf Double deriving (Show)
-- | The combined Tf and Idf value
newtype TfIdf = TfIdf Double deriving (Show)
-- | A document is a map of terms to TfIdf
newtype Document = Document (Map Term TfIdf) deriving (Show)
-- | Data making up the training set
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
-- | The trained data, each category linked to a document
data TrainedData = TrainedData [(Category, Document)] deriving (Show)
train :: TrainingSet -> TrainedData
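The body of train isn’t shown in the post, but a minimal sketch that is consistent with the calcTermFreq and calcTermIdf helpers shown below (the real implementation lives in TfIdf.hs and may differ) is:

-- | Sketch: weight each category's term frequencies by the corpus-wide Idf
train :: TrainingSet -> TrainedData
train (TrainingSet cats) =
  let catTfs = [(cat, calcTermFreq ws) | (cat, ws) <- cats] in
  let docs = snd <$> catTfs in
  let toTfIdf term (Tf tf) = let (Idf idf) = calcTermIdf docs term in TfIdf (tf * idf) in
  TrainedData [(cat, Document $ Map.mapWithKey toTfIdf tfs) | (cat, tfs) <- catTfs]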
Categorising text is handled by the classify function. Given a collection of words it returns the best matching category, if one was found. classifyDetail returns all possible matches, sorted best match first. Both functions use categoriseWords to do the actual classification.
-- | Classify a line of text and try get the best matching category
classify :: Args.Options -> TrainedData -> Text -> Maybe (Category, Double) -- In Classify.hs
-- | Classify a line of text and get all matching categories, best first
classifyDetail :: TrainedData -> Text -> [(Category, Double)] -- In Classify.hs
categoriseWords :: TrainedData -> [Text] -> [(Category, Double)] -- in TfIdf.hs
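categoriseWords itself isn’t listed in the post either; a sketch of what it does, reusing the helpers below and compareToCategory from further down (the real implementation is in TfIdf.hs), might look like:

-- | Sketch: Tf-Idf weight the search terms, score every category, best first
-- | (needs: sortOn from Data.List, Down from Data.Ord)
categoriseWords :: TrainedData -> [Text] -> [(Category, Double)]
categoriseWords (TrainedData cats) ws =
  let catDocs = [doc | (_, Document doc) <- cats] in
  let searchTf = calcTermFreq ws in
  let toTfIdf term (Tf tf) = let (Idf idf) = calcTermIdf catDocs term in TfIdf (tf * idf) in
  let searchTfIdf = Map.toList $ Map.mapWithKey toTfIdf searchTf in
  sortOn (Down . snd) (compareToCategory searchTfIdf <$> cats)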
To calculate the Tf and the Idf values the following two functions are used
-- | Calculate the term frequency for a collection of words
-- | Tf = occurrences of the term / number of terms in the document
calcTermFreq :: [Text] -> Map Term Tf
calcTermFreq terms =
  -- | Map of term to number of occurrences
  let freq = Map.fromListWith (+) [(Term t, 1) | t <- terms] in
  -- | Document of Term to freq. Tf = occurrence count / terms in doc
  (\d -> Tf $ d / fromIntegral (length terms)) <$> freq
-- | Calculate the term's inverse document frequency
-- | Idf = log ((number of documents + 1) / (documents containing the term + 1))
-- | The + 1 in the denominator avoids a divide by zero
calcTermIdf :: [Map Term a] -> Term -> Idf
calcTermIdf termToTfs term =
  let docsWithTerm = filter identity (Map.member term <$> termToTfs) in
  Idf $ log ((fromIntegral (length termToTfs) + 1) / (fromIntegral (length docsWithTerm) + 1))
Notice that there is no need for IO at all in the TfIdf module. It is given a loaded training set and cleaned text to classify.
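That purity also makes the helpers easy to poke at from the REPL. As a quick illustration (a hypothetical GHCi session with OverloadedStrings enabled; output formatting approximate):

λ> calcTermFreq ["cat", "sat", "cat"]
fromList [(Term "cat",Tf 0.6666666666666666),(Term "sat",Tf 0.3333333333333333)]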
The classification is then just a matter of finding the category with the highest similarity score:
-- | Calculate how well terms matches categories
compareToCategory :: [(Term,TfIdf)] -> (Category, Document) -> (Category, Double)
compareToCategory searchTfIdf (cat, Document catMap) =
let catList = Map.toList catMap in
-- | common words in the category and the search text
let common = Lst.intersectBy sameTerm catList searchTfIdf in
let commonV = sum $ valFromTfIdf . snd <$> common in
-- | Sum of all the TfIdf values
let allV = sum (valFromTfIdf . snd <$> searchTfIdf) + sum (valFromTfIdf . snd <$> catList) in
-- | Similarity = ((common a) + (common b)) / (sum all tfIdf)
(cat, (commonV * 2) / allV)
Using txtcls
Building
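Assuming a standard stack setup (the project uses stack, as the install note below suggests), build with:

stack build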
Installing
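Again with stack:

stack install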
This will install txtcls into your local stack bin folder.
Usage Instructions
txtcls - Text Classifier. Version 0.1.2
Usage: txtcls --train TEXT [--input TEXT] [--parser TEXT] [--popts TEXT]
[--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Optional name of text cleaner - see docs
Usage examples
The examples folder contains scripts showing how txtcls can be used. The files are
- cleanText.sed - sed script for cleaning the words
- skipLines.awk - awk script for skipping lines in the input CSV
- egLines.txt - example of data where each line is the data
- egCsv.csv - example of data in csv
- egCsvWithHeader.csv - example of data in CSV with a header text
- demoCsv.sh - run the example on egCsv.csv
- demoLines.sh - run the example on egLines.txt
- demoDetail.sh - run the example on egLines.txt using the detail output
- demoCsvWithHeader.sh - run the example on egCsvWithHeader.csv
- demoDetailInteractive.sh - run the detail parser interactively, read a line from stdin and write to stdout
- trainingData/cs.txt - training data for the cs category
- trainingData/haskell.txt - training data for the haskell category
Lines
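demoLines.sh runs the lines example; reconstructed from the options explained below, the command is along the lines of:

txtcls --train ./trainingData --input egLines.txt --parser lines --clean ./cleanText.sed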
Where
- --train ./trainingData - the path to the folder with the training data
- --input egLines.txt - the data source to classify
- --parser lines - the parser to use
- --clean ./cleanText.sed - the external process or script used to clean the text
Detail
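demoDetail.sh runs the same example with the detail parser (again a reconstruction from the options below):

txtcls --train ./trainingData --input egLines.txt --parser detail --clean ./cleanText.sed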
Where
- --parser detail - the parser to use
CSV
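demoCsv.sh classifies a column of egCsv.csv and renders the output as a table (a reconstruction from the options below):

txtcls --train ./trainingData --input egCsv.csv --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t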
Where
- --popts 2 - the column in the CSV data that contains the text to classify
- | column -s , -t - pipes the resulting CSV to column, to display it as a table in the terminal window
CSV with header text
skipLines.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t
Where
- ./skipLines.awk egCsvWithHeader.csv | - uses the awk script to remove 4 lines from the input CSV. Note that there is no --input parameter, so the input is read from stdin (here the output of awk)
Interactive
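demoDetailInteractive.sh runs the detail parser with no --input parameter, so lines are read from stdin and results written to stdout as they are classified (a reconstruction):

txtcls --train ./trainingData --parser detail --clean ./cleanText.sed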
In conclusion
The source code for TextClassify is available and commented. You should hopefully be able to look at it and understand how it was implemented.
The important things to notice are
- How little needs to be in IO
- How to use an external process (like sed) as part of your application’s pipeline
- How to write your application so that it can be part of a larger pipeline (e.g. pipe the results of awk to it)
- Selecting input from stdin or a file
- Tf-Idf is a relatively simple but quite accurate way to classify simple documents