Haskell text classification using Tf-Idf
This is part two in a two-part blog series about Haskell terminal applications. This post shows a simple text classification implementation using techniques from part one.
Text Classification with Tf-Idf
There are many ways to classify documents, ranging from simple to very complex. The algorithm I’m using here is called Tf-Idf, or “term frequency / inverse document frequency”. There are a number of sites that explain how it works in more detail than I will here.
Basically, Tf-Idf counts the number of times a term occurs in a document (term frequency) and combines that with a negative weighting for the number of categories in which the term occurs (inverse document frequency). This means that common words that appear in multiple categories count less towards the final score.
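Concretely, the variant implemented later in this post boils down to the following weights (a summary of the code shown further down, with N the number of documents, here categories, and n_t the number of documents containing term t):

\mathrm{tf}(t,d) = \frac{\mathrm{count}(t,d)}{|d|}
\qquad
\mathrm{idf}(t) = \log\frac{N + 1}{n_t + 1}
\qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)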
There are also multiple ways that Tf-Idf itself can be implemented, e.g. with different weighting schemes for the Tf vs the Idf, or using n-grams (where n > 1). I’m going with a pretty simple implementation, but even so I’ve seen quite accurate results for the classifications I’m doing. I’m primarily using this for classifying short sentences of text, so it has only been tested for simple matching on relatively small documents.
The TextClassification application
You can get the source for TextClassify at https://github.com/andrevdm/TextClassify. The code is reasonably well commented IMO, so I won’t go into too much detail here on every line of code.
Below I’ll discuss some implementation details not covered by the code comments.
Using the application
- The user sets up a directory of text files, one file per category.
- These files contain the text that each category should match against.
- Since, in this implementation, I’m not using n-grams, each file is treated as a “bag of words” and newlines etc. are ignored.
- Given the set of categories (the training set), the user then provides an input file (or pipes the text via stdin) containing the text to be matched.
- The data can be provided as plain text or as a CSV.
- The application will ‘clean’ the input data and classify it.
- The results are written to stdout and can be piped to a file if required.
Using sed, awk and column
There is a large ecosystem of existing terminal applications, so it often makes sense to reuse their functionality, and to write your own terminal applications so that they too can be reused.
Removing lines with awk
The CSV files I work with have a header that needs to be removed. Here is an awk script (removePrefix.awk) to do that.
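The script itself isn’t reproduced in this post; a minimal sketch that does the job, assuming (as the example later in the post does) that the header spans the first 4 lines, would be:

#!/usr/bin/awk -f
# Print every line after the first 4 header lines
NR > 4 { print }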
This script can be used to pre-process the CSV file
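For example (a sketch, reusing the example files that appear later in the post):

./removePrefix.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2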
Cleaning text with sed
The higher the quality of the input data to the classification algorithm, the better the results will be. A few regular expressions can clean up the input text nicely. Here is a sed script that does this:
#!/bin/sed -uf
# Remove 'c*' markers from the text
s/c\*/ /gi
# Remove month abbreviations (common words that carry no signal)
s/jan\|feb\|mar\|apr\|may\|jun\|jul\|aug\|sep\|oct\|nov\|dec/ /gi
# Strip trailing spaces
s/ \+$//gi
# Replace brackets, punctuation and other special characters with spaces
s/\[\(\)!\-\/*\\\]/ /g
s/[\.*\/\(\)_,\]/ /g
# Replace hyphens and tabs with spaces
s/-/ /g
s/\t/ /g
# Collapse runs of spaces, then strip trailing and leading spaces
s/ \+/ /g
s/ \+$//gi
s/^ \+//gi
This sed script removes some common words (the month abbreviations), removes special characters, and collapses multiple spaces. You can customise this, or create one script per type of input, as required. The -u parameter is important as it disables buffering, which would otherwise interfere with line-by-line processing.
The TextClassification application will start sed and keep it running. A line of input data will be passed to it and the result read back a line at a time.
Displaying CSV results with column
column can be used to show CSV data as an aligned table in the terminal. I’ll use this later to show the results of the classification.
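For example (assuming a results.csv produced by the classifier; -s sets the input separator, -t creates a table):

column -s , -t results.csv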
Command line arguments
See Args.hs
As part one showed, I’m using optparse-generic to parse the command line arguments.
data Arguments = Arguments {train :: Text <?> "Path to training data"
,input :: Maybe Text <?> "Input file to categorise. If missing stdin will be used"
,parser :: Maybe Text <?> "Parser type, defaults to lines. Options are lines/detail/csv"
,popts :: Maybe Text <?> "Parser options"
,clean :: Maybe Text <?> "Optional name of text cleaner - see docs"
} deriving (Generic, Show)
instance ParseRecord Arguments
This is the resulting help text from these arguments
Usage: txtcls --train TEXT [--input TEXT] [--parser TEXT] [--popts TEXT]
              [--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Optional name of text cleaner - see docs
These arguments are then interpreted and stored in the Options type
data Options = Options {trainingPath :: Text
,parserType :: Text
,parserOptions :: Maybe Text
,txtCleaner :: Text -> IO Text
,hin :: Handle
,hout :: Handle
}
Input handle
hin is set to the handle of the input stream: stdin if no --input parameter is present, else the handle for the file.
hin_ <- case unHelpful $ input args of
Just t ->
openFile (Txt.unpack t) ReadMode
Nothing ->
pure stdin
Text cleaning with the cleaning script
Above I showed a sed script that can be used to clean the input text. However, because this application can take a CSV as input, it can’t simply apply the cleaning to the entire file, or even an entire line; only the text being classified should be cleaned. To do this an instance of sed is started and fed the text to clean one line at a time. (Actually, any app could be used, as long as it reads and writes one line at a time.) The name of the app / script to use is given by the --clean parameter.
-- | Build a 'cleaner'
getCleaner :: Maybe Text -> IO (Text -> IO Text)
getCleaner mcmd =
case mcmd of
-- | The cleaner uses the external process to do the actual cleaning. One line is written to the process's stdin and then a value is read from its stdout
Just cmd -> do
(Just inp, Just outp, _, phandle) <- createProcess (proc (Txt.unpack cmd) []) { std_out = CreatePipe, std_in = CreatePipe }
hSetBuffering outp NoBuffering
hSetBuffering inp LineBuffering
pure $ cleanText inp outp
-- | No external cleaner. Just make the text lower case
Nothing ->
pure $ pure . Txt.toLower
-- | Used by getCleaner to build a curried cleaner function
cleanText :: Handle -> Handle -> Text -> IO Text
cleanText inp outp txt = do
hPutStrLn inp $ Txt.unpack (Txt.toLower txt)
pure . Txt.pack =<< hGetLine outp
The getCleaner function is passed the (optional) name of the cleaner script. If a script was specified then a process is started and a curried cleanText function is returned as the cleaning function. If no script was specified then the returned cleaning function simply performs a toLower on the text.
cleanText writes a line to the input handle for the process and then immediately reads the response line.
Reading the input data
TextClassifier has three parsers
- CSV - one of the columns is the data column
- Lines - each line is the data
- Detail - same as line but additional information is printed for each input line
whileM_ is used to read a line of input at a time. Each line is then passed to the appropriate parser, i.e. CSV, lines or detail.
-- | Read input a line at a time and pass it to the parser
whileM_ (not <$> IO.hIsEOF (Args.hin opts)) $ do
-- | line of data
origChars <- IO.hGetLine $ Args.hin opts
let origLine = Txt.pack origChars
-- | parse the line and get the results to display
parsed <- case parser of
---
Parsing the CSV Data
See ClassifyCsv.hs and Classify.hs
I’m using Cassava to read the CSV data, as well as to create the output CSV. Since I’m not interpreting any of the data apart from the text to be classified, I simply read each row as a vector of Text.
let contents = BL8.pack . Txt.unpack $ line in
let parsed = decode NoHeader contents :: Either [Char] (V.Vector [Text]) in
Given a vector of Text it is simple to get the column containing the text to classify. The parseCsvLine function returns a ParsedLine a, which contains the text to be classified.
newtype RawText = RawText Text deriving (Show)
data ParsedLine a = ParsedLine RawText a deriving (Show)
Remember that each line of data must be cleaned. Rather than having parseCsvLine live in IO, it returns a ParsedLine a. The code in Main then calls the cleaner and passes the resulting CleanedLine a to categoriseCsvLine. This limits the amount of code that needs to be in IO. It also makes the code easier to test (e.g. from the REPL) as the two functions can be tested independently.
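As a sketch, the shape of that flow is roughly the following (these signatures are assumptions for illustration, not copied from the repo; the Int is the column index from --popts and cleanLine is a hypothetical name):

-- | Pure: pull the text to classify out of a CSV row
parseCsvLine :: Int -> Text -> Either Text (ParsedLine [Text])
-- | IO only where unavoidable: run the external cleaner over the raw text
cleanLine :: (Text -> IO Text) -> ParsedLine a -> IO (CleanedLine a)
-- | Pure again: classify the cleaned text against the trained data
categoriseCsvLine :: TrainedData -> CleanedLine a -> Maybe (Category, Double)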
Tf-Idf
Training set
See ClassifyIO.hs
The training set is a directory with a file per category. Each file contains the words for that category. To load the files the loadTrainingSet function is used
newtype Category = Category Text deriving (Show, Eq, Ord)
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
loadTrainingSet :: Args.Options -> FilePath -> IO TrainingSet
All .txt files in the directory are loaded, each becoming one category with its words.
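A minimal sketch of what this amounts to (the real version is in ClassifyIO.hs, takes the Args.Options shown earlier, and also applies the configured cleaner; this simplified version is an assumption for illustration):

-- | Sketch: each .txt file becomes one (Category, words) pair
-- | (needs: listDirectory from System.Directory, (</>) and dropExtension from
-- | System.FilePath, isSuffixOf from Data.List, for from Data.Traversable,
-- | Txt.readFile from Data.Text.IO)
loadTrainingSet :: FilePath -> IO TrainingSet
loadTrainingSet dir = do
  files <- filter (".txt" `isSuffixOf`) <$> listDirectory dir
  cats <- for files $ \f -> do
    ws <- Txt.words . Txt.toLower <$> Txt.readFile (dir </> f)
    pure (Category . Txt.pack $ dropExtension f, ws)
  pure $ TrainingSet cats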
Tf-Idf
See TfIdf.hs
To review the terminology
- Term - a single word
- Category - category name
- Document - a document of terms mapped to their TfIdf value
The train function takes a TrainingSet and creates a TrainedData:
-- | A term is a single word
newtype Term = Term Text deriving (Show, Eq, Ord)
-- | A category name
newtype Category = Category Text deriving (Show, Eq, Ord)
-- | Term frequency value
newtype Tf = Tf Double deriving (Show)
-- | Inverse document frequency value
newtype Idf = Idf Double deriving (Show)
-- | The combined Tf and Idf value
newtype TfIdf = TfIdf Double deriving (Show)
-- | A document is a map of terms to TfIdf
newtype Document = Document (Map Term TfIdf) deriving (Show)
-- | Data making up the training set
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
-- | The trained data, each category linked to a document
data TrainedData = TrainedData [(Category, Document)] deriving (Show)
train :: TrainingSet -> TrainedData
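The body of train isn’t shown in the post, but a minimal sketch that is consistent with the calcTermFreq and calcTermIdf helpers shown below (the real implementation lives in TfIdf.hs and may differ) is:

-- | Sketch: weight each category's term frequencies by the corpus-wide Idf
train :: TrainingSet -> TrainedData
train (TrainingSet cats) =
  let catTfs = [(cat, calcTermFreq ws) | (cat, ws) <- cats] in
  let docs = snd <$> catTfs in
  let toTfIdf term (Tf tf) = let (Idf idf) = calcTermIdf docs term in TfIdf (tf * idf) in
  TrainedData [(cat, Document $ Map.mapWithKey toTfIdf tfs) | (cat, tfs) <- catTfs]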
Categorising text is handled by the classify function. Given a collection of words it returns the best matching category, if one was found. classifyDetail returns all possible matches, sorted best match first. Both functions use categoriseWords to do the actual classification.
-- | Classify a line of text and try get the best matching category
classify :: Args.Options -> TrainedData -> Text -> Maybe (Category, Double) -- In Classify.hs
-- | Classify a line of text and get all matching categories, best first
classifyDetail :: TrainedData -> Text -> [(Category, Double)] -- In Classify.hs
categoriseWords :: TrainedData -> [Text] -> [(Category, Double)] -- in TfIdf.hs
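categoriseWords itself isn’t listed in the post either; a sketch of what it does, reusing the helpers below and compareToCategory from further down (the real implementation is in TfIdf.hs), might look like:

-- | Sketch: Tf-Idf weight the search terms, score every category, best first
-- | (needs: sortOn from Data.List, Down from Data.Ord)
categoriseWords :: TrainedData -> [Text] -> [(Category, Double)]
categoriseWords (TrainedData cats) ws =
  let catDocs = [doc | (_, Document doc) <- cats] in
  let searchTf = calcTermFreq ws in
  let toTfIdf term (Tf tf) = let (Idf idf) = calcTermIdf catDocs term in TfIdf (tf * idf) in
  let searchTfIdf = Map.toList $ Map.mapWithKey toTfIdf searchTf in
  sortOn (Down . snd) (compareToCategory searchTfIdf <$> cats)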
To calculate the Tf and the Idf values the following two functions are used
-- | Calculate the term frequency for a collection of words
-- | Tf = occurrences of the term / number of terms in the document
calcTermFreq :: [Text] -> Map Term Tf
calcTermFreq terms =
  -- | Map of term to number of occurrences
  let freq = Map.fromListWith (+) [(Term t, 1) | t <- terms] in
  -- | Document of Term to freq. Tf = occurrence count / terms in doc
  (\d -> Tf $ d / fromIntegral (length terms)) <$> freq
-- | Calculate the term's inverse document frequency
-- | Idf = log ((number of documents + 1) / (documents containing the term + 1))
-- | The + 1 in the denominator avoids a divide by zero
calcTermIdf :: [Map Term a] -> Term -> Idf
calcTermIdf termToTfs term =
  let docsWithTerm = filter identity (Map.member term <$> termToTfs) in
  Idf $ log ((fromIntegral (length termToTfs) + 1) / (fromIntegral (length docsWithTerm) + 1))
Notice that there is no need for IO at all in the TfIdf module. It is given a loaded training set and cleaned text to classify.
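That purity also makes the helpers easy to poke at from the REPL. As a quick illustration (a hypothetical GHCi session with OverloadedStrings enabled; output formatting approximate):

λ> calcTermFreq ["cat", "sat", "cat"]
fromList [(Term "cat",Tf 0.6666666666666666),(Term "sat",Tf 0.3333333333333333)]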
The classification is then just a matter of finding the category with the highest similarity score:
-- | Calculate how well terms matches categories
compareToCategory :: [(Term,TfIdf)] -> (Category, Document) -> (Category, Double)
compareToCategory searchTfIdf (cat, Document catMap) =
let catList = Map.toList catMap in
-- | common words in the category and the search text
let common = Lst.intersectBy sameTerm catList searchTfIdf in
let commonV = sum $ valFromTfIdf . snd <$> common in
-- | Sum of all the TfIdf values
let allV = sum (valFromTfIdf . snd <$> searchTfIdf) + sum (valFromTfIdf . snd <$> catList) in
-- | Similarity = ((common a) + (common b)) / (sum all tfIdf)
(cat, (commonV * 2) / allV)
Using txtcls
Building
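Assuming a standard stack setup (the project uses stack, as the install note below suggests), build with:

stack build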
Installing
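Again with stack:

stack install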
This will install txtcls into your local stack bin folder.
Usage Instructions
txtcls - Text Classifier. Version 0.1.2
Usage: txtcls --train TEXT [--input TEXT] [--parser TEXT] [--popts TEXT]
[--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Optional name of text cleaner - see docs
Usage examples
The examples folder contains scripts showing how txtcls can be used. The files are
- cleanText.sed - sed script for cleaning the words
- skipLines.awk - awk script for skipping lines in the input CSV
- egLines.txt - example of data where each line is the data
- egCsv.csv - example of data in csv
- egCsvWithHeader.csv - example of data in CSV with a header text
- demoCsv.sh - run the example on egCsv.csv
- demoLines.sh - run the example on egLines.txt
- demoDetail.sh - run the example on egLines.txt using the detail output
- demoCsvWithHeader.sh - run the example on egCsvWithHeader.csv
- demoDetailInteractive.sh - run the detail parser interactively, read a line from stdin and write to stdout
- trainingData/cs.txt - training data for the cs category
- trainingData/haskell.txt - training data for the haskell category
Lines
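demoLines.sh runs the lines example; reconstructed from the options explained below, the command is along the lines of:

txtcls --train ./trainingData --input egLines.txt --parser lines --clean ./cleanText.sed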
Where
- --train ./trainingData - the path to the folder with the training data
- --input egLines.txt - the data source to classify
- --parser lines - the parser to use
- --clean ./cleanText.sed - the external process or script used to clean the text
Detail
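demoDetail.sh runs the same example with the detail parser (again a reconstruction from the options below):

txtcls --train ./trainingData --input egLines.txt --parser detail --clean ./cleanText.sed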
Where
- --parser detail - the parser to use
CSV
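demoCsv.sh classifies a column of egCsv.csv and renders the output as a table (a reconstruction from the options below):

txtcls --train ./trainingData --input egCsv.csv --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t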
Where
- --popts 2 - the column in the CSV data that contains the text to classify
- | column -s , -t - pipes the resulting CSV to column, to display it as a table in the terminal window
CSV with header text
skipLines.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t
Where
- ./skipLines.awk egCsvWithHeader.csv | - uses the awk script to remove 4 lines from the input CSV. Note that there is no --input parameter, so the input is read from stdin (here the output of awk)
Interactive
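demoDetailInteractive.sh runs the detail parser with no --input parameter, so lines are read from stdin and results written to stdout as they are classified (a reconstruction):

txtcls --train ./trainingData --parser detail --clean ./cleanText.sed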
In conclusion
The source code for TextClassify is available and commented. You should hopefully be able to look at it and understand how it was implemented.
The important things to notice are
- How little needs to be in IO
- How to use an external process (like sed) as part of your application’s pipeline
- How to write your application so that it can be part of a larger pipeline (e.g. pipe the results of awk to it)
- Selecting input from stdin or a file
- Tf-Idf is a relatively simple but quite accurate way to classify simple documents