HyraxAbif is a Haskell package that I created at HyraxBio to test our bioinformatics software pipeline. We have released the HyraxAbif package as open source (BSD3 licence) in the hope that it will be useful to others.
In this post I’ll show how the package can be used as a standalone tool, as well as look at how the Haskell code works. Skip directly to the usage / ABIF format / Haskell sections if that is all you are interested in reading about.
See the LICENCE file. Please note that this package is distributed without warranties or conditions of any kind.
Part of what we do at HyraxBio is analyze DNA sequences to determine drug resistance for various pathogens. The first step in this process is getting DNA data from a sequencing machine. The mechanics of sequencing are pretty complex. Fortunately for us we start with the data already sequenced, which means that all the “wet-work” is done and we can analyse and interpret the results as data, i.e. bioinformatics.
DNA is made up of four bases: A, C, G and T (adenine, cytosine, guanine and thymine respectively). A sequencing machine takes DNA strands and determines the sequence of bases present. There is a fair amount of complexity here. You can’t simply grab a strand of DNA and read it in its entirety, and certainly not with 100% accuracy (because biology). Rather, the DNA is amplified and a consensus of reads for each position in the DNA strand is calculated. There can be both (many) variations of the same virus (mutations) as well as errors in the reading process itself. So each position is called based on which of the bases have the strongest signal at that position.
ABIF files are generated by these sequencing machines using chemical reactions that release a tiny amount of coloured light when a reagent reacts with one of the bases. Each base results in a different colour, which enables the machine to detect which base is present in the DNA. The details behind this are fascinating; see e.g. Wikipedia for more detail if you are interested.
Below is a section of a chromatogram showing a wave for each of the four bases.
This is a perfect chromatogram; in reality there are often multiple possibilities per position, at different intensities. In the image below you can see that the second and third positions have more than one possible base; this is called a mix.
Even that is an unnaturally clean chromatogram. In reality they often look more like this, and it takes complicated base calling software and/or trained lab workers (or overworked PhD students) to decide which base is actually represented.
The chromatogram data in ABIF format is fed into base calling software like PHRED and/or Recall, which analyzes the chromatogram and decides which base to call per position. The result is a string of bases (A/C/G/T).
Testing a full bioinformatics pipeline is critical to ensuring that every step works correctly and results in high quality outputs. The problem is that we could find no practical existing way to generate our own chromatograms (ABIFs). It is possible to use a set of existing ABIF files, but this has two major problems.
HyraxAbif was created to resolve this. It lets us easily create chromatograms from a DNA sequence and thus do all the testing we need.
The image above has a bit more detail: it shows the bases, including ambiguous ones, as well as the amino acids.
As discussed above, there can be multiple possibilities per position. The IUPAC ambiguity codes (see Wikipedia or bioinformatics.org) are a way of encoding the ambiguity in a single letter. For example, a Y IUPAC code means that the base is either a C or a T.
Each group of three nucleotide bases is called a codon and encodes a single amino acid (in coding regions…). The chromatogram also shows the amino acid per codon. This is not important for this post, but may help if you see pictures of other chromatograms, as the amino acids are usually shown.
HyraxAbif’s primary goal is generating chromatograms from an input DNA sequence. It can be installed from Hackage with cabal, with stack, or by cloning the git repo.
The input for generating a chromatogram is a simplified FASTA format file. These files look like this
> 1
ACTG
The first line is the weight (more on this later); the second is the DNA sequence. Given this input file you would run
hyraxAbif-exe gen inputDir/ outputDir/
and you would end up with an ABIF file and a chromatogram like this
For many scenarios this is all you’ll need. You create a folder of FASTA input files and get a folder of generated corresponding ABIF files.
You can also generate chromatograms with mixes: each FASTA file can contain multiple reads, and the line before each read gives the weight for that read.
> weight
read
> weight
read
For example
> 0.5
ACG
> 0.3
AAAA
> 1
__AC
results in the following weights per nucleotide per position
position | A | C | G | T |
---|---|---|---|---|
0 | 0.5 + 0.3 = 0.8 | 0 | 0 | 0 |
1 | 0.3 | 0.5 | 0 | 0 |
2 | 0.3 + 1 = 1.0 | 0 | 0.5 | 0 |
3 | 0.3 | 1 | 0 | 0 |
Note that 0.3 + 1 is capped at 1.0, because the maximum weight is 1.0
And this sample
> 1
ACAG
> 0.3
_GT
> 0.2
_G
results in this chromatogram
The IUPAC codes here are

- S = G or C
- W = A or T
A weighted FASTA can represent a reverse read. To do this, add an R suffix to the weight. The data you enter should be entered as if it were a forward read; it will be complemented and reversed before being written to the ABIF
> 0.9R
ACAG
which results in the sequence CTGT being written (the complement of ACAG is TGTC, which is then reversed)
You can also dump an existing ABIF file
hyraxAbif-exe dump sample.abif
This prints two views of the file. First a detail view, partially shown below,
and then a summary
For certain types of data (e.g. strings) the parsed value is displayed.
The ABIF format is documented here. As you can see from the spec the ABIF format was modeled after the TIFF format. This means that there is a directory of entries and each entry has a data type.
The spec is quite thorough and explains the layout well; if you want to understand the format it is your best starting point. The spec, however, only goes into detail on the ABIF structure. It does not say much about how the chromatogram data itself is stored, so I’ll cover that here.
In the file, after the ABIF header and version number, comes the root directory entry. This entry points to the first of the data directory entries, which can be located anywhere else in the file.
Each directory entry has an offset to the location of its data in the file, the data size, the element size and the number of elements. See the spec or the discussion of the Haskell code below for more details on each field.
Note that for data with a size of four bytes or less, the data is stored in the offset field itself.
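For orientation, the fields of a single directory entry (as read and written by the Haskell code later in this post) add up to a fixed 28 bytes:

tag name      4 bytes  (e.g. "DATA")
tag number    4 bytes
type code     2 bytes
element size  2 bytes
num elements  4 bytes
data size     4 bytes
data offset   4 bytes  (holds the data itself when the data size is four bytes or less)
reserved      4 bytes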
Towards the end of the spec are examples of the layout of ABIF files for a few sequencing machines. Let’s take a look at some of these for the 3500 layout.
Type | Tag Number | Contains |
---|---|---|
DATA | 1 - 4 | The raw data per base (channel). The order of the bases is specified by the FWO_ entry |
DATA | 5 | Short Array holding measured volts/10 (EP voltage) during run |
DATA | 6 | Short Array holding measured milliAmps trace (EP current) during run |
DATA | 7 | Short Array holding measured milliWatts trace (Laser EP Power) during run |
DATA | 8 | Short Array holding measured oven Temperature (polymer temperature) trace during run |
DATA | 9 - 12 | Short Array holding analyzed color data |
This is a pretty intimidating set of values that we thought we would have to generate from a FASTA input, just for the traces. Fortunately, through trial and error, we found that only a small subset of the entries is required by the base calling software we were using (PHRED + Recall). All we needed to generate were DATA sections 9 to 12, i.e. one per base: the analyzed colour data.
The four DATA sections we need to generate (entries 9 through 12) each contain an array of shorts. Each short represents the intensity of the light for that base at a given point, so each DATA section holds the wave of the light intensity over time. The FWO_ directory entry specifies which base each DATA entry represents. We always generate in the 3500 format, so the order is 9=G, 10=A, 11=T, 12=C.
The traces above are just the waveforms; the PLOC entry specifies the location of each peak. This is the location, across all four DATA entries, where the peak of each wave should be found. There is a single PLOC entry covering all four DATA entries.
The other required entries are easy to generate; they are things like the base order (FWO_), file name (PDMF) and called bases (PBAS). See the Haskell code discussion below to see them all.
Given a base, we then need to create a wave and a single peak location entry. The data we use for each wave is this array of shorts, [0, 0, 128, 512, 1024, 1024, 512, 128, 0, 0], which creates a wave like this.
The peak is the middle of the wave, nice and simple.
Again, see the code discussion for more details on this.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
module Examples.ReadAb1 where
import Protolude
import qualified Hyrax.Abif.Read as H
-- | Read and print a ABIF file
readAbif :: IO ()
readAbif = do
abif' <- H.readAbif "example.ab1"
case abif' of
Left e -> putStrLn $ "error reading ABIF: " <> e
Right abif ->
-- Print after removing the data, to make it readable
print $ H.clearAbif abif
- readAbif tries to parse an ABIF file; it returns an Either Text Abif
- clearAbif removes all the raw data. If you don’t do this then all the massive byte arrays will get printed too
- print displays the result

The functions are all commented and visible on hackage.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
module Examples.AddComment where
import Protolude
import qualified Hyrax.Abif.Read as H
import qualified Hyrax.Abif.Write as H
-- | Add a comment to an existing AB1 file
addComment :: IO ()
addComment = do
abif' <- H.readAbif "example.ab1"
case abif' of
Left e -> putStrLn $ "error reading ABIF: " <> e
Right abif -> do
let modified = H.addDirectory abif $ H.mkComment "new comment"
H.writeAbif "example.modified.ab1" modified
Only the Right case differs from the previous example:

- addDirectory is called to add a new comment directory entry, which is created by mkComment
- writeAbif writes the updated file to disk

The Examples directory contains more examples. You can also look at the Generate and Main modules to see how the code is used.
Data.Binary is used to read and write the raw bytes in the ABIF files.
Below is an example of writing two Int8 values and an Int32 value.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Protolude
import qualified Data.Binary.Put as B
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
let outData = B.runPut testWrite
BSL.writeFile "test.dat" outData
where
testWrite :: B.PutM ()
testWrite = do
B.putInt8 1
B.putInt8 2
B.putInt32be 3
runPut “runs” the PutM monad. testWrite can then simply call the put* functions to write the data in whatever format is required.
This creates a file that looks like this
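For reference, the six bytes written are the two Int8 values followed by the big-endian Int32:

01 02 00 00 00 03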
Reading the data from the file looks like this
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Protolude
import qualified Data.Binary.Get as B
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
inData <- BSL.readFile "test.dat"
let read = B.runGet testRead inData
print read
where
testRead :: B.Get (Int8, Int8, Int32)
testRead = do
a <- B.getInt8
b <- B.getInt8
c <- B.getInt32be
pure (a, b, c)
runGet is given the ByteString from the file, and testRead gets the values in the appropriate format. Running this prints (1,2,3).
If you prefer applicatives, you could instead have written testRead and testWrite as shown in the sketch below.
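A minimal sketch of the applicative versions, equivalent to the monadic code above (each function lives in its respective example module, with its own qualified import of Data.Binary.Get or Data.Binary.Put as B):

testRead :: B.Get (Int8, Int8, Int32)
testRead = (,,) <$> B.getInt8
                <*> B.getInt8
                <*> B.getInt32be

testWrite :: B.PutM ()
testWrite = B.putInt8 1
         *> B.putInt8 2
         *> B.putInt32be 3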
Hyrax.Abif contains the core types for the package
module Hyrax.Abif
( Abif (..)
, Header (..)
, Directory (..)
, ElemType (..)
, getElemType
, describeElemType
) where
import Protolude
import qualified Data.ByteString.Lazy as BSL
-- | A single ABIF
data Abif = Abif { aHeader :: !Header
, aRootDir :: !Directory
, aDirs :: ![Directory]
} deriving (Show, Eq)
-- | ABIF header
data Header = Header { hName :: !Text
, hVersion :: !Int
} deriving (Show, Eq)
-- | ABIF directory entry.
-- The 'dData' field contains the data for the entry
data Directory = Directory { dTagName :: !Text -- ^ Tag name
, dTagNum :: !Int -- ^ Tag number, see e.g. how DATA entries use this
, dElemType :: !ElemType -- ^ Type of an element
, dElemTypeCode :: !Int -- ^ Integer value of 'dElemType'
, dElemTypeDesc :: !Text -- ^ Description of 'dElemType'
, dElemSize :: !Int -- ^ Size in bytes of each element
, dElemNum :: !Int -- ^ Number of elements in the data. See the spec per data type. E.g. for a string this is the number of characters
, dDataSize :: !Int -- ^ Number of bytes in the data
, dDataOffset :: !Int -- ^ Offset of this directory entry's data in the file. For data that is four
-- bytes or less, the data itself is stored in this field.
-- This value will be recalculated when writing an ABIF so you do not need to manually set it.
, dData :: !BSL.ByteString -- ^ The entry's data
, dDataDebug :: ![Text] -- ^ Optional debug data, populated by 'Hyrax.Abif.Read.getDebug' when an ABIF is parsed
} deriving (Show, Eq)
These three types make up most of what we need to represent an ABIF. A few things to notice:

- The Abif type points to the array of Directory entries
- dElemTypeCode is the integer value read from the file (see the spec for the codes). dElemType and dElemTypeDesc are values interpreted from this
- dDataSize and dDataOffset are read from the file; the offset is recalculated automatically when writing (see the Hyrax.Abif.Write section below)
- dData is the actual raw data read from the file, or the data to be written to an ABIF file
- dDataDebug is populated while reading the file and used during dumping to give human-readable info about the file being inspected

The remaining code is the definition of ElemType and functions for interpreting the raw element type integer value. Note that the spec defines a number of unsupported data types; these are included here.
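As a rough sketch of its shape, abbreviated to the constructors that appear in this post (the real definition in the package covers all the spec's types):

data ElemType
  = ElemChar     -- e.g. the FWO_ base order entry
  | ElemShort    -- e.g. the DATA traces and the LANE entry
  | ElemPString  -- length-prefixed string
  | ElemCString  -- null-terminated string
  -- ... plus the remaining types from the spec, including the unsupported ones
  deriving (Show, Eq)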
readAbif calls getAbif to parse the data:
-- | Read and parse an AB1 file
readAbif :: FilePath -> IO (Either Text Abif)
readAbif path = getAbif <$> BSL.readFile path
getAbif starts the parsing of the Abif data structure:
-- | Parse an AB1 from a 'ByteString'
getAbif :: BSL.ByteString -> Either Text Abif
getAbif bs = do
(header, rootDir) <- case B.runGetOrFail (getRoot bs) bs of
Right (_, _, x) -> pure x
Left (_, _, e) -> Left ("Error reading root: " <> Txt.pack e)
let dirBytes = BSL.drop (fromIntegral $ dDataOffset rootDir) bs
ds <- case B.runGetOrFail (getDirectories bs [] $ dElemNum rootDir) dirBytes of
Right (_, _, x) -> pure x
Left (_, _, e) -> Left ("Error reading "
<> show (dElemNum rootDir)
<> " directories (at " <> show (dDataOffset rootDir) <> "): "
<> Txt.pack e
)
pure $ Abif header rootDir ds
A few things to note:

- The Either monad is used, so any Left value will short-circuit out of the function and be returned immediately
- Data.Binary.runGetOrFail returns a Left if the get operation fails. Much better than getting an exception, as you would with runGet
- getDirectories reads the number of directory entries specified by the root entry
- getRoot gets the header and root directory (see next section)
-- | Parse the root ('Header' and 'Directory')
getRoot :: BSL.ByteString -> B.Get (Header, Directory)
getRoot bs = do
h <- getHeader
rd <- getDirectory bs
pure (h, rd)
getHeader gets the “ABIF” magic string and version number, similar to the read example above.
-- | Parse the ABIF 'Header'
getHeader :: B.Get Header
getHeader =
Header <$> (TxtE.decodeUtf8 <$> B.getByteString 4)
<*> (fromIntegral <$> B.getInt16be)
getDirectories is given the number of directories to read. It tries to read a single Directory by calling getDirectory and then recursively calls itself until done.
-- | Parse all the directory entries
getDirectories :: BSL.ByteString -> [Directory] -> Int -> B.Get [Directory]
getDirectories _ acc 0 = pure acc
getDirectories bs acc more = do
d <- getDirectory bs
B.skip 4 -- Skip the reserved field
getDirectories bs (acc <> [d]) (more - 1)
Reading an individual directory is done by getDirectory. Note that:

- the whole file’s ByteString is passed in, so that each entry’s data can be read from its offset
- the parsed entries, along with their data, end up in the final Abif value

-- | Parse a single 'Directory' entry and read its data
getDirectory :: BSL.ByteString -> B.Get Directory
getDirectory bs = do
tagName <- TxtE.decodeUtf8 <$> B.getByteString 4
tagNum <- fromIntegral <$> B.getInt32be
typeCode <- fromIntegral <$> B.getInt16be
elemSize <- fromIntegral <$> B.getInt16be
elemNum <- fromIntegral <$> B.getInt32be
dataSize <- fromIntegral <$> B.getInt32be
offsetDataBytes <- B.lookAhead $ B.getLazyByteString 4
dataOffset <- fromIntegral <$> B.getInt32be
-- Read the data
-- Data that is 4 bytes or less is stored in the offset field
dataBytes <- if dataSize <= 4
then pure $ BSL.take (fromIntegral dataSize) offsetDataBytes
else case B.runGetOrFail (B.getLazyByteString $ fromIntegral dataSize) $ BSL.drop (fromIntegral dataOffset) bs of
Right (_, _, x) -> pure x
Left (_, _, e) -> fail $ "error reading data ("
<> show dataSize
<> " bytes starting at "
<> show dataOffset
<> ") for directory entry '"
<> Txt.unpack tagName
<> "': "
<> e
let (elemType, elemCode) = describeElemType typeCode
pure Directory { dTagName = tagName
, dTagNum = tagNum
, dElemTypeCode = typeCode
, dElemTypeDesc = elemCode
, dElemType = elemType
, dElemSize = elemSize
, dElemNum = elemNum
, dDataSize = dataSize
, dDataOffset = dataOffset
, dData = dataBytes
, dDataDebug = []
}
- A PString is prefixed with an Int8 size, so read the size and then the string
- A CString is null terminated, so read all the data for the length of the string (from the directory entry) and drop the final null character

-- | Parse a 'ElemPString'
getPString :: B.Get Text
getPString = do
sz <- fromIntegral <$> B.getInt8
TxtE.decodeUtf8 <$> B.label ("PString length=" <> show sz <> ".") (B.getByteString sz)
-- | Parse a 'ElemCString'
getCString :: Int -> B.Get Text
getCString sz =
TxtE.decodeUtf8 <$> B.getByteString (sz - 1)
getDebug adds human readable information for some types, e.g. for strings. This lets us print the Abif structure to the console with some useful data. Only a portion of getDebug is shown here as it is a little repetitive; however it is a good function to look at for more examples of reading the raw data.
-- | Populate the directory entry with debug data (into 'dDataDebug').
-- This is done for selected types only, e.g. for strings so that printing the structure will display
-- readable/meaningful info
getDebug :: Directory -> Directory
getDebug d =
let bsAtOffset = dData d in
case dElemType d of
-- Strings have a count = number of chars, not number of "strings"
ElemPString ->
if dDataSize d <= 4
then d { dDataDebug = [TxtE.decodeUtf8 . BSL.toStrict . BSL.drop 1 . BSL.take (fromIntegral $ dDataSize d) $ dData d] }
else d { dDataDebug = [B.runGet (lbl getPString) bsAtOffset] }
-- Strings have a count = number of chars, not number of "strings"
ElemCString ->
if dDataSize d <= 4
then d { dDataDebug = [TxtE.decodeUtf8 . BSL.toStrict . BSL.take (fromIntegral $ dDataSize d - 1) $ dData d] }
else d { dDataDebug = [B.runGet (lbl . getCString $ dDataSize d) bsAtOffset] }
When printing the structure it does not make sense to print all the raw data too, so the clear* functions remove it before printing:
-- | Removes all data from the ABIF's directories
clearAbif :: Abif -> Abif
clearAbif a = a { aRootDir = clear $ aRootDir a
, aDirs = clear <$> aDirs a
}
-- | Removes all data from a directory entry. This will probably only be useful when trying to show an ABIF value
clear :: Directory -> Directory
clear d = d { dData = "" }
As with the read functions, there are two write functions: one writing to a ByteString and one writing to a file.
-- | Write an 'Abif' to a 'ByteString'
createAbifBytes :: Abif -> BSL.ByteString
createAbifBytes ab1 =
B.runPut (putAbif ab1)
-- | Write an 'Abif' to a file
writeAbif :: FilePath -> Abif -> IO ()
writeAbif destPath ab1 = do
let b = createAbifBytes ab1
BS.writeFile destPath $ BSL.toStrict b
Writing the ABIF data is relatively simple since each directory entry already contains the raw ByteString data. putAbif does, however, need to recalculate the data sizes and offsets:
-- | Create the 'Abif' using "Data.Binary"
putAbif :: Abif -> B.Put
putAbif (Abif header root dirs) = do
-- Total data size
let dataSize = foldl' (\acc i -> if i > 4 then acc + i else acc) 0 $ dDataSize <$> dirs
-- Write the header
putHeader header
-- Data starts at offset 128
let startDataOffset = 128
-- Write the root directory entry
putDirectory (startDataOffset + dataSize) $ root { dDataSize = 28 * length dirs
, dElemNum = length dirs
}
-- Write 47 zero Int16 values as required by the spec
traverse_ B.putInt16be $ replicate 47 0
-- Write the data, for all data larger than four bytes. Data four bytes or less is stored
-- in the offset field
traverse_ (B.putLazyByteString . dData) $ filter (\d -> dDataSize d > 4) dirs
-- Write the directory entries.
foldM_ writeDir startDataOffset dirs
where
writeDir offset dir = do
putDirectory offset dir
pure $ if dDataSize dir > 4
then offset + dDataSize dir
else offset
Writing the header is pretty simple: write the magic string and the version number.
-- | Write a 'Header'
putHeader :: Header -> B.Put
putHeader h = do
putTextStr $ hName h
B.putInt16be . fromIntegral $ hVersion h
There are two functions for writing Text values:
-- | Write 'Text'
putTextStr :: Text -> B.Put
putTextStr t = B.putByteString $ TxtE.encodeUtf8 t
-- | Write a 'ElemPString'
putPStr :: Text -> B.Put
putPStr t = do
B.putInt8 . fromIntegral $ Txt.length t
B.putByteString $ TxtE.encodeUtf8 t
When writing a directory entry there are a few things to take care of: the tag name is padded (or truncated) to exactly four characters, and data of four bytes or less is written into the offset field, zero padded.
-- | Write a 'Directory'
putDirectory :: Int -> Directory -> B.Put
putDirectory dirOffset d = do
let name = Txt.justifyLeft 4 ' ' . Txt.take 4 $ dTagName d
putTextStr name
B.putInt32be . fromIntegral $ dTagNum d
B.putInt16be . fromIntegral $ dElemTypeCode d
B.putInt16be . fromIntegral $ dElemSize d
B.putInt32be . fromIntegral $ dElemNum d
B.putInt32be . fromIntegral $ dDataSize d
-- data with a size of four bytes or less is stored in the offset field
if dDataSize d > 4
then B.putInt32be . fromIntegral $ dirOffset
else B.putLazyByteString . BSL.take 4 $ dData d <> "\0\0\0\0"
B.putInt32be 0 -- reserved / datahandle
The mk* set of functions helps in constructing valid directory entries. Below are two of these functions:

- mkBaseOrder, which creates a FWO_ Directory entry
- mkLane, which creates a LANE Directory entry

As you can see, these functions take appropriately typed values and produce a valid directory entry for the data and directory type. (See Hyrax.Abif.Generate to see them in use.)
-- | Used to specify the base order for the FWO_ directory entry, see 'mkBaseOrder'
data Base = BaseA | BaseC | BaseG | BaseT
-- | Create a base order (FWO_) 'Directory' entry data
mkBaseOrder :: Base -> Base -> Base -> Base -> Directory
mkBaseOrder w x y z =
Directory { dTagName = "FWO_" -- Base order
, dTagNum = 1
, dElemTypeCode = 2
, dElemTypeDesc = "char"
, dElemType = ElemChar
, dElemSize = 1
, dDataOffset = 0
, dDataDebug = []
, dData = getBase w <> getBase x <> getBase y <> getBase z
, dDataSize = 4
, dElemNum = 4
}
where
getBase BaseA = "A"
getBase BaseC = "C"
getBase BaseG = "G"
getBase BaseT = "T"
-- | Create a lane (LANE) 'Directory' entry and data
mkLane :: Int16 -> Directory
mkLane lane =
Directory { dTagName = "LANE" -- Lane or capliary number
, dTagNum = 1
, dElemTypeCode = 4
, dElemTypeDesc = "short"
, dElemType = ElemShort
, dElemSize = 2
, dElemNum = 1
, dDataSize = 2
, dDataOffset = 0
, dData = B.runPut $ B.putInt16be lane
, dDataDebug = []
}
See the code or haddock for the full set of mk* functions.
addDirectory appends a directory entry to an existing Abif. See the examples to see this in use.
-- | Add a directory to an 'Abif'
addDirectory :: Abif -> Directory -> Abif
addDirectory abif dir =
abif { aDirs = aDirs abif <> [dir] }
Generating ABIFs is the main purpose of this package, and the code to do this is in Hyrax.Abif.Generate. There are fewer than 200 lines of code, but I’ll go through how it works in some detail.
generateAb1 is the main function in this module; it controls the flow of generating a single ABIF. At a high level it generates the trace data, calculates the peak locations, builds the minimal set of directory entries, and then writes the ABIF data.
-- | Create the 'ByteString' data for an AB1 given the data from a weighted FASTA (see 'readWeightedFasta')
generateAb1 :: (Text, [(Double, Text)]) -> BSL.ByteString
generateAb1 (fName, sourceFasta) =
let
tr = generateTraceData sourceFasta
valsPerBase = trValsPerBase tr
generatedFastaLen = (Txt.length $ trFasta tr)
-- The point that is the peak of the trace, i.e. mid point of trace for a single base
midPeek = valsPerBase `div` 2
-- Get the peak locations for all bases
peakLocations = take generatedFastaLen [midPeek, valsPerBase + midPeek..]
-- Sample name (from the FASTA name)
sampleName = fst . Txt.breakOn "_" $ fName
-- Create the ABIF directories
dirs = [ mkData 9 $ trData09G tr -- G
, mkData 10 $ trData10A tr -- A
, mkData 11 $ trData11T tr -- T
, mkData 12 $ trData12C tr -- C
, mkBaseOrder BaseG BaseA BaseT BaseC -- Base order, should be GATC for 3500
, mkLane 1 -- Lane or capillary number
, mkCalledBases $ trFasta tr -- Called bases
, mkMobilityFileName 1 "KB_3500_POP7_BDTv3.mob" -- Mobility file name
, mkMobilityFileName 2 "KB_3500_POP7_BDTv3.mob" -- Mobility file name
, mkPeakLocations $ fromIntegral <$> peakLocations -- Peak locations
, mkDyeSignalStrength 53 75 79 48 -- Signal strength per dye
, mkSampleName sampleName -- Sample name
, mkComment "Generated by HyraxBio AB1 generator"
]
-- The ABIF
abif = Abif { aHeader = mkHeader
, aRootDir = mkRoot
, aDirs = dirs
}
in
-- Generate the data
B.runPut (putAbif abif)
readWeightedFasta reads the contents of a single weighted .fasta file. (Unless you are interested in how the FASTA parsing works, you can skip this and go to the next section; just have a look at what the types represent.)
The parsed content has the type [('Double', 'Text')]
, which stores the data like this
[('Double', 'Text')]
^ ^
| |
| +---- read
|
+---- weight
i.e. an array of weights together with the sequence at that weight.
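For instance, for the three-read sample FASTA shown earlier, readWeightedFasta would produce (sketched by hand from the parsing rules below):

Right [ (1.0, "ACAG")
      , (0.3, "_GT")
      , (0.2, "_G")
      ]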
readWeightedFasta :: ByteString -> Either Text [(Double, Text)]
readWeightedFasta fastaData =
case parseFasta $ TxtE.decodeUtf8 fastaData of
Left e -> Left e
Right fs -> getWeightedFasta fs
where
getWeightedFasta :: [Fasta] -> Either Text [(Double, Text)]
getWeightedFasta fs =
case sequenceA $ readWeighted <$> fs of
Left e -> Left e
Right r -> Right r
readWeighted :: Fasta -> Either Text (Double, Text)
readWeighted (Fasta hdr' dta) =
let (processNucs, hdr) =
-- If there is a 'R' suffix, then generate a reverse sequence
-- Which means complement each nucleotide and then reverse the string
if Txt.isSuffixOf "R" hdr'
then (Txt.reverse . complementNucleotides, Txt.strip . Txt.dropEnd 1 $ hdr')
else (identity, hdr')
in
case (readMaybe . Txt.unpack $ hdr :: Maybe Double) of
Just weight -> Right (min 1 . max 0 $ weight, processNucs $ Txt.strip dta)
Nothing -> Left $ "Invalid header reading, expecting numeric weight, got: " <> hdr
The FASTA is read and parsed in Hyrax.Abif.Fasta. Note that readWeighted handles the reverse read logic by calling complementNucleotides and then reversing the string. This part of the code is not central to the ABIF generation, so I won’t spend much time on it.
-- | Return the complement of a nucleotide string
complementNucleotides :: Text -> Text
complementNucleotides ns =
let
un = unIupac <$> Txt.unpack ns
comp = complementNuc <<$>> un
iu = iupac comp
in
Txt.pack iu
where
complementNuc 'A' = 'T'
complementNuc 'G' = 'C'
complementNuc 'T' = 'A'
complementNuc 'C' = 'G'
complementNuc x = x
-- | FASTA data
data Fasta = Fasta { fastaName :: !Text -- ^ Name
, fastaRead :: !Text -- ^ Data
} deriving (Show, Eq)
-- | Parse the data for a single FASTA into a list of 'Fasta' values.
-- Single and multi-line FASTAs are supported.
-- Used by "Hyrax.Abif.Generate" to read weighted-FASTAs
parseFasta :: Text -> Either Text [Fasta]
parseFasta s =
reverse <$> go (Txt.lines s) Nothing "" []
where
go :: [Text] -> Maybe Text -> Text -> [Fasta] -> Either Text [Fasta]
go (line:lines) (Just name) read acc =
if Txt.take 1 line /= ">"
then go lines (Just name) (read <> line) acc
else go lines (Just $ Txt.drop 1 line) "" (Fasta (Txt.strip name) read : acc)
go (line:lines) Nothing _read acc =
if Txt.take 1 line == ">"
then go lines (Just $ Txt.strip . Txt.drop 1 $ line) "" acc
else Left "Expecting name"
go [] Nothing _ acc =
Right acc
go [] (Just _name) "" _acc =
Left "Expecting read"
go [] (Just name) read acc =
Right $ Fasta (Txt.strip name) read : acc
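As a quick sketch of the parser’s behaviour on a tiny input:

parseFasta ">1\nAC\n>0.5\nWK"
-- Right [ Fasta {fastaName = "1",   fastaRead = "AC"}
--       , Fasta {fastaName = "0.5", fastaRead = "WK"} ]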
readWeightedFastas reads all the FASTA files from a directory and returns tuples of ( file-name, f ) where f is the [('Double', 'Text')] described above.
-- | Read all FASTA files in a directory
--
-- The result data has the type
--
-- @
-- [ ('Text', [('Double', 'Text')]) ]
-- ^ ^ ^
-- | | |
-- file name -------------+ | +---- read
-- |
-- +---- weight
-- @
--
readWeightedFastas :: FilePath -> IO (Either Text [(Text, [(Double, Text)])])
readWeightedFastas source = do
files <- filter (Txt.isSuffixOf ".fasta" . Txt.pack) <$> getFiles source
let names = Txt.pack . FP.takeBaseName <$> files
contents <- traverse BS.readFile files
case sequenceA $ readWeightedFasta <$> contents of
Left e -> pure . Left $ e
Right rs -> pure . Right $ zip names rs
generateTraceData does the bulk of the work in the ABIF data generation:
-- | Generate the traces for the AB1 from the parsed weighted FASTA
generateTraceData :: [(Double, Text)] -> TraceData
generateTraceData weighted =
  let
    weightedNucs' = (\(w, ns) -> (w,) . unIupac <$> Txt.unpack ns) <$> weighted
    weightedNucs = Lst.transpose weightedNucs'
Let’s break (\(w, ns) -> (w,) . unIupac <$> Txt.unpack ns) <$> weighted down a bit:

- The lambda is mapped over weighted, which is the [(Double, Text)] discussed above
- \(w, ns) destructures a tuple from the array, giving the weight and the string of nucleotides
- f <$> Txt.unpack ns maps f over each character of the read, where f is (w,) . unIupac
- Each character is passed to unIupac and the result is added to a tuple with the weight, i.e. (weight, [nucleotide])
- unIupac takes a possibly ambiguous nucleotide code and returns the list of nucleotides it represents, e.g. V -> ACG
- Finally Lst.transpose is called. This gives us all the nucleotides and weights per position

This code is perhaps a bit hard to follow, so here is an example showing how it works for this weighted FASTA:
>1
AC
>0.5
WK
Each nucleotide is passed through unIupac, and since W = AT and K = GT we get:

- Position 0 has an A with weight 1 and an A/T with weight 0.5
- Position 1 has a C with weight 1 and a G/T with weight 0.5
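To make that concrete, here is (approximately) what the two bindings hold for this input:

-- One list of (weight, possible nucleotides) per read
weightedNucs' = [ [(1.0, "A"),  (1.0, "C") ]   -- read "AC" at weight 1.0
                , [(0.5, "AT"), (0.5, "GT")]   -- read "WK" at weight 0.5
                ]

-- After Lst.transpose: one list of weighted possibilities per position
weightedNucs  = [ [(1.0, "A"), (0.5, "AT")]    -- position 0
                , [(1.0, "C"), (0.5, "GT")]    -- position 1
                ]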
-- Values for a base that was present. This defines the shape of the chromatogram curve,
-- and defines the number of values per base
curve = [0, 0, 128, 512, 1024, 1024, 512, 128, 0, 0]
valsPerBase = length curve
Next the shape of the curve is defined. A curve of this shape was selected because it has some space on either side, to avoid mixing with neighboring waves, and a steep climb so that the peak is easily detectable.
-- Create the G, A, T and C traces
data09G = concat $ getWeightedTrace curve 'G' <$> weightedNucs
data10A = concat $ getWeightedTrace curve 'A' <$> weightedNucs
data11T = concat $ getWeightedTrace curve 'T' <$> weightedNucs
data12C = concat $ getWeightedTrace curve 'C' <$> weightedNucs
getWeightedTrace :: [Int] -> Char -> [(Double, [Char])] -> [Int16]
getWeightedTrace curve nuc ws =
let
found = filter ((nuc `elem`) . snd) ws
score' = foldl' (+) 0 $ fst <$> found
score = min 1 . max 0 $ score'
wave = floor . (score *) . fromIntegral <$> curve
in
wave
getWeightedTrace is then called for each of the four bases. For each position it returns a curve for the given base: if the position does not contain the base the curve is flat (zeros); if it does, the curve above is returned multiplied by the weight. Again, an example may make this easier to understand:
let ns = [ [ (1.0, "A") -- Position 0
, (0.5, "AT")
]
,
[ (1.0, "C") -- Position 1
, (0.5, "GT")
]
]
let curve = [0, 100, 0]
let rA = getWeightedTrace curve 'A' <$> ns
print rA
let rG = getWeightedTrace curve 'G' <$> ns
print rG
This calls getWeightedTrace for the A and G bases.

For A:
- There is an A at position 0 with a total weight of 1 (remember the max is 1.0), so the full curve is used
- There is no A at position 1, so the curve is flat

For G:
- There is no G at position 0, so the curve is flat
- There is a G at position 1 with a weight of 0.5, so each value in the wave is multiplied by 0.5

Notice that in the code above these results are then concatenated, so the actual traces are [0, 100, 0, 0, 0, 0] for A and [0, 0, 0, 0, 50, 0] for G.
With that we have a way to generate a waveform for the input weighted FASTA.
-- Create fasta sequence for the trace
fastaSeq = concat <$> (snd <<$>> weightedNucs)
fasta = Txt.pack $ iupac fastaSeq
The ABIF needs to store the called bases in the PBAS entry. We get the bases from the input data, IUPAC encode each position and we have the sequence.
TraceData { trData09G = data09G
, trData10A = data10A
, trData11T = data11T
, trData12C = data12C
, trFasta = fasta
, trValsPerBase = valsPerBase
}
And the TraceData value is returned.

For completeness, here is the unIupac function:
-- | Convert a IUPAC ambiguity code to the set of nucleotides it represents
unIupac :: Char -> [Char]
unIupac c =
case c of
'T' -> "T"
'C' -> "C"
'A' -> "A"
'G' -> "G"
'U' -> "T"
'M' -> "AC"
'R' -> "AG"
'W' -> "AT"
'S' -> "CG"
'Y' -> "CT"
'K' -> "GT"
'V' -> "ACG"
'H' -> "ACT"
'D' -> "AGT"
'B' -> "CGT"
'N' -> "GATC"
'X' -> "GATC"
_ -> ""
-- The point that is the peak of the trace, i.e. mid point of trace for a single base
midPeek = valsPerBase `div` 2
-- Get the peak locations for all bases
peakLocations = take generatedFastaLen [midPeek, valsPerBase + midPeek..]
To generate the array of peak locations, given a curve of [0, 10, 10, 0]:

- valsPerBase = the length of the curve = 4
- midPeek = 4 `div` 2 = 2
- the peak locations are [2, 6, 10, 14, 18, ...], one element per base of the input FASTA

This is stored in the PLOC directory entry. We now have all the data we need; the mk* functions are used to generate the minimal set of directories:
-- Create the ABIF directories
dirs = [ mkData 9 $ trData09G tr -- G
, mkData 10 $ trData10A tr -- A
, mkData 11 $ trData11T tr -- T
, mkData 12 $ trData12C tr -- C
, mkBaseOrder BaseG BaseA BaseT BaseC -- Base order, should be GATC for 3500
, mkLane 1 -- Lane or capillary number
, mkCalledBases $ trFasta tr -- Called bases
, mkMobilityFileName 1 "KB_3500_POP7_BDTv3.mob" -- Mobility file name
, mkMobilityFileName 2 "KB_3500_POP7_BDTv3.mob" -- Mobility file name
, mkPeakLocations $ fromIntegral <$> peakLocations -- Peak locations
, mkDyeSignalStrength 53 75 79 48 -- Signal strength per dye
, mkSampleName sampleName -- Sample name
, mkComment "Generated by HyraxBio AB1 generator"
]
-- The ABIF
abif = Abif { aHeader = mkHeader
, aRootDir = mkRoot
, aDirs = dirs
}
And with that we can generate any test ABIF we need. The code is much shorter than the explanation; hopefully, with the guidance from this post and the code comments, it is easy enough to follow.
The package comes with property tests.
We used Hedgehog for the property tests. It made writing the properties and generators (see the Generators module) really easy.
For more details see the property tests and the Hedgehog documentation.
Hopefully you find this package useful, either as a standalone tool or as a library. If you do we would love to hear how you are using it.
If you have any questions feel free to email me.
Thanks
Andre.
- master branch
- blog branch, commit = 3fa4f873bcb54756d520f9d2f2cd4995aeccaa4c
In this series of posts I’m going to discuss some of the major design decisions that you will need to consider when making a Haskell roguelike game. I’ll be talking about how the code is implemented as well as the decisions I took along the way and why.
See the chapter list below to get an idea of what topics I’ll be covering.
RogueBasin defines a roguelike as
A Roguelike is usually described as a free turn-based computer game with a strong focus on intricate gameplay and replayability, and an abstract world representation using ASCII-based display, as opposed to 3D graphics. Of course, as with any genre, there are deviations from the norm.
There are several good Haskell roguelike tutorials and tools. The ones I know about are
Each of the tutorials takes a different approach, covers different topics and addresses different levels of familiarity with Haskell. The more the merrier! If you are interested in building a roguelike, or just looking for ideas for writing games in Haskell, then looking at the links above as well as this series seems like a good idea.
LambdaHack is a game engine for making ASCII roguelike games. It is definitely worth taking a look at. For this series I decided not to use LambdaHack as it already implements many of the things I was interested in learning about. This is a trade-off you’ll have to make. Are you currently more interested in learning how to build the game or more interested in getting a robust game together? I’d suggest learning with a small roguelike first as you’ll then have a much better idea of what a game engine offers or how it may constrain you.
RogueBasin has a vast amount of detail on creating roguelike games. It’s well worth your time to look at for ideas. I’ll be referring to articles from RogueBasin throughout the series.
Apart from just being another perspective, I think that these are some reasons you may want to continue reading
Also I’m not a gamer and this is the first game I’ve ever written, so that should bring a slightly different perspective ;)
There are 20 “chapters”. Each chapter builds on the previous one. Each chapter’s code is available on github and has a working example. At the end of each chapter is a patch file that shows what was added/changed.
I’ve tried to keep the Haskell code simple; hopefully it is easy to follow even if you don’t have much experience. For example, there is only one transformer used (chapter 14) and that is entirely optional.
I am using Protolude as my prelude. It should be fairly easy to switch to something else. Note that Protolude defines <<$>>, which is the same as (fmap . fmap) or <$$> from composition-extra.
If you are unfamiliar with lenses, here are some links that may help get you started quickly.
And here are a few quick examples that may help you get a sense for what they do.
Given this definition
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
module Main where
import Protolude
import qualified Data.Text as Txt
import Control.Lens
import Control.Lens.TH
data Parent = Parent { _pName :: Text
, _pChild :: Child
} deriving (Show)
data Child = Child { _cName :: Text
} deriving (Show)
makeLenses ''Parent
makeLenses ''Child
main :: IO ()
main = do
let parent = Parent { _pName = "parent1"
, _pChild = Child { _cName = "child1" }
}
The ^. lens acts as a field getter:
-- With lens
putText $ parent ^. pName
putText $ parent ^. pChild ^. cName

-- Without lens
putText $ _pName parent
putText $ _cName . _pChild $ parent
Which both print
parent1
child1
There is not too much difference between the two styles, yet.
The .~ lens acts as a field setter:
# With lens
print $ parent & pName .~ "newName"
print $ parent & (pChild . cName) .~ "new childName"
# Without lens
print $ parent { _pName = "newName" }
print $ parent { _pChild = (_pChild parent) { _cName = "new childName" } }
Which both print
Parent {_pName = "newName", _pChild = Child {_cName = "child1"}}
Parent {_pName = "parent1", _pChild = Child {_cName = "new childName"}}
Once you start updating nested records, I think the lens code is much easier to read. The deeper the nesting the more true this is.
The %~ lens acts as a field updater that works by sending the current value through a function:
-- With lens
print $ parent & pName %~ Txt.toUpper
print $ parent & (pChild . cName) %~ Txt.toUpper

-- Without lens
print $ parent { _pName = Txt.toUpper . _pName $ parent }
print $ parent { _pChild = (_pChild parent) { _cName = Txt.toUpper . _cName . _pChild $ parent } }
Which both print
Parent {_pName = "PARENT1", _pChild = Child {_cName = "child1"}}
Parent {_pName = "parent1", _pChild = Child {_cName = "CHILD1"}}
Here I’d say the lens code is significantly easier to read, i.e. the intent is much clearer.
Quite often you’ll end up combining these three lenses, and the alternative, in my view, is way too noisy. There are some trade-offs to using lens, e.g. Template Haskell, but I feel they are justified by the resulting code. Obviously, if you prefer not to use lens, you can do everything reasonably easily without it.
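As a quick sketch of combining them on the Parent defined above:

-- Set the parent's name and upper-case the child's name in one expression
print $ parent & pName .~ "newName"
               & (pChild . cName) %~ Txt.toUpper
-- Parent {_pName = "newName", _pChild = Child {_cName = "CHILD1"}}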
(updated: 09/March/2018 - includes, local path & html, css and js support)
Ensuring that the code you include in a blog post is up to date and works can be a bit of a pain. Often I’ll change code while writing a post and then I have to find and copy anything that has changed. This is manual and error prone.
Fortunately Hakyll is reasonably easy to customise. Here I’ll show one way to write a hakyll compiler to help with this issue.
What I wanted was
---
title: testing
---
[<code setup.repo>] https://gist.github.com/53e179c4244411493ae1f9deebc3cc3f.git
[<code setup.sha>] 5a95ece18ecb248fb745b3e7cb19f5c4d410240f
[<code setup.run>] stack init --resolver lts-12.0
[<code setup.run>] stack build
[<code setup.run>] stack test
Some text
[<code>] main
more text
The markdown should be parsed as follows
[<code setup.repo>]
- is the git repo to pull the code from[<code setup.sha>]
- is the commit to work with[<code setup.run>]
- any number of commands to run in order.[<code>] main
- gets the section named main
from the code and inserts it as a markdown code block{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
import Protolude hiding (onException)
import Hakyll
import System.FilePath ((</>))
import qualified System.FilePath as FP
import qualified System.Directory as Dir
import Control.Exception.Safe (onException, throwString)
{-! SECTION< main !-}
main :: IO ()
main = hakyll $ do
match "posts/*" $ do
route $ setExtension "html"
compile $ includeCodeCompiler
>>= renderPandoc
{-! SECTION> main !-}
{-! SECTION<
starts a code section{-! SECTION>
ends a code sectionWhen pandoc is run the include compiler will insert the code from the main
section and add a title showing the source path (repo relative) and the position (line from & to).
main :: IO ()
main = hakyll $ do
match "posts/*" $ do
route $ setExtension "html"
compile $ includeCodeCompiler
>>= renderPandoc
Before including the code, the includeCodeCompiler will check out the code and run the commands specified in the template. In the example template above I’m cloning from a github gist that does not have a stack.yaml, so I run stack init first. You can use the commands to run tests etc. to ensure that your sample code is working 100%. If any action fails, the blog generation is aborted.
The hakyll tutorial will give you an idea of how to set up hakyll.
This is a fairly standard match clause to run your posts through pandoc to generate HTML output
match "posts/*" $ do
route $ setExtension "html"
compile $ pandocCompiler
>>= loadAndApplyTemplate "templates/post.html" postCtx
>>= saveSnapshot "content"
>>= loadAndApplyTemplate "templates/default.html" postCtx
>>= relativizeUrls
Let’s modify this rule to use a new compiler named includeCodeCompiler and pipe its output through pandoc:
match "posts/*" $ do
route $ setExtension "html"
compile $ includeCodeCompiler
>>= renderPandoc
>>= loadAndApplyTemplate "templates/post.html" postCtx
>>= saveSnapshot "content"
>>= loadAndApplyTemplate "templates/default.html" postCtx
>>= relativizeUrls
The two changes to notice are that includeCodeCompiler replaces pandocCompiler, and that its output is then explicitly rendered with renderPandoc.
Here are the imports I’m using
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE LambdaCase #-}
import Protolude hiding (onException)
import Prelude (String)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import qualified Data.List as Lst
import qualified Data.Text as Txt
import qualified Data.Text.IO as Txt
import Data.Monoid (mappend)
import Hakyll
import qualified System.Exit as Xit
import qualified System.Process as Proc
import System.FilePath ((</>))
import qualified System.FilePath as FP
import qualified System.Directory as Dir
import Control.Exception.Safe (onException, throwString)
And a few helper functions for running shell processes and finding files
runShell' :: FilePath -> Text -> IO ()
runShell' workingDir cmd = do
putText cmd
runShell workingDir cmd >>= \case
Right _ -> pass
Left e -> throwString $ Txt.unpack "Error running `" <> Txt.unpack cmd <> "` " <> show e
runShell :: FilePath -> Text -> IO (Either Int ())
runShell workingDir cmd = do
let p = Proc.shell $ Txt.unpack cmd
(_, _, _, phandle) <- Proc.createProcess p { Proc.cwd = Just workingDir }
Proc.waitForProcess phandle >>= \case
Xit.ExitSuccess -> pure $ Right ()
Xit.ExitFailure i -> pure $ Left i
getFilesRec :: FilePath -> IO [FilePath]
getFilesRec p = do
fs <- (p </>) <<$>> getFiles p
ds <- (p </>) <<$>> getDirs p
cs <- traverse getFilesRec ds
pure $ fs <> join cs
getDirs :: FilePath -> IO [FilePath]
getDirs p = do
entries <- (p </>) <<$>> Dir.listDirectory p
filterM Dir.doesDirectoryExist entries
getFiles :: FilePath -> IO [FilePath]
getFiles p = do
entries <- (p </>) <<$>> Dir.listDirectory p
filterM Dir.doesFileExist entries
includeCodeCompiler :: Compiler (Item String)
includeCodeCompiler = do
p <- getResourceFilePath
getResourceString >>= withItemBody (unsafeCompiler . includeCompile p)
where
includeCompile :: FilePath -> String -> IO String
includeCompile compilingPath source =
includeCompile' source
`onException`
putStr ("Exception compiling includes for: " <> compilingPath)
includeCompile' :: String -> IO String
includeCompile' source = do
let ls1 = Txt.lines $ Txt.pack source
let (sourceNoSetup, repoPath', sha', cmds', path') = getConfig ls1
A pandoc compiler has the type Compiler (Item String). Since this compiler needs to read files it has to be able to perform IO; to allow IO the unsafeCompiler function is used.

So this code gets the current file path and the item body, and starts includeCompile in IO. onException is used to print the name of the file being compiled if there is an exception.

Once the config (repo, sha and commands) has been read, the main compiler logic can be run.
case path' of
Nothing ->
case (repoPath', sha', cmds') of
(Nothing, Nothing, []) -> pure $ Txt.unpack . Txt.unlines $ sourceNoSetup
(Just _, Nothing, _) -> throwString "No sha found"
(Just _, _, []) -> throwString "No run commands found"
(Nothing, _, (_:_)) -> throwString "No repo setup found"
(Nothing, Just _, []) -> throwString "No repo setup found"
(Just repoPath, Just sha, cmds) -> do
root <- Dir.getCurrentDirectory
let tempPath = root </> tmpDirectory defaultConfiguration </> "codeIncludeGit"
-- Cleanup from previous post
removeDirectoryRecursiveIfExists tempPath
Dir.createDirectoryIfMissing True tempPath
-- Clone the git repo
runShell' root $ "git clone \"" <> repoPath <> "\" \"" <> Txt.pack tempPath <> "\""
-- Goto the correct sha (if it was specified)
gotoSha sha tempPath
-- Execute the run commands (buid, test etc)
executeRunCommands cmds tempPath
-- Delete all dirs we are not interested in (exclude .git and .stack-work)
removeDirectoryRecursiveIfExists $ tempPath </> ".git"
removeDirectoryRecursiveIfExists $ tempPath </> ".stack-work"
includeCode tempPath repoPath sha sourceNoSetup
Just path ->
includeCode (Txt.unpack path) "**local**" "**local**" sourceNoSetup
includeCode tempPath repoPath sha sourceNoSetup = do
-- Get all files in the repo
files <- getFilesRec tempPath
-- All sections from all files
sections' <- Map.fromList . concat <$> traverse getSections files
let sections = Map.map (\(p, s, e, lang, ls) -> (drop (length tempPath + 1) p, s, e, lang, ls)) sections'
-- Replace sections in the file
replaced' <- traverse (replaceCodeLineSection tempPath sections) sourceNoSetup
let replaced = Txt.unlines . concat $ replaced'
-- Replace sha and repo tokens
pure . Txt.unpack . Txt.replace "2297510b93a903ab23a319f7921351a9725cef0e" sha $ Txt.replace "https://gist.github.com/53e179c4244411493ae1f9deebc3cc3f.git" repoPath replaced
executeRunCommands :: [Text] -> FilePath -> IO ()
executeRunCommands cmds path =
traverse_ (runShell' path) cmds
gotoSha :: Text -> FilePath -> IO ()
gotoSha sha tmpPath = do
runShell' tmpPath ("git reset " <> sha <> " --hard")
void $ runShell tmpPath "git clean -dfx"
This code does the following:

- Removes any directory left over from a previous post and creates a fresh temp directory
- Clones the git repo
- Resets to the specified sha and cleans the working tree (gotoSha)
- Executes the run commands, e.g. build and test (executeRunCommands)
- Deletes the directories we are not interested in (.git and .stack-work)
- Includes the code sections; or, if a local path was specified, reads the sections directly from that path
Loading the config is done quite simply by filtering the source lines
getConfig ls =
let
cfgPath = "[<code setup.path>]"
cfgRepo = "[<code setup.repo>]"
cfgSha = "[<code setup.sha>]"
cfgRun = "[<code setup.run>]"
path = Txt.strip . Txt.drop (Txt.length cfgRepo) <$> headMay (filter (Txt.isPrefixOf cfgPath) ls)
repo = Txt.strip . Txt.drop (Txt.length cfgRepo) <$> headMay (filter (Txt.isPrefixOf cfgRepo) ls)
sha = Txt.strip . Txt.drop (Txt.length cfgSha) <$> headMay (filter (Txt.isPrefixOf cfgSha) ls)
run = Txt.strip . Txt.drop (Txt.length cfgRun) <$> filter (Txt.isPrefixOf cfgRun) ls
in
(filter (not . Txt.isPrefixOf "[<code setup.") ls, repo, sha, run, path)
removeDirectoryRecursiveIfExists p =
Dir.doesDirectoryExist p >>= \case
True -> Dir.removeDirectoryRecursive p
False -> pass
Once the sections have been loaded from the source code, the tags can be replaced in the markdown. Each [<code>] tag is replaced by a markdown code block with a title showing the source file and offset.
replaceCodeLineSection :: FilePath -> Map Text (FilePath, Int, Int, Text, [Text]) -> Text -> IO [Text]
replaceCodeLineSection tempPath sections line = do
let codeTag = "[<code>]"
let includeTag = "[<include>]"
if Txt.isPrefixOf codeTag line
then
let secName = Txt.strip . Txt.drop (Txt.length codeTag) $ line in
case Map.lookup secName sections of
Nothing -> throwString $ Txt.unpack $ "No section named " <> secName
Just (path, start, end, lang, code) ->
let title = Txt.pack path <> " (" <> show start <> " to " <> show end <> ")" in
pure [ "###### " <> title
, ""
, "~~~{." <> lang <> "}"
, Txt.unlines code
, "~~~"
, ""
]
else
if not $ Txt.isPrefixOf includeTag line
then pure [line]
else
let
incRelPath = Txt.strip . Txt.drop (Txt.length includeTag) $ line
incFullPath = tempPath </> Txt.unpack incRelPath
in
sequenceA [Txt.readFile incFullPath]
type LineState = (Int, [(Text, (FilePath, Int, Int, Text, [Text]))])
getSections :: FilePath -> IO [(Text, (FilePath, Int, Int, Text, [Text]))]
getSections f =
case FP.takeExtension f of
".hs" -> getLangSections "{-! SECTION< " "{-! SECTION> " "{-! SECTION" "haskell"
".js" -> getLangSections "//!SECTION< " "//!SECTION> " "//!SECTION" "javascript"
".html" -> getLangSections "<!-- !SECTION+ " "<!-- !SECTION- " "<!-- !SECTION" "html"
".css" -> getLangSections "/* !SECTION< " "/* !SECTION> " "/* !SECTION" "html"
_ -> pure []
where
getLangSections startToken endToken cleanToken lang = do
ls <- Txt.lines <$> Txt.readFile f
(_, r) <- foldlM (parseLine ls) (1, []) ls
pure r
where
parseLine :: [Text] -> LineState -> Text -> IO LineState
parseLine ls (lineNum, hist) l =
if not . Txt.isPrefixOf startToken . Txt.strip $ l
then pure (lineNum + 1, hist)
else do
let secName = Txt.strip . fst . Txt.breakOn " " . Txt.strip . Txt.drop (Txt.length startToken) . Txt.strip $ l
end <- scanForEnd ls secName lineNum
pure (lineNum + 1, (secName, (f, lineNum + 1, lineNum + length end, lang, end)) : hist)
scanForEnd ls secName fromLine =
let fromOffset = drop fromLine ls in
case Lst.span (not . Txt.isPrefixOf (endToken <> secName) . Txt.strip) fromOffset of
(_, []) -> throwString $ "No section end found for: " <> Txt.unpack secName
(r, _) -> pure $ filter cleanLine r
cleanLine =
not . Txt.isPrefixOf cleanToken . Txt.strip
Different types of files need different tag styles. In the code above I’m handling Haskell, JavaScript, HTML and CSS. You should be able to extend this fairly easily to other languages as well.
parseLine works by going line by line looking for a start token; for each one that it finds it scans ahead to find the end token. This is a little inefficient but it allows for nested and/or overlapping tags.
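For example, for a hypothetical Main.hs (line numbers shown only for illustration):

--  3: {-! SECTION< main !-}
--  4: main :: IO ()
--  5: main = putText "hi"
--  6: {-! SECTION> main !-}
--
-- getSections "Main.hs" would produce (roughly):
--   [("main", ("Main.hs", 4, 5, "haskell", ["main :: IO ()", "main = putText \"hi\""]))]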
Sometimes it is useful to include external files in a post. The [<include>] tag makes this simple, for example: [<include>] /home/user/static/interestingStuff.json. Unlike the [<code>] tag, no assumptions are made about the included text. If you want it syntax highlighted, simply wrap the text in a code fence.
Fetching the code from a remote repo and doing a full build each time can be pretty slow. This is fine when you are confirming that everything works correctly, but it’s not ideal when you are writing a post and still making many small changes. To help with this there is a [<code setup.path>] tag, which overrides the repo and run settings. If it is present then all code sections will be read directly from this path without any fetching, building or running of commands, e.g. [<code setup.path>] /home/user/dev/myProject. Obviously it is important to remove this setting once the post is done.
Once you add this to your hakyll setup you can be sure that you are only using working code blocks. While there is a fair bit of code in this compiler, most of it deals with file IO and parsing. I think it also shows how easily hakyll can be customised to do useful things.
This code works with hakyll 4.10.0.0. See the cabal file in the gist for other dependencies
bhoogle is a simple hoogle terminal GUI written using brick. This post is the annotated source code that should give you an idea of how to use brick and how easy brick makes building terminal UIs.
bhoogle is possibly useful as a local hoogle UI as well as a demo app. You can get the full code from github.
You will need an existing local hoogle database. If you do not already have one, or are unsure, then do this:

- Install hoogle (stack install hoogle)
- Generate a default database (hoogle generate)

You can then clone the code, or download one of the pre-built Linux releases.
There are a few conventions to get used to when building a brick UI, but I don’t think it should take you too long to get the hang of things.
The brick user guide and documentation are fantastic. Brick comes with multiple example apps that show controls and features being used. There are also third party tutorials e.g. Samuel Tay’s brick tutorial
If you have looked at the user guide or Samuel Tay’s tutorial you’ll already have some idea of the fundamental concepts. Below is the annotated source for bhoogle. As always feel free to email or contact me on twitter if anything is unclear and I’ll do my best to assist.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
module Main where
import Protolude
import Control.Lens ((^.), (.~), (%~))
import Control.Lens.TH (makeLenses)
import qualified Data.List as Lst
import qualified Data.Time as Tm
import qualified Data.Text as Txt
import qualified Data.Vector as Vec
import Brick ((<+>), (<=>))
import qualified Brick as B
import qualified Brick.BChan as BCh
import qualified Brick.Focus as BF
import qualified Brick.AttrMap as BA
import qualified Brick.Widgets.List as BL
import qualified Brick.Widgets.Edit as BE
import qualified Brick.Widgets.Border as BB
import qualified Brick.Widgets.Border.Style as BBS
import Control.Concurrent (threadDelay, forkIO)
import qualified Graphics.Vty as V
import qualified Graphics.Vty.Input.Events as K
import qualified Hoogle as H
Import all the modules we’ll need. I’m using protolude as my custom prelude; changing to one of the others, e.g. classy, should be pretty simple if you prefer that.
I’m also using lens. The brick examples use lens so it’s worth getting used to. However I’m only using three of the simpler lenses, so if you don’t like lens or Template Haskell it should be easy enough to remove them.
-- | Events that can be sent
-- | Here there is just one event for updating the time
newtype Event = EventUpdateTime Tm.LocalTime
-- | Names use to identify each of the controls
data Name = TypeSearch
| TextSearch
| ListResults
deriving (Show, Eq, Ord)
Next we need to define the type of custom events that our brick application can handle, and a sum type defining the “name” of each control we want to use.

In this example there is only a single event, EventUpdateTime. It is sent once a second with the current time, which brick then displays in the top right corner.

There are three controls: the type search editor (TypeSearch), the text search editor (TextSearch) and the results list (ListResults).
-- | Sort order
data SortBy = SortNone
| SortAsc
| SortDec
deriving (Eq)
-- | State of the brick app. Contains the controls and any other required state
data BrickState = BrickState
{ _stEditType :: !(BE.Editor Text Name) -- ^ Editor for the type to search for
, _stEditText :: !(BE.Editor Text Name) -- ^ Editor for a text search in the results
, _stResultsList :: !(BL.List Name H.Target) -- ^ List for the search results
, _stFocus :: !(BF.FocusRing Name) -- ^ Focus ring - a circular list of focusable controls
, _stTime :: !Tm.LocalTime -- ^ The current time
, _stResults :: [H.Target] -- ^ The last set of search results from hoogle
, _stSortResults :: SortBy -- ^ Current sort order for the results
}
makeLenses ''BrickState
BrickState contains the current state of the brick application. Any event e.g. the custom update time event, or any key press event can result in the state being updated. There is a separate draw function that renders the state.
I.e. one part of the code deals with events (roughly state -> event -> state) and another handles the drawing (state -> GUI).
Here the state contains the two editors, the results list, the focus ring, the current time, the last search results and the current sort order.
-- | Defines how the brick application will work / handle events
app :: B.App BrickState Event Name
app = B.App { B.appDraw = drawUI
, B.appChooseCursor = B.showFirstCursor
, B.appHandleEvent = handleEvent
, B.appStartEvent = pure
, B.appAttrMap = const theMap
}
The App type defines how the brick app operates, by defining how events are handled (appHandleEvent) and how the GUI is drawn (appDraw).
main :: IO ()
main = do
chan <- BCh.newBChan 5 -- create a bounded channel for events
-- Send a tick event every second with the current time
-- Brick will send this to our event handler which can then update the stTime field
void . forkIO $ forever $ do
t <- getTime
BCh.writeBChan chan $ EventUpdateTime t
threadDelay $ 1 * 1000000
-- Initial current time value
t <- getTime
-- Construct the initial state values
let st = BrickState { _stEditType = BE.editor TypeSearch (Just 1) ""
, _stEditText = BE.editor TextSearch (Just 1) ""
, _stResultsList = BL.list ListResults Vec.empty 1
, _stTime = t
, _stFocus = BF.focusRing [TypeSearch, TextSearch, ListResults]
, _stResults = []
, _stSortResults = SortNone
}
-- Run brick
void $ B.customMain (V.mkVty V.defaultConfig) (Just chan) app st
where
-- | Get the local time
getTime = do
t <- Tm.getCurrentTime
tz <- Tm.getCurrentTimeZone
pure $ Tm.utcToLocalTime tz t
In main some setup is performed and then brick is started by calling customMain.
For bhoogle the steps are: create the bounded event channel, fork a thread that sends the current time over the channel every second, construct the initial state, and then call B.customMain to run brick.
-- | Main event handler for brick events
handleEvent :: BrickState -> B.BrickEvent Name Event -> B.EventM Name (B.Next BrickState)
handleEvent st ev =
case ev of
(B.AppEvent (EventUpdateTime time)) ->
-- Update the time in the state
B.continue $ st & stTime .~ time
handleEvent gets all the brick events, updates the state and decides how to continue.
Here the code matches on the custom event (B.AppEvent), looking for our update time event (EventUpdateTime), and then updates the state with the current time. B.continue means that brick carries on running with the updated state. Note that the UI is not changed in any way here; we are just altering the current state.
-- Handle keyboard events
-- k is the key
-- ms are the modifier keys
(B.VtyEvent ve@(V.EvKey k ms)) ->
case (k, ms) of
-- Escape quits the app, no matter what control has focus
(K.KEsc, []) -> B.halt st
Then the code matches any keyboard event (B.VtyEvent), here matching on the escape key (K.KEsc). So when the user presses the escape key this handler will call B.halt, which will terminate the app. As this is done at the top level, no matter which control has the focus, escape will exit.
_ ->
-- How to interpret the key press depends on which control is focused
case BF.focusGetCurrent $ st ^. stFocus of
For the rest of the key press logic, what bhoogle does depends on which control has the focus. BF.focusGetCurrent is used to get that from the state’s focus ring.
Just TypeSearch ->
case k of
K.KChar '\t' -> do
-- Search, clear sort order, focus next
found <- doSearch st
B.continue . filterResults $ st & stFocus %~ BF.focusNext
& stResults .~ found
& stSortResults .~ SortNone
K.KBackTab -> do
-- Search, clear sort order, focus prev
found <- doSearch st
B.continue . filterResults $ st & stFocus %~ BF.focusPrev
& stResults .~ found
& stSortResults .~ SortNone
If the user is typing in the “type” edit box and tabs out (either tab or shift-tab), then a search is performed, the sort order is cleared and the focus moves to the next/previous control.
K.KEnter -> do
-- Search, clear sort order, focus on results
-- This makes it faster if you want to search and navigate
-- results without tabing through the text search box
found <- doSearch st
B.continue . filterResults $ st & stResults .~ found
& stSortResults .~ SortNone
& stFocus %~ BF.focusSetCurrent ListResults
If the user presses enter while in the type search edit box, then a search is performed, the sort order is cleared and the focus jumps straight to the results list.
_ -> do
-- Let the editor handle all other events
r <- BE.handleEditorEvent ve $ st ^. stEditType
next <- liftIO . searchAhead doSearch $ st & stEditType .~ r
B.continue next
For all other key events for the type search, let the editor control handle the key press. This gives us editing, navigation etc for free.
Just TextSearch ->
case k of
K.KChar '\t' -> B.continue $ st & stFocus %~ BF.focusNext -- Focus next
K.KBackTab -> B.continue $ st & stFocus %~ BF.focusPrev -- Focus previous
_ -> do
-- Let the editor handle all other events
r <- BE.handleEditorEvent ve $ st ^. stEditText
B.continue . filterResults $ st & stEditText .~ r
For the text edit box: tab and shift-tab move the focus, and any other key press is handled by the editor control, after which the results are filtered using the new text.
Just ListResults ->
case k of
K.KChar '\t' -> B.continue $ st & stFocus %~ BF.focusNext -- Focus next
K.KBackTab -> B.continue $ st & stFocus %~ BF.focusPrev -- Focus previous
K.KChar 's' ->
-- Toggle the search order between ascending and descending,
-- use asc if sort order was 'none'
let sortDir = if (st ^. stSortResults) == SortAsc
then SortDec
else SortAsc
in
let sorter = if sortDir == SortDec
then (Lst.sortBy $ flip compareType)
else (Lst.sortBy compareType)
in
B.continue . filterResults $ st & stResults %~ sorter
& stSortResults .~ sortDir
_ -> do
-- Let the list handle all other events
-- Using handleListEventVi which adds vi-style keybindings for navigation
-- and the standard handleListEvent as a fallback for all other events
r <- BL.handleListEventVi BL.handleListEvent ve $ st ^. stResultsList
B.continue $ st & stResultsList .~ r
_ -> B.continue st
_ -> B.continue st
For the results listbox: tab and shift-tab move the focus, ‘s’ toggles the sort order, and all other events go to BL.handleListEventVi BL.handleListEvent, which gives us vi-style navigation and uses the standard handleListEvent as the fallback, so that all the normal navigation (arrows etc.) also works.
where
doSearch st' =
liftIO $ searchHoogle (Txt.strip . Txt.concat $ BE.getEditContents (st' ^. stEditType))
And finally, for handleEvent, the doSearch helper calls the searchHoogle function (below) to search using the text from the type edit box.
-- | Search ahead for type strings longer than 3 chars.
searchAhead :: (BrickState -> IO [H.Target]) -> BrickState -> IO BrickState
searchAhead search st =
let searchText = Txt.strip . Txt.concat . BE.getEditContents $ st ^. stEditType in
if Txt.length searchText > 3
then do
-- Search
found <- search st
pure . filterResults $ st & stResults .~ found
& stSortResults .~ SortNone
else
-- Just clear
pure $ st & stResults .~ []
& stResultsList %~ BL.listClear
searchAhead is a helper function that searches hoogle as the user types, as long as more than three characters have been entered. Without this limit hoogle seems a bit slow on my machine because of the large number of results.
-- | Filter the results from hoogle using the search text
filterResults :: BrickState -> BrickState
filterResults st =
let allResults = st ^. stResults in
let filterText = Txt.toLower . Txt.strip . Txt.concat . BE.getEditContents $ st ^. stEditText in
let results =
if Txt.null filterText
then allResults
else filter (\t -> Txt.isInfixOf filterText . Txt.toLower $ formatResult t) allResults
in
st & stResultsList .~ BL.list ListResults (Vec.fromList results) 1
Filter the hoogle results by doing a sub-string search, if the user has entered any filter text.
-- | Draw the UI
drawUI :: BrickState -> [B.Widget Name]
drawUI st =
[B.padAll 1 contentBlock]
where
contentBlock =
(B.withBorderStyle BBS.unicode $ BB.border searchBlock)
<=>
B.padTop (B.Pad 1) resultsBlock
resultsBlock =
let total = show . length $ st ^. stResults in
let showing = show . length $ st ^. stResultsList ^. BL.listElementsL in
(B.withAttr "infoTitle" $ B.txt "Results: ") <+> B.txt (showing <> "/" <> total)
<=>
(B.padTop (B.Pad 1) $
resultsContent <+> resultsDetail
)
resultsContent =
BL.renderList (\_ e -> B.txt $ formatResult e) False (st ^. stResultsList)
resultsDetail =
B.padLeft (B.Pad 1) $
B.hLimit 60 $
vtitle "package:"
<=>
B.padLeft (B.Pad 2) (B.txt $ getSelectedDetail (\t -> maybe "" (Txt.pack . fst) (H.targetPackage t)))
<=>
vtitle "module:"
<=>
B.padLeft (B.Pad 2) (B.txt $ getSelectedDetail (\t -> maybe "" (Txt.pack . fst) (H.targetModule t)))
<=>
vtitle "docs:"
<=>
B.padLeft (B.Pad 2) (B.txt $ getSelectedDetail (Txt.pack . clean . H.targetDocs))
<=>
B.fill ' '
searchBlock =
((htitle "Type: " <+> editor TypeSearch (st ^. stEditType)) <+> time (st ^. stTime))
<=>
(htitle "Text: " <+> editor TextSearch (st ^. stEditText))
htitle t =
B.hLimit 20 $
B.withAttr "infoTitle" $
B.txt t
vtitle t =
B.withAttr "infoTitle" $
B.txt t
editor n e =
B.vLimit 1 $
BE.renderEditor (B.txt . Txt.unlines) (BF.focusGetCurrent (st ^. stFocus) == Just n) e
time t =
B.padLeft (B.Pad 1) $
B.hLimit 20 $
B.withAttr "time" $
B.str (Tm.formatTime Tm.defaultTimeLocale "%H-%M-%S" t)
getSelectedDetail fn =
case BL.listSelectedElement $ st ^. stResultsList of
Nothing -> ""
Just (_, e) -> fn e
drawUI renders the state and creates the GUI. At first this may take some getting used to, but you will soon be able to see the GUI structure from the code.
<=> places one widget below another, i.e. starts the next “line”
<+> places one widget next to another, i.e. side by side
Create small GUI fragments/“controls” and combine them with <+> and <=>
B.txt displays a Text, B.str displays a String/[Char]
B.fill ' ' is used to get brick to fill to the maximum width (here 60) rather than having the right detail pane grow/shrink as the data changes.
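To make the combinators concrete, here is a tiny layout fragment (not from bhoogle, just an illustration of combining small pieces):
example :: B.Widget Name
example =
  (B.txt "Type: " <+> B.txt "Text -> Int") -- first “line”: two widgets side by side
  <=>
  (B.txt "Text: " <+> B.txt "length") -- second “line” below the first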
theMap :: BA.AttrMap
theMap = BA.attrMap V.defAttr [ (BE.editAttr , V.black `B.on` V.cyan)
, (BE.editFocusedAttr , V.black `B.on` V.yellow)
, (BL.listAttr , V.white `B.on` V.blue)
, (BL.listSelectedAttr, V.blue `B.on` V.white)
, ("infoTitle" , B.fg V.cyan)
, ("time" , B.fg V.yellow)
]
The attribute map is where attributes for the controls and custom attributes are defined. This makes it easy to change how the GUI looks. There is even support for themes and basic markup.
----------------------------------------------------------------------------------------------
-- | Compare two hoogle results for sorting
compareType :: H.Target -> H.Target -> Ordering
compareType a b =
compare (formatResult a) (formatResult b)
-- | Search hoogle using the default hoogle database
searchHoogle :: Text -> IO [H.Target]
searchHoogle f = do
d <- H.defaultDatabaseLocation
H.withDatabase d (\x -> pure $ H.searchDatabase x (Txt.unpack f))
-- | Format the hoogle results so they roughly match what the terminal app would show
formatResult :: H.Target -> Text
formatResult t =
let typ = clean $ H.targetItem t in
let m = (clean . fst) <$> H.targetModule t in
Txt.pack $ fromMaybe "" m <> " :: " <> typ
clean :: [Char] -> [Char]
clean = unescapeHTML . stripTags
-- | From hoogle source: https://hackage.haskell.org/package/hoogle-5.0.16/docs/src/General-Util.html
unescapeHTML :: [Char] -> [Char]
unescapeHTML ('&':xs)
| Just x <- Lst.stripPrefix "lt;" xs = '<' : unescapeHTML x
| Just x <- Lst.stripPrefix "gt;" xs = '>' : unescapeHTML x
| Just x <- Lst.stripPrefix "amp;" xs = '&' : unescapeHTML x
| Just x <- Lst.stripPrefix "quot;" xs = '\"' : unescapeHTML x
unescapeHTML (x:xs) = x : unescapeHTML xs
unescapeHTML [] = []
-- | From hakyll source: https://hackage.haskell.org/package/hakyll-4.1.2.1/docs/src/Hakyll-Web-Html.html#stripTags
stripTags :: [Char] -> [Char]
stripTags [] = []
stripTags ('<' : xs) = stripTags $ drop 1 $ dropWhile (/= '>') xs
stripTags (x : xs) = x : stripTags xs
The remainder of the code is non-brick code for searching and formatting hoogle results.
Hopefully this example helps you get started with brick and demonstrates how easy brick makes creating terminal UIs.
In my previous post I showed how I was managing exceptions by using a “wrapper” around a record of functions. In this post I’ll show how a free monad approach can be used to achieve the same goal and compare the two techniques.
This is the requirement for the example app, which is a pipeline of jobs
The idea was to have a record of operations, different implementations of this record are possible (e.g. run locally vs run in cloud). Then a function is called to create a wrapper function for each record field and a wrapper record is created. The functions in the wrapper record catch all synchronous exceptions and convert them to ExceptT.
The benefit of this approach was that more of the code could be written with pure functions without losing the ability to deal with exceptions that could occur at any point if the operations were specialized to IO.
data Operations m = Operations { opRead :: m Text
, opWrite :: Text -> m ()
}
data OperationsWrapper m = OperationsWrapper { opRead :: ExceptT OpsError m Text
, opWrite :: Text -> ExceptT OpsError m ()
}
mkOpsWrapper :: (MonadCatch m) => I2.Operations m -> OperationsWrapper m
mkOpsWrapper o = OperationsWrapper { opRead = E.ExceptT ((Right <$> I2.opRead o) `catch` readError)
, ...
}
where
readError :: (Monad m) => SomeException -> m (Either OpsError b)
readError e = pure . Left . ErrRead $ "Error reading: " <> show e
...
Here is roughly how it worked
There are many great articles on what free monads are and how they are implemented; see the links below for some of them. So I won’t be going into detail about how they work; rather I’ll show how they can be used. Even if you’ve never used a free monad before, you may well be able to follow along with how I use them here.
A free monad is a way to build a monad from any functor. The rest of the article demonstrates why you might want to use them.
With a free monad you have a function that builds the free monad structure and one or more functions that interpret/run the AST.
As the image above illustrates, createAst generates the AST. The AST can then be passed to different interpreters that run it. With the record based approach you varied the implementation by choosing which record of functions to pass in. Here you use a different interpreter over the same free monad structure to vary the implementation. This results in a clean separation of concerns.
Note that you don’t need to use free monads to implement this pattern. You could create an AST using sum types and write interpreters that run it. The advantage of using free is that, since it is monadic, you get to use Haskell’s do notation. This makes the code that generates the AST feel “natural”; it is a simple embedded domain specific language.
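For reference, the core type behind free monads is tiny. This is the standard presentation (essentially what Control.Monad.Free provides), shown here only to make the AST idea concrete:
data Free f a = Pure a -- a finished computation
              | Free (f (Free f a)) -- one functor layer wrapping the rest

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap g <$> fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = g <$> x
  Free fg <*> x = Free ((<*> x) <$> fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= g = g a -- done: feed the result onwards
  Free fa >>= g = Free ((>>= g) <$> fa) -- push the bind down one layer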
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DeriveFunctor #-}
module Main where
import Protolude
import Control.Monad.Free
import Control.Monad.Free.TH
data OpsF m next = OpRead (Text -> next)
| OpWrite Text next
| OpLog Text next
| OpRun Text (Text -> m Text) Text (Text -> next)
deriving (Functor)
makeFree ''OpsF
type Ops m = Free (OpsF m)
I’m using template haskell and DeriveFunctor to do all the heavy lifting, i.e. makeFree creates all the functions that lift your operations into the Free monad. Not having to do this manually makes creating free monads pretty simple. If you have not used free before I’d suggest reading some of the articles linked below to understand the details, or you can just follow this pattern for now.
next is what enables the chaining. E.g. for OpWrite the next action simply follows the write; for OpRead the continuation is a function (Text -> next), so the next action receives the value that was read.
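Roughly, the smart constructors that makeFree generates look like this (a sketch; the generated code lifts each constructor with liftF):
opRead :: MonadFree (OpsF m) f => f Text
opRead = liftF $ OpRead identity -- the continuation receives the read value

opWrite :: MonadFree (OpsF m) f => Text -> f ()
opWrite t = liftF $ OpWrite t () -- nothing to pass on, next is ()

opLog :: MonadFree (OpsF m) f => Text -> f ()
opLog t = liftF $ OpLog t ()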
And here is an example using several of the DSL operations together
createAst :: (Monad m) => Text -> (Ops m) Text
createAst x = do
  opLog $ "starting: " <> x
  r <- opRead
  opWrite $ r <> x
  pure $ r <> x -- return the value that was written
After running the createAst function what you have is an AST; opRead etc. do nothing on their own. This is the magic of using free with do notation: we get to write normal-looking, pure code and we end up with an AST.
Given this AST it is possible to write different interpreters that act in various ways. E.g. one for testing, one for local, one for running in the cloud etc.
Yes; let’s write an interpreter that, similar to the record wrapper approach, catches exceptions.
However, before starting, it’s worth reiterating a few points about exceptions from my previous post. Remember that it is usually a very bad idea to catch all exceptions, as you may end up catching exceptions that you ought not to catch. See Exceptions best practices in Haskell for a good overview. There are several ways to ensure that you only catch synchronous exceptions. In these examples I’m going to be using the safe-exceptions package, which does exactly that.
Right, back to the code. In this example interpreterFile is a function that interprets the AST and uses a file to store/load the state
-- Make sure that the SafeException functions are used
import Protolude hiding (catch, throwIO)
import Control.Exception.Safe (catch, throwIO)
interpreterFile :: (Ops IO) Text -> ExceptT OpsError IO Text
interpreterFile o =
case o of
Pure a -> pure a -- no next action
(Free (OpRead n)) ->
do
r <- liftIO $ Txt.readFile "data.txt"
interpreterFile $ n r -- run next
`catch`
handler ErrRead
(Free (OpWrite t n)) ->
do
liftIO $ Txt.writeFile "data.txt" t
interpreterFile n -- run next
`catch`
handler ErrWrite
(Free (OpRun name fn t n)) ->
do
r <- lift $ fn t
interpreterFile $ n r -- run next
`catch`
handler (ErrRunning name)
(Free (OpLog t n)) -> do
putText $ "log: " <> t
interpreterFile n -- run next
where
handler :: (Monad m) => (Text -> OpsError) -> SomeException -> ExceptT OpsError m Text
handler ope e = throwE . ope $ show e -- catch exception and use ExceptT's throwE
The operations are run and any synchronous exception is caught and handled in the ExceptT. This looks pretty similar to the record based approach, but I think it is simpler.
Here is an interpreter for testing which uses a state monad to store/retrieve the state
data TestState = TestState { tstValue :: Text
, tstLog :: [Text]
} deriving (Show, Eq)
interpreterState :: (Ops (S.State TestState)) Text -> (S.State TestState) Text
interpreterState o =
case o of
Pure a -> do
S.modify (\s -> s { tstValue = a })
tstValue <$> S.get
(Free (OpRead n)) -> do
st <- S.get
interpreterState $ n (tstValue st)
(Free (OpWrite t n)) -> do
S.modify (\s -> s { tstValue = t } )
interpreterState n
(Free (OpRun _ fn t n)) -> do
r <- fn t
interpreterState $ n r
(Free (OpLog t n)) -> do
S.modify (\(TestState s ls) -> TestState s $ ls <> [t])
interpreterState n
Compare that to the previous approach’s tests
testPipeline :: [I2.Job (S.State Text)] -> Text -> S.State Text (Either I3.OpsError Text)
testPipeline jobs initial = do
let ops = I3.OperationsWrapper { I3.opRead = E.ExceptT $ do
r <- get
pure . Right $ r
The big advantage here is that the tests are no longer forced to use ExceptT. Each interpreter, for testing or otherwise, can use whatever stack is appropriate.
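For example, running the AST from createAst against the state interpreter needs nothing but runState (a sketch; the initial TestState values are arbitrary):
exampleRun :: (Text, TestState)
exampleRun = S.runState (interpreterState $ createAst "test1") (TestState "0" [])
-- result: ("0test1", TestState "0test1" ["starting: test1"])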
As always there are trade-offs; see the Free monad considered harmful article for example. While some of these issues can be addressed (e.g. see church encoding below), it is worth considering the alternatives.
Personally, so far, I’ve found free to be a great fit for what I need (e.g. selecting an implementation not based on type), but it’s definitely worth deciding on a case by case basis.
The Control.Monad.Free.Church package handles church encoding of a free monad. This can be important to do because, as it says in Control.Monad.Free.Church:
Even if the Haskell runtime optimizes some of the overhead through laziness and generational garbage collection, the asymptotic runtime is still quadratic. On the other hand, if the Church encoding is used, the tree only needs to be constructed once.
Given how easy this package makes church encoding, and how bad O(n^2) performance can be, it is almost always a good idea to do the encoding.
(I originally found getting the types correct for Church encoding a bit tricky. This Free monad and church encoding example helped clear up a lot of the confusion for me. Be sure to look at it as well if my explanation below does not help you).
To get Church encoding, the only requirement is that you use a MonadFree constraint rather than your more specific data type for the function that generates the DSL.
In the example above createAst used the specific “Ops m” type rather than MonadFree, and that is the problem.
Here is what it should look like instead; the important part is the “MonadFree (...) a” constraint in the signature.
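A sketch of the generalised version (the body is unchanged; only the signature gains the MonadFree constraint):
createAst :: (Monad m, MonadFree (OpsF m) f) => Text -> f Text
createAst x = do
  opLog $ "starting: " <> x
  r <- opRead
  opWrite $ r <> x
  pure $ r <> x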
This is how it would be run without Church encoding:
--------------------------------------------------
-- Example in IO with exception
--------------------------------------------------
let ioJobs = [ Job "j1" ioJob1
, Job "j2" ioJob2
, Job "j3" ioJob3
]
a <- runExceptT $ interpreterFile $ createAst "test1" ioJobs
print a
And this is how it’s run with Church encoding, using improve from Control.Monad.Free.Church:
--------------------------------------------------
-- Example in IO with exception
--------------------------------------------------
let ioJobs = [ Job "j1" ioJob1
, Job "j2" ioJob2
, Job "j3" ioJob3
]
-- Note that createAst must be run inline here to avoid an error about the monad constraints
ai <- runExceptT $ interpreterFile (C.improve $ createAst "test1" ioJobs)
print ai
That is all it takes; we can now use free without the O(n^2) concerns.
Free monads give us a nice way to separate pure and impure code while also handling exceptions. Overall I think this approach is more flexible and easier to read than the record of functions approach.
(TL;DR: I use a record of functions & a record of wrapper functions that catch all synchronous exceptions and convert to ExceptT for the pure code)
See also the followup post on using Free as an alternative
This is a quick overview of how I refactored one of my first production Haskell applications, which had a majority of IO code, to be more pure. I’ve seen several approaches with different trade-offs, but none that fit exactly what I was doing.
The design I needed was a pipeline of actions to be performed, where each job in the pipeline is considered user code (think plugin).
Some design notes:
Both the jobs and the pipeline should be testable
A job has a name and an IO action to perform. In this demo a job takes a text value, does some transformation and returns a resulting text value. In a real world application this record would contain more operations and probably be polymorphic
Operations is a record of operations that the pipeline uses to persist job results. I.e. this is the pluggable storage and related functions
I’ve chosen to pass a record of functions rather than using a type class. For discussion on records vs typeclasses for this type of design, see:
A pluggable set of operations feels more natural as a record of functions to me than a typeclass. For me this is because they are just a named set of functions rather than something that has a set of rules (“an algebra”).
runPipeline :: Operations -> Text -> [Job] -> IO Text
runPipeline ops init jobs = do
opWrite ops init
id <- foldlM runJob 0 jobs
putText $ "\nfinal job id = " <> show id
opRead ops
where
runJob (id :: Int) (Job name fn) = do
putText $ "running job: " <> name
prev <- opRead ops
r <- fn prev
opWrite ops r
putText $ " = " <> r
putText " ----"
pure $ id + 1
Run each job in order; for each job, print its name, read the previous result, run the job function on it and write the result back.
(Obviously I could simply pass the previous state as part of the fold and do away with the load and store. It’s done this way for the demonstration.)
import Protolude
import qualified Data.Text.IO as Txt
import qualified Step1.Impl1 as I
readFileOp :: FilePath -> IO Text
readFileOp = Txt.readFile
writeFileOp :: FilePath -> Text -> IO ()
writeFileOp = Txt.writeFile
mkFileOps :: FilePath -> I.Operations
mkFileOps p =
I.Operations { I.opRead = readFileOp p
, I.opWrite = writeFileOp p
}
Here the operations are implemented by reading and writing to a file, using Data.Text.IO. The example code on github also has an STM backed storage implementation.
job1 :: Text -> IO Text
job1 v = do
putText "in job1"
pure $ "1:" <> v
job2 :: Text -> IO Text
job2 v = do
putText "in job2"
void . throwIO $ DemoException "oops"
pure $ "2:" <> v
job3 :: Text -> IO Text
job3 v = do
putText "in job3"
pure $ "3:" <> v
newtype DemoException = DemoException Text
instance Show DemoException where
show (DemoException s) = Txt.unpack s
instance Exception DemoException
These are three example jobs. Notice that the second one explicitly throws an exception. A job can explicitly throw an exception like this or it could throw an exception on failure (e.g. network error)
import Protolude hiding (catch, throwIO)
import qualified Step1.Impl1 as I1
import qualified Step1.Storage1 as S1
main :: IO ()
main = do
let jobs = [ I1.Job "j1" job1
, I1.Job "j2" job2
, I1.Job "j3" job3
]
let ops = S1.mkFileOps
r <- I1.runPipeline ops "0" jobs
putText r
When run, this will fail with an exception in job 2, resulting in the application terminating:
---------
Demo1: in memory
---------
running job: j1
in job1
= 1:0
----
running job: j2
in job2
Exception: oops
So far the example has a working pipeline, but everything is in IO and it does nothing about exceptions.
Let’s remove some IO constraints.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Step2.Impl2 ( Operations (..)
, Job (..)
, runPipeline
) where
import Protolude
data Operations m = Operations { opRead :: m Text
, opWrite :: Text -> m ()
, opLog :: Text -> m ()
}
data Job m = Job { jobName :: Text
, jobFn :: Text -> m Text
}
runPipeline :: (Monad m) => Operations m -> Text -> [Job m] -> m Text
runPipeline ops init jobs = do
opWrite ops init
id <- foldlM runJob 0 jobs
opLog ops $ "\nfinal job id = " <> show id
opRead ops
where
runJob (id :: Int) (Job name fn) = do
opLog ops $ "running job: " <> name
prev <- opRead ops
r <- fn prev
opWrite ops r
opLog ops $ " = " <> r
opLog ops " ----"
pure $ id + 1
Here I’ve removed all explicit IO actions from the pipeline types and functions.
opRead has changed from IO Text to m Text, i.e. the type no longer explicitly says IO but now accepts any type of kind * -> * (e.g. any monad).
The Operations type has correspondingly changed from Operations to Operations m, and runPipeline has changed from Operations -> Text -> [Job] -> IO Text to (Monad m) => Operations m -> Text -> [Job m] -> m Text.
There is a constraint saying that m must be a monad. This is so that we can use the monadic type class (pure, >>= etc)
And we use the Operations m and Job m types
These changes mean that the pipeline and its types no longer require IO.
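As a quick illustration (a hypothetical sketch, not from the post): because nothing requires IO any more, the same pipeline can run in the State monad with no IO at all:
import qualified Control.Monad.State.Strict as S

stateOps :: Operations (S.State Text)
stateOps = Operations { opRead = S.get -- the stored value is the state itself
                      , opWrite = S.put
                      , opLog = \_ -> pure () -- discard log output in tests
                      }
-- given jobs :: [Job (S.State Text)]:
-- S.runState (runPipeline stateOps "0" jobs) "" runs the whole pipeline purely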
The storage implementations need IO, so you specialize the m to IO. Apart from that nothing changes, i.e.
readFileOp :: FilePath -> IO Text
readFileOp = Txt.readFile
writeFileOp :: FilePath -> Text -> IO ()
writeFileOp = Txt.writeFile
mkFileOps :: FilePath -> Operations IO
mkFileOps p =
Operations { opRead = readFileOp p
, opWrite = writeFileOp p
, opLog = putText
}
Notice that I added opLog. It is used for logging rather than calling putText etc., which cannot be done in the pipeline since there is no IO (or MonadIO constraint).
import Protolude hiding (catch, throwIO)
import qualified Step2.Storage2 as S2
import qualified Step2.Impl2 as I2
main :: IO ()
main = do
let jobs = [ I2.Job "j1" job1
, I2.Job "j2" job2
, I2.Job "j3" job3
]
let ops = S2.mkFileOps
r <- I2.runPipeline ops "0" jobs
putText r
Note that we can use the same jobs as we did in step 1 (job1, job2 and job3).
This simple change has already resulted in a few nice improvements.
That is pretty good for not much extra work. However I have done nothing about exceptions, and when this code runs it still breaks with an exception as it did before.
If you have not seen the Exceptions best practices in Haskell article, I think it’s worth looking at before continuing. The two points from the article I want to address are:
Not only must we accept that jobs can fail with exceptions, but we should treat this as normal, i.e. when someone is writing an IO job they should be able to throw exceptions and the pipeline should handle them.
However I don’t want the pure code to have to deal with exceptions; I’d much rather deal with an Either type there. This may seem to contradict point 1 from the article above, but it does not: the jobs perform IO and thus should deal with exceptions, while the pipeline is pure and should not, so ExceptT/Either is fine.
And so we need something to bridge the gap between the two worlds.
data OpsError = ErrRead Text
| ErrWrite Text
| ErrLogging Text
| ErrRunning Text
deriving (Show, Eq)
data OperationsWrapper m = OperationsWrapper { opRead :: ExceptT OpsError m Text
, opWrite :: Text -> ExceptT OpsError m ()
, opLog :: Text -> ExceptT OpsError m ()
, opRun :: (Text -> m Text) -> Text -> ExceptT OpsError m Text
}
OperationsWrapper is a record with a wrapper function for each function from the Operations record. Each function will run the corresponding Operations function, catch any synchronous exception and convert it into an ExceptT transformer type. (See the section on catching all exceptions below.)
The code for the wrapper is not terribly pretty, but I feel it’s a fair price to pay for separating the two concerns without resorting to any higher level magic (that I’m not comfortable with yet).
mkOpsWrapper :: (MonadCatch m) => I2.Operations m -> OperationsWrapper m
mkOpsWrapper o =
OperationsWrapper { opRead = E.ExceptT ((Right <$> I2.opRead o) `catch` readError)
, opWrite = \t -> E.ExceptT ((Right <$> I2.opWrite o t) `catch` writeError)
, opLog = \t -> E.ExceptT ((Right <$> I2.opLog o t) `catch` logError)
, opRun = \fn t -> E.ExceptT ((Right <$> fn t) `catch` logError) -- note: reuses logError, so job failures are reported as ErrLogging (see the output below)
}
where
readError :: (Monad m) => SomeException -> m (Either OpsError b)
readError e =
pure . Left . ErrRead $ "Error reading: " <> show e
writeError :: (Monad m) => SomeException -> m (Either OpsError b)
writeError e =
pure . Left . ErrWrite $ "Error writing: " <> show e
logError :: (Monad m) => SomeException -> m (Either OpsError b)
logError e =
pure . Left . ErrLogging $ "Error logging: " <> show e
Each function works as follows: call the corresponding “wrapped” function and on success return the result as a Right value; catch any exception and return it as a Left OpsError.
runPipeline :: (Monad m) => OperationsWrapper m -> Text -> [I2.Job m] -> m (Either OpsError Text)
runPipeline ops init jobs = runExceptT $ do
opWrite ops init
id <- foldlM runJob 0 jobs
opLog ops $ "\nfinal job id = " <> show id
opRead ops
where
runJob (id :: Int) (I2.Job name fn) = do
opLog ops $ "running job: " <> name
prev <- opRead ops
r <- opRun ops fn prev -- don't just lift, use opRun
opWrite ops r
opLog ops $ " = " <> r
opLog ops " ----"
pure $ id + 1
The changes from Step2 are small: runPipeline now takes the OperationsWrapper and returns an Either OpsError Text, with the body run inside runExceptT.
Also notice that I added an opRun function to wrap the running of the job. If you just lifted the job’s run function, its exceptions would not be handled, so we need a wrapper function for this as well.
import Protolude
import qualified Step2.Storage2 as S2
import qualified Step2.Impl2 as I2
import qualified Step3.Impl3 as I3
main :: IO ()
main = do
let jobs = [ I2.Job "j1" job1
, I2.Job "j2" job2
, I2.Job "j3" job3
]
let ops = S2.mkFileOps
r <- I3.runPipeline (I3.mkOpsWrapper ops) "0" jobs
case r of
Right x -> putText $ "Success: " <> x
Left e -> putText $ "Exception: " <> show e
When run, this will catch the exception in job 2 and correctly report the error, i.e. there is no runtime failure:
Demo3: use file
---------
running job: j1
in job1
= 1:0
----
running job: j2
in job2
Exception: ErrLogging "Error logging: oops"
For completeness, here is an example of how the pipeline can be tested using only pure code. I’m using the State monad (which you may not want to use in production, but for my tests and this example it’s just fine). Skip this section if you are not interested in the tests.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
module PipelineSpec where
import Protolude
import Test.Hspec
import qualified Control.Monad.Except as E
import qualified Control.Monad.State.Strict as S
import qualified Step2.Impl2 as I2
import qualified Step3.Impl3 as I3
spec :: Spec
spec = do
describe "simple pipeline" $ do
it "should run in correct order" $ do
let jobs = [ I2.Job "j1" job1
, I2.Job "j2" job2
]
let (r, _) = S.runState (testPipeline jobs "0") ""
r `shouldBe` (Right "2:1:0")
testPipeline :: [I2.Job (S.State Text)] -> Text -> S.State Text (Either I3.OpsError Text)
testPipeline jobs initial = do
let ops = I3.OperationsWrapper { I3.opRead = E.ExceptT $ do
r <- get
pure . Right $ r
, I3.opWrite = \t -> E.ExceptT $ do
put $ t
pure . Right $ ()
, I3.opRun = \fn t -> E.ExceptT $ do
r <- fn t
pure . Right $ r
, I3.opLog = \t -> E.ExceptT . pure . Right $ ()
}
I3.runPipeline ops initial jobs
job1 :: Text -> (S.State Text) Text
job1 v = pure $ "1:" <> v
job2 :: Text -> (S.State Text) Text
job2 v = pure $ "2:" <> v
Catching all exceptions is generally considered to be a bad idea. See for example the docs for Control.Exception.
For a thorough discussion of alternatives see Catching all exceptions from www.schoolofhaskell.com
In this example I am using the safe-exceptions package. The catch function comes from Control.Exception.Safe not Control.Exception so only synchronous exceptions are caught. Take a look at the source code to see the cabal file and the explicit imports from Control.Exception.Safe
Another approach, which @thumphriees pointed out to me on twitter (and which is discussed in the “Catching All Exceptions” article), is to use the async library to help with exceptions. This is the approach I’ll probably be using in the production version of this code, as for almost no extra cost it gives me simple timeout and cancellation control.
As you can see from the code below, this is pretty simple to use:
demoAsyncCatch :: IO ()
demoAsyncCatch = do
r <- async jobBad >>= waitCatch
case r of
Right _ -> putText "demo async - Right"
Left e -> putText $ "demo async - Left: " <> show e
where
jobBad = do
putText "in jobBad"
void . throwIO $ DemoException "oops"
The obvious downside to this method is that you end up with boilerplate code for the wrapper. You’ll have to decide whether this is a problem for you or not. For me it is working really well so far.
This is a quick overview of how you can use PureScript on a Raspberry Pi to do GPIO.
The version of NodeJS available in most of the distros is quite old. There are a few ways to get a newer version. Personally I found that installing the same version of node that I have on my desktop worked best. To do this, get the version you need from https://nodejs.org/dist/.
Then run the commands with the appropriate version numbers
If you prefer to get the latest version, and a more automated install, then you can follow the instructions here: Beginner’s Guide to Installing Node.js on a Raspberry Pi.
PureScript itself does not seem to install on the pi, so you’ll need to compile on a desktop machine and copy the resulting JavaScript files across.
GPIO can quite easily be done using unix files. This is not particularly fast, but it should be more than sufficient for many use cases. It is also simple enough that it can even be done directly from the shell. If you need faster access, FFI to one of the Broadcom libraries is probably the way to go.
The GPIO ports are exposed as files under /sys/class/gpio. To use a GPIO pin for basic IO you need to do the following:
write the pin number to /sys/class/gpio/export to “open” the pin
write “in” or “out” to /sys/class/gpio/gpioNN/direction (where NN is the pin number)
write to /sys/class/gpio/gpioNN/value to set an output pin high or low
read from /sys/class/gpio/gpioNN/value to get the pin’s current value
The Raspberry Pi pin numbers are a little confusing. There are different models of the pi (A, B, B+, rev 2 etc). There are also different numbering schemes i.e. pin numbers, gpio numbers, wiring pi numbers…
When you are looking at existing code or examples make sure you know which scheme is being used.
Here is a good reference showing pin number and GPIO numbers
As shown above the physical pin numbers are used when working with GPIO, so all the functions here work with a pin number (Pin newtype). Then there is a module per board that can be used to map from a logical GPIO number to a pin. In this example I’ve only defined the layout for the Rev 2 A & B P1 boards.
import Gpio (Pin (..))
data GpioPin = GpioPin2
| GpioPin3
| GpioPin4
| GpioPin17
| GpioPin27
| GpioPin22
| GpioPin10
| GpioPin9
| GpioPin11
| GpioPin14
| GpioPin15
| GpioPin18
| GpioPin23
| GpioPin24
| GpioPin25
| GpioPin8
| GpioPin7
toPin :: GpioPin -> Pin
toPin g =
case g of
GpioPin2 -> Pin 3
GpioPin3 -> Pin 4
GpioPin4 -> Pin 7
GpioPin17 -> Pin 11
GpioPin27 -> Pin 13
GpioPin22 -> Pin 15
GpioPin10 -> Pin 19
GpioPin9 -> Pin 21
GpioPin11 -> Pin 23
GpioPin14 -> Pin 8
GpioPin15 -> Pin 10
GpioPin18 -> Pin 12
GpioPin23 -> Pin 16
GpioPin24 -> Pin 18
GpioPin25 -> Pin 22
GpioPin8 -> Pin 24
GpioPin7 -> Pin 26
To “open” a port you write the port number to the export file. This function first checks whether the port has already been exported:
open :: Pin -> forall e. Eff (fs :: FS, err :: EXCEPTION | e) Unit
open (Pin pin) = do
e <- S.exists $ "/sys/class/gpio/gpio" <> show pin
if not e
then
S.writeTextFile E.ASCII "/sys/class/gpio/export" (show pin)
else
pure unit
Set a port to be in or out. Remember that you can read a value from a port that is set to out; you read whether the port was set on or off. Reading from a port set to in means that you are reading the value from the pin, i.e. whether the pin is being held high or low by an external input.
setDirection :: Pin -> Direction -> forall e. Eff (fs :: FS, err :: EXCEPTION | e) Unit
setDirection (Pin pin) dir = do
let dirStr = case dir of
In -> "in"
Out -> "out"
S.writeTextFile E.ASCII ("/sys/class/gpio/gpio" <> show pin <> "/direction") dirStr
Set an output pin to high or low
setValue :: Pin -> Boolean -> forall e. Eff (fs :: FS, err :: EXCEPTION | e) Unit
setValue (Pin pin) on = do
let val = if on then "1" else "0"
S.writeTextFile E.ASCII ("/sys/class/gpio/gpio" <> (show pin) <> "/value") val
Read the current value from the pin
getValue :: Pin -> forall e. Eff (fs :: FS, err :: EXCEPTION | e) Boolean
getValue (Pin pin) = do
val <- Str.trim <$> S.readTextFile E.ASCII ("/sys/class/gpio/gpio" <> show pin <> "/value")
pure $ val /= "0"
main :: forall e. Eff (console :: CONSOLE, fs :: FS, err :: EXCEPTION | e) Unit
main = do
let pin = R.toPin R.GpioPin24
R.open pin
v <- R.getValue pin
R.setDirection pin R.Out
R.setValue pin $ not v
You then need to copy everything from the output folder to a folder on the pi.
Finally, copy your compiled bundle across.
Note that you only need to copy the output folder again when you add more imports; otherwise just copy your bundle, to save time.
This is part two in a two part blog series about Haskell terminal applications. This post shows a simple text classification implementation using the techniques from part one.
There are many ways to classify documents, ranging from simple to very complex. The algorithm I’m using here is called Tf-Idf, or “term frequency / inverse document frequency”. There are a number of sites that explain how it works in more detail than I will here. See for example
Basically Tf-Idf counts the number of times a term occurs (term frequency) and combines that with a negative weighting for the number of times the term occurs in all categories. This means that common words that exist in multiple categories are going to count less towards the final score.
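Written out as formulas (these match the calcTermFreq and calcTermIdf implementations later in this post):
$$\mathrm{tf}(t,d) = \frac{\mathrm{count}(t,d)}{|d|}, \qquad \mathrm{idf}(t) = \log\frac{N+1}{n_t+1}, \qquad \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$$
where N is the number of category documents and n_t is the number of documents containing the term t; the +1 terms avoid division by zero.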
There are also multiple ways that Tf-Idf itself can be implemented, e.g. with different algorithms for weighting the Tf vs the Idf, or using n-grams (where n > 1). I’m going with a pretty simple implementation, but even with that I’ve seen pretty accurate results with the classifications I’m doing. I’m primarily using this for classifying short sentences of text, so it has only been tested for simple matching on relatively small documents.
You can get the source for TextClassify at https://github.com/andrevdm/TextClassify. The code is reasonably well commented IMO, so I won’t go into too much detail here on every line of code.
Below I’ll discuss some implementation details not covered by the code comments.
There are a large number of existing terminal applications so it often makes sense to use this existing functionality as well as writing terminal applications so that they too can be reused.
The CSV files I work with have a header that needs to be removed. Here is an awk script (removePrefix.awk) to do that.
This script can be used to pre-process the CSV file.
The higher the quality of the input data to the classification algorithm the better the results will be. Some regular expressions can clean up the input text nicely. Here is a sed script that does this
#!/bin/sed -uf
s/c\*/ /gi
s/jan\|feb\|mar\|apr\|may\|jun\|jul\|aug\|sep\|oct\|nov\|dec/ /gi
s/ \+$//gi
s/\[\(\)!\-\/*\\\]/ /g
s/[\.*\/\(\)_,\]/ /g
s/-/ /g
s/\t/ /g
s/ \+/ /g
s/ \+$//gi
s/^ \+//gi
This sed script removes some common words (the months), removes special characters and multiple spaces. You can customise this or create one per type of input as required. The -u parameter is important as it disables buffering which may interfere with line-by-line processing.
The TextClassify application will start sed and keep it running. A line of input data will be passed to it and the result read back a line at a time.
column can be used to show CSV data as an aligned table in the terminal. I’ll use this later to show the results of the classification.
See Args.hs
As part one showed I’m using OptParse-generic to parse the command line arguments.
data Arguments = Arguments {train :: Text <?> "Path to training data"
,input :: Maybe Text <?> "Input file to categorise. If missing stdin will be used"
,parser :: Maybe Text <?> "Parser type, defaults to lines. Options are lines/detail/csv"
,popts :: Maybe Text <?> "Parser options"
,clean :: Maybe Text <?> "Options name of text cleaner - see docs"
} deriving (Generic, Show)
instance ParseRecord Arguments
This is the resulting help text from these arguments
Usage: txtcls --train STRING [--input STRING] [--parser TEXT] [--popts TEXT]
[--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Options name of text cleaner - see docs
These arguments are then interpreted and stored in the Options type
data Options = Options {trainingPath :: Text
,parserType :: Text
,parserOptions :: Maybe Text
,txtCleaner :: Text -> IO Text
,hin :: Handle
,hout :: Handle
}
hin is set to the handle of the input stream: stdin if no --input parameter is present, else the handle for the file.
hin_ <- case unHelpful $ input args of
Just t ->
openFile (Txt.unpack t) ReadMode
Nothing ->
pure stdin
Above I showed a sed script that could be used to clean the input text. However, because this application can use a CSV as the input, it can’t simply apply the cleaning to the entire file or even an entire line; only the text being classified should be cleaned. To do this an instance of sed is started and fed the text to clean one line at a time. (Actually any app could be used, as long as it reads and writes one line at a time.) The name of the app/script to use is given by the --clean parameter.
-- | Build a 'cleaner'
getCleaner :: Maybe Text -> IO (Text -> IO Text)
getCleaner mcmd =
case mcmd of
-- | The cleaner uses the external process to do the actual cleaning. One line is written to the process's stdin and then a value is read from its stdout
Just cmd -> do
(Just inp, Just outp, _, phandle) <- createProcess (proc (Txt.unpack cmd) []) { std_out = CreatePipe, std_in = CreatePipe }
hSetBuffering outp NoBuffering
hSetBuffering inp LineBuffering
pure $ cleanText inp outp
-- | No external cleaner. Just make the text lower case
Nothing ->
pure $ pure . Txt.toLower
-- | Used by getCleaner to build a curried cleaner function
cleanText :: Handle -> Handle -> Text -> IO Text
cleanText inp outp txt = do
hPutStrLn inp $ Txt.unpack (Txt.toLower txt)
pure . Txt.pack =<< hGetLine outp
The getCleaner function is passed the (optional) name of the cleaner script. If a script was specified then a process is started and a curried cleanText function is returned as the cleaning function. If no script was specified then the returned cleaning function simply performs a toLower on the text.
cleanText writes a line to the process’s input handle and then immediately reads the response line.
TextClassifier has three parsers: lines, detail and csv. whileM_ is used to read a line of input at a time; the line is then passed to the appropriate parser.
-- | Read input a line at a time and pass it to the parser
whileM_ (not <$> IO.hIsEOF (Args.hin opts)) $ do
-- | line of data
origChars <- IO.hGetLine $ Args.hin opts
let origLine = Txt.pack origChars
-- | parse the line and get the results to display
parsed <- case parser of
---
See ClassifyCsv.hs and Classify.hs
I’m using Cassava to read the CSV file as well as creating the output csv. Since I’m not interpreting any of the data apart from the text to be classified I’m simply reading the CSV as a vector of Text.
let contents = BL8.pack . Txt.unpack $ line in
let parsed = decode NoHeader contents :: Either [Char] (V.Vector [Text]) in
Given a vector of Text it is simple to get the column containing the text to classify. The parseCsvLine function returns a ParsedLine a type which contains the text to be classified.
newtype RawText = RawText Text deriving (Show)
data ParsedLine a = ParsedLine RawText a deriving (Show)
Remember that each line of data must be cleaned. Rather than having parseCsvLine live in IO, it returns a ParsedLine a. The code in Main then calls the cleaner and passes the resulting CleanedLine a to categoriseCsvLine. This limits the amount of code that needs to be in IO. It also makes the code easier to test (e.g. from the REPL), as the two functions can be tested independently.
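CleanedLine itself is not shown above; a plausible shape for that step (hypothetical names, following the same pattern as ParsedLine) is:
newtype CleanText = CleanText Text deriving (Show)
data CleanedLine a = CleanedLine CleanText a deriving (Show)

-- | Clean a parsed line in IO, leaving the payload untouched
cleanParsedLine :: (Text -> IO Text) -> ParsedLine a -> IO (CleanedLine a)
cleanParsedLine cleaner (ParsedLine (RawText raw) a) = do
  cleaned <- cleaner raw
  pure $ CleanedLine (CleanText cleaned) a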
See ClassifyIO.hs
The training set is a directory with a file per category. Each file contains the words for that category. To load the files the loadTrainingSet function is used
newtype Category = Category Text deriving (Show, Eq, Ord)
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
loadTrainingSet :: Args.Options -> FilePath -> IO TrainingSet
All .txt files in the directory are loaded and result in a category of words.
See TfIdf.hs
To review the terminology
The train function takes a TrainingSet and creates a TrainedData:
-- | A term is a single word
newtype Term = Term Text deriving (Show, Eq, Ord)
-- | A category name
newtype Category = Category Text deriving (Show, Eq, Ord)
-- | Term frequency value
newtype Tf = Tf Double deriving (Show)
-- | Inverse document frequency value
newtype Idf = Idf Double deriving (Show)
-- | The combined Tf and Idf value
newtype TfIdf = TfIdf Double deriving (Show)
-- | A document is a map of terms to TfIdf
newtype Document = Document (Map Term TfIdf) deriving (Show)
-- | Data making up the training set
data TrainingSet = TrainingSet [(Category, [Text])] deriving (Show)
-- | The trained data, each category linked to a document
data TrainedData = TrainedData [(Category, Document)] deriving (Show)
train :: TrainingSet -> TrainedData
Classifying text is handled by the classify function: given a collection of words it returns the best matching category, if one was found. classifyDetail returns all possible matches, sorted best match first. Both functions use categoriseWords to do the actual classification.
-- | Classify a line of text and try get the best matching category
classify :: Args.Options -> TrainedData -> Text -> Maybe (Category, Double) -- In Classify.hs
-- | Classify a line of text and get all matching categories, best first
classifyDetail :: TrainedData -> Text -> [(Category, Double)] -- In Classify.hs
categoriseWords :: TrainedData -> [Text] -> [(Category, Double)] -- in TfIdf.hs
To calculate the Tf and the Idf values the following two functions are used
-- | Calculate the term frequency for a collection of words
-- | Tf = occurrences / terms in document
calcTermFreq :: [Text] -> Map Term Tf
calcTermFreq terms =
-- | Map of term to number of occurrences
let freq = Map.fromListWith (+) [(Term t, 1) | t <- terms] in
-- | Document of Term to freq. Tf = occurrence count / terms in doc
(\d -> Tf $ d / fromIntegral(length terms)) <$> freq
-- | Claculate the term's inverse document frequency
-- | Idf = (tf + 1) / (number of documents + 1)
-- | + 1 is used to avoid divide by zero
calcTermIdf :: [Map Term a] -> Term -> Idf
calcTermIdf termToTfs term =
let docsWithTerm = filter identity (Map.member term <$> termToTfs) in
Idf $ log (((fromIntegral . length $ termToTfs) + 1) / ((fromIntegral . length $ docsWithTerm) + 1))
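As a quick sanity check (hypothetical data, not from the project): with two documents where "cat" occurs in only one of them, the functions above give
demo :: (Map Term Tf, Idf)
demo =
  let doc1 = calcTermFreq ["cat", "cat", "dog", "fish"] -- Tf "cat" = 2/4 = 0.5
      doc2 = calcTermFreq ["dog", "bird"]
  in (doc1, calcTermIdf [doc1, doc2] (Term "cat")) -- Idf = log ((2+1)/(1+1)) = log 1.5 ≈ 0.405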
Notice that there is no need for IO at all in the TfIdf module. It is given a loaded training set and cleaned text to classify.
The classification is then just finding the category with the closest matching tf-idf value
-- | Calculate how well terms matches categories
compareToCategory :: [(Term,TfIdf)] -> (Category, Document) -> (Category, Double)
compareToCategory searchTfIdf (cat, Document catMap) =
let catList = Map.toList catMap in
-- | common words in the category and the search text
let common = Lst.intersectBy sameTerm catList searchTfIdf in
let commonV = sum $ valFromTfIdf . snd <$> common in
-- | Sum of all the TfIdf values
let allV = sum (valFromTfIdf . snd <$> searchTfIdf) + sum (valFromTfIdf . snd <$> catList) in
-- | Similarity = ((common a) + (common b)) / (sum all tfIdf)
(cat, (commonV * 2) / allV)
This will install txtcls into your local stack bin folder.
txtcls - Text Classifier. Version 0.1.2
Usage: txtcls --train TEXT [--input TEXT] [--parser TEXT] [--popts TEXT]
[--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Options name of text cleaner - see docs
The examples folder contains scripts showing how txtcls can be used.
The simplest example classifies a file a line at a time:
txtcls --train ./trainingData --input egLines.txt --parser lines --clean ./cleanText.sed
Where --train points at the training data directory, --input is the file to classify, --parser lines parses the input a line at a time and --clean names the cleaning script.
Using --parser detail instead shows all the matching categories, best match first.
For CSV input, --parser csv is used together with --popts 2 (the parser options), and the output can be piped through column -s , -t to display it as an aligned table:
skipLines.awk egCsvWithHeader.csv | txtcls --train ./trainingData --parser csv --popts 2 --clean ./cleanText.sed | column -s , -t
Where ./skipLines.awk egCsvWithHeader.csv strips the CSV header; there is no --input parameter, so the input is read from stdin (here the output of awk).
The source code for TextClassify is available and commented. You should hopefully be able to look at it and understand how it was implemented.
The important things to notice are the clean separation between the pure classification code and the IO code, and the reuse of existing terminal tools (sed, awk and column) in the pipeline.
This is part one in a two part blog series about Haskell terminal applications. In this post I’ll cover some techniques for writing a Haskell application that behaves well as a shell application. In part two I’ll show a simple text classification implementation using these techniques.
OptParse-generic makes parsing command line arguments easy. Doing this manually is tedious and not terribly interesting, so it’s great to have a simple library that handles it well.
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeOperators #-}
data Arguments = Arguments {train :: Text <?> "Path to training data"
,input :: Maybe Text <?> "Input file to categorise. If missing stdin will be used"
,parser :: Maybe Text <?> "Parser type, defaults to lines. Options are lines/detail/csv"
,popts :: Maybe Text <?> "Parser options"
,clean :: Maybe Text <?> "Options name of text cleaner - see docs"
} deriving (Generic, Show)
instance ParseRecord Arguments
getArgs :: IO Arguments
getArgs = do
args <- getRecord "Your app name here."
pure (args :: Arguments)
The <?> operator lets you specify help text for each argument. Running your app with --help will print the help message using this text:
Usage: appName --train STRING [--input STRING] [--parser TEXT] [--popts TEXT]
[--clean TEXT]
Available options:
-h,--help Show this help text
--train TEXT Path to training data
--input TEXT Input file to categorise. If missing stdin will be
used
--parser TEXT Parser type, defaults to lines. Options are
lines/detail/csv
--popts TEXT Parser options
--clean TEXT Options name of text cleaner - see docs
It is often useful to allow terminal apps to get their input data either from an input file or have it piped to them (stdin). System.IO defines a set of functions for reading and writing IO that all take an explicit handle, for example hGetLine. System.IO also defines the stdin, stdout and stderr standard IO handles. This means that you can pass either stdin or a file handle to hGetLine and it will work the same.
In the example arguments above I’ve allowed the user to specify an input file by using the –input option. If that is missing stdin is used.
handle <- case unHelpful $ input args of
Just t ->
openFile (Txt.unpack t) ReadMode
Nothing ->
pure stdin
Notice that unHelpful is called to get the value from a record field that has a description.
While you could use the parsed data directly, I chose rather to create another record, populated from the parsed command line arguments. This allows me to have names tailored for the command line in one record and names tailored for my code in another. In this interpreted record I also store the handle to use for input:
getOptions :: IO Options
getOptions = do
args <- getRecord "Your app name here."
hin_ <- case unHelpful $ input args of
Just t ->
openFile (Txt.unpack t) ReadMode
Nothing ->
pure stdin
pure Options {trainingPath = unHelpful (train args)
,parserType = fromMaybe "lines" $ unHelpful (parser args)
,parserOptions = unHelpful (popts args)
,hout = stdout
,hin = hin_
}
It is often desirable to allow terminal applications to process and respond to a single line of data at a time (e.g. like sed). There are several ways to do this in Haskell. One of the simplest is to use whileM_ (from the monad-loops package) and check for EOF.
import System.IO
--where inputH is the input handle stdin or a file as above
hSetBuffering inputH LineBuffering
whileM_ (not <$> hIsEOF inputH) $ do
line <- hGetLine inputH
--
Using an existing terminal application is a quick way to leverage existing functionality. For example you may want to use sed to manipulate some text. This is fairly straightforward in Haskell.
import System.IO
import System.Process
(Just inp, Just outp, _, phandle) <- createProcess (proc "command_name_here" []) { std_out = CreatePipe, std_in = CreatePipe }
hSetBuffering outp NoBuffering
hSetBuffering inp LineBuffering
You now have an input (inp) and an output (outp) handle for the application. If the application supports line-at-a-time input from stdin you can simply write your data and read the result back. Alternatively you may need to write the entire contents and wait for a result.
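Putting those pieces together, here is a minimal sketch that round-trips one line through an external process (using cat, which simply echoes its input):
import System.IO
import System.Process

main :: IO ()
main = do
  (Just inp, Just outp, _, _) <- createProcess (proc "cat" [])
                                   { std_out = CreatePipe, std_in = CreatePipe }
  hSetBuffering outp NoBuffering
  hSetBuffering inp LineBuffering
  hPutStrLn inp "hello" -- write one line to the child process
  putStrLn =<< hGetLine outp -- read the echoed line back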
See part two for these techniques in use in a text classification application.
I’ve always found creating tables that need rowspans a bit awkward. Having to skip TDs in the rows means that the logic to create each row can get a bit ugly. Typically I just want to map over an array of data and get an array of TDs. Fortunately it’s quite easy to create a function to make this easier. To keep it interesting I’m going to implement it in PureScript and JavaScript, and compare a Haskell version of the function too.
As an example, imagine I have a set of data to display that shows the days an event occurred, grouped by year and month. Starting at the end, this is what I want the result to look like: each year spans over its months and days, and each month spans over its days.
Here is how the data could be represented in PureScript using records
let dates = [{year:2016
,months:[{month:11
,days:[{day:30}
,{day:10}
]
}
,{month:2
,days:[{day:15}
,{day:3}
,{day:1}
]
}
]
}
,{year:2015
,months:[{month:5
,days:[{day:20}
,{day:17}
]
}
]
}
]
Now we need something to map this data to. The simplest thing to do is to map it to a structure where each row contains the same number of columns and have a secondary function manage the shenanigans of omitting TDs that are spanned over.
Here is a possible definition for the row and column types (each column has a span; each row has an array of columns):
type Col c = { span :: Int | c }
type Row r c = { cols :: Array (Col c) | r }
Each row has columns and each column has a span. PureScript has extensible record types that support row polymorphism. What that means here is that we can define a function that accepts these types but the data actually passed to the function can be a superset of the required values. React for example needs a unique element id and obviously each column needs a value to be displayed. Rather than forcing this on the user of the rowSpan algorithm however we just get PureScript to check for the actual fields we need and let the caller add whatever extra data they need.
Here is a function to take the data above and creates an array of Rows.
let rowData = concat $ concat $ map
(\y -> map
(\m -> map
(\d -> {rid:y.year * 10000 + m.month * 100 + d.day -- unique id per day row
,cols:[{val:show y.year, span:length $ concat $ map (\a -> a.days) y.months}
,{val:show m.month, span:length m.days}
,{val:show d.day, span:1}
]
})
m.days)
y.months)
dates
The result of this is an array of Rows, i.e. the data has been flattened. For the example data above the first few rows look something like this:
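{rid: 20161130, cols: [{val: "2016", span: 5}, {val: "11", span: 2}, {val: "30", span: 1}]}
{rid: 20161110, cols: [{val: "2016", span: 5}, {val: "11", span: 2}, {val: "10", span: 1}]}
{rid: 20160215, cols: [{val: "2016", span: 5}, {val: "2", span: 3}, {val: "15", span: 1}]}
-- ...and so on for the remaining day rows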
Each Row has a rid (row id) field. Each column has a val (value) and a span field. The Col’s val contains the text to display and the span is the number of rows to span. This is simple to calculate: for a month it’s the number of days in that month; for a year it’s the total number of days in all months in the year.
Given a grid of data (Array of Row) we need a function that can remove the cols that are to be spanned over. This result can then be mapped over to generate the required TRs and TDs.
Effectively the function must walk the rows while keeping a running collapse counter for each column: when a column’s counter is zero the column is kept and the counter is reset to that column’s span minus one; otherwise the column is dropped and the counter is decremented. Tracing this over the example data, with the collapse values shown as each row is processed:

[0,0,0] 2016 | 11 | 30 -> keep all three cells
[4,1,0] 2016 | 11 | 10 -> keep only the day cell
[3,0,0] 2016 | 2 | 15 -> keep the month and day cells
[2,2,0] 2016 | 2 | 3 -> keep only the day cell
[1,1,0] 2016 | 2 | 1 -> keep only the day cell
[0,0,0] 2015 | 5 | 20 -> keep all three cells
[1,1,0] 2015 | 5 | 17 -> keep only the day cell

Notice that a column is removed unless its collapse value is zero.
And here is the PureScript code to do this
collapseTableArray :: forall r c. Array (Row r c) -> Array (Row r c)
collapseTableArray rows =
-- | To create the initial collapse array, we need to know the number of cols in a row
-- | Get the number of cols in each row and then get the minimum value
-- | PureScript being safe ensures that the empty list case is handled
case minimum $ (\r -> length r.cols) <$> rows of
Just m ->
-- | Initial collapse array of zeros
let collapse = replicate m 0 in
-- | fold rows with collapseRow
let fixed = foldl collapseRow {collapse: collapse, st: []} rows in
fixed.st
_ -> []
where
-- | The fold function
collapseRow :: forall rr cc ss. (CollapseState rr cc ss) -> Row rr cc -> (CollapseState rr cc ss)
collapseRow state row =
-- | Zip the previous collapse array and the current cols array
-- | This results in an array of (collapse, col) tuples
let skipCols = zip state.collapse row.cols in
-- | Get all cols where the collapse value is less than 1
-- | First the list is filtered by checking the collapse value (fst in array)
-- | Then snd is mapped over each item to get only the column
-- | Note that this results in the selected columns being unaltered and all additional information (fields)
-- | in the columns being retained
let nextCols = snd <$> filter (\t -> fst t <= 0) skipCols in
-- | If current collapse is zero then next skip is the span value - 1 else its collapse - 1
let nextSkip = map (\t -> if fst t == 0 then (snd t).span - 1 else (fst t) - 1) skipCols in
-- | Construct the row, change only the cols
-- | Again, note that the other fields in the row are returned unaltered
let resRow = row { cols = nextCols } in
-- | Next state
state {collapse = nextSkip, st = snoc state.st resRow }
Finally here is the Pux code to generate the HTML from the result of collapseTableArray
getTable :: forall a. Html a
getTable =
let rowData = -- See above
let c = collapseTableArray rowData in
let tableRows = map buildRow c in
div
[]
[table
[]
[thead
[]
[tr
[]
[th [] [text "year"]
,th [] [text "month"]
,th [] [text "day"]
]
]
,tbody
[]
tableRows
]
]
where
buildRow r =
tr
[]
(buildCol <$> r.cols)
buildCol c =
td
[attr "rowSpan" c.span]
[text $ c.val]
Here is what that would look like in JavaScript. The example uses React with Redux (react-redux) for the rendering and Ramda for the functional programming features. The code is compiled with Babel, with ES2015 syntax and the object spread operator enabled.
The data looks very similar to the PureScript data.
const data = [{year:2016,
months:[{month:11,
days:[{day:30},
{day:10}
]
},
{month:2,
days:[{day:15},
{day:3},
{day:1}
]
}
]
},
{year:2015,
months:[{month:5,
days:[{day:20},
{day:17}
]
}
]
}
];
As with the PureScript example the first step is to create a flattened grid of the data and then call collapseTableArray on it.
const daysInYear = y => R.flatten( R.map( m => m.days, y.months ) );
const rows = R.flatten( R.map(
y => R.map(
m => R.map(
d => ({rowId: y.year + "." + m.month + "." + d.day,
cols:[{val: y.year, span: daysInYear(y).length}
,{val: m.month, span: m.days.length}
,{val: d.day, span: 1}
]
}),
m.days ),
y.months ),
data )
);
var collapsed = collapseTableArray( rows );
The JavaScript version of the collapseTableArray function works in exactly the same way. Using Ramda and ES2015 syntax (arrow functions, destructuring etc.) allows the JavaScript code to be nice and functional.
const collapseTableArray = (tableArray) => {
  if( R.isEmpty(tableArray) ){
    return [];
  }
  //First collapse array is just a 0 for each col on the first row
  const collapse = R.map( a => 0, tableArray[0].cols );
  const fixed = R.reduce(
    ([skip, acc], row) => {
      //Combine the skip list and the cols
      const skipCols = R.zip( skip, row.cols );
      //Get the cols that should not be skipped (2nd item from each of the filtered pairs)
      const nextCols = R.map( p => p[1], R.filter( f => f[0] <= 0, skipCols ) );
      //Calculate the next skip: reset to span - 1 once the previous span has been used up
      const nextSkip = R.map( p => p[0] === 0 ? p[1].span - 1 : p[0] - 1, skipCols );
      const res = R.concat( acc, [{...row, cols: nextCols}] );
      return [nextSkip, res];
    },
    [collapse, []],
    tableArray );
  return fixed[1];
};
Finally the data is rendered with React:
return (
<div>
<table>
<thead>
<tr>
<th>Year</th>
<th>Month</th>
<th>Day</th>
</tr>
</thead>
<tbody>
{R.map( r => (<tr key={r.rowId}>
{R.map( c => (<td rowSpan={c.span} key={r.rowId + "." + c.val}>
{c.val}
</td>),
r.cols )}
</tr>),
collapsed )}
</tbody>
</table>
</div>
);
While I’ve not implemented the HTML rendering with GHCJS, I think it’s interesting to compare the PureScript and Haskell versions of the collapseTableArray function.
Starting with the Haskell types
import Prelude hiding (span) -- the span field below clashes with Prelude's span

data Col c = Col {span :: Int, cval :: c} deriving Show
data Row r c = Row {cols :: [Col c], rval :: r} deriving Show
Haskell does not have row polymorphism (i.e. extensible records), so to allow the Row and Col types to carry user-defined values I’ve used records with type parameters.
In this example I’m using Rid and Cval for the type parameters. Minimal definitions along the following lines will do (a sketch; any types with Show instances work):
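-- hypothetical minimal definitions for the type parameters
data Rid = Rid Int deriving Show
data Cval = Cval String deriving Show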
The Haskell collapseTableArray is very similar to the PureScript version
-- | same as minimum but check for an empty list
minimumSafe :: Ord a => [a] -> Maybe a
minimumSafe xs =
case xs of
[] -> Nothing
_ -> Just $ minimum xs
collapseTableArray :: [Row r c] -> [Row r c]
collapseTableArray rows =
-- | To create the initial collapse array, we need to know the number of cols in a row
-- | Get the number of cols in each row and then get the minimum value
-- | Use minimumSafe to guard against an empty list
case minimumSafe $ (\r -> length $ cols r) <$> rows of
Just i ->
-- | Initial collapse array of zeros
let collapse = replicate i 0 in
-- | fold rows with collapseRow
let (_, res) = foldl collapseRow (collapse, []) rows in
res
_ ->
[]
where
-- | The fold function
collapseRow :: ([Int], [Row r c]) -> Row r c -> ([Int], [Row r c])
collapseRow (collapse, res) row =
-- | Zip the previous collapse array and the current cols array
-- | This results in a tuple of (collapse, col)
let skipCols = zip collapse $ cols row in
-- | Get all cols where the collapse value is less than 1
-- | First the list is filtered by checking the collapse value (the first value in the tuple)
-- | Then snd is mapped over each tuple to get only the column
let nextCols = snd <$> filter (\(c,_) -> c <= 0) skipCols in
-- | If current collapse is zero then next skip is the span value - 1 else its collapse - 1
let nextSkip = map (\(c,r) -> if c == 0 then (span r) - 1 else c - 1) skipCols in
-- | Construct the row, change only the cols
let resRow = row { cols = nextCols } in
(nextSkip, res <> [resRow])
A few differences to note from PureScript:

- the fold state is a plain tuple rather than a record
- minimumSafe guards against the empty list explicitly, where PureScript’s minimum returns a Maybe by default
- deriving Show makes printing values easy; the sample project prints the result to the console

Overall though the differences are small.
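For example, printing the collapsed grid is a one-liner (a sketch; rows stands in for a [Row Rid Cval] built like the PureScript rowData):

main :: IO ()
main = mapM_ print (collapseTableArray rows)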
All the code for the examples is on github.

- PureScript: run setupPs.sh, then pulp build to build, or npm start to build and run the site
- JavaScript: run setupBable.sh, then runBabelAndWatch.sh to build the babel js
- Haskell: stack build to build, or buildAndRun.sh to build and run