Lingo

All posts tagged Lingo

OTAS Lingo is our natural language app to summarise many of our analyses in a textual report. Users can subscribe to daily, weekly or monthly Lingo emails for a custom portfolio of stocks or a public index, and receive a paragraph for each metric we offer, highlighting important changes across the time period. Our method of natural language generation is simple: given the state of the underlying data, we generate at least one sentence for each of the metric’s sub-topics, and concatenate the results to form a paragraph. This method of text generation has proved to be very customisable and extensible, despite being time-consuming to implement, and the results are very readable. However, given that we specify hundreds of these sentences in the code base, some of which are rendered very infrequently, how do we leverage compile-time checking to overcome the runtime issues of tools such as printf?

First, a little background. We render our natural language both as HTML and plain text, so we abstract the underlying structure with a GeneratedText type. This type represents dates, times and our internal security identifiers (OtasSecurityIds), in addition to raw text values. HTML and plain text are rendered with two respective functions, which use additional information, such as security identifier format (name, RIC code etc.), to customise the output. Here’s a simplified representation of the generated text type.

type HtmlClass = T.Text

data Token 
  = Text T.Text -- ^ Strict Text from text package
  | Day Day -- ^ From time package
  | OtasId OtasSecurityId 
  | Span [HtmlClass] [Token] -- ^ Wrap underlying tokens in span tag, with given classes, when rendering as HTML

newtype GeneratedText = GeneratedText [Token]
  deriving (Monoid)

We expose GeneratedText as a newtype, and not a type synonym, to hide the internal implementation as much as possible. Users are provided with functions such as liftOtasId :: OtasSecurityId-> GeneratedText, which generates a single-token text value, an IsString instance for convenience and a Monoid instance for composition. With this, we can also express useful combinators such as list :: [GeneratedText] -> GeneratedText, which will render a list more naturally as English. These combinators may be used to construct sentences such as the following.

-- ^ List stocks with high volume.
-- E.g. High volume was observed in Vodafone, Barclays and Tesco.
volumeText :: [OtasSecurityId] -> GeneratedText
volumeText sids
  = "High volume was observed in " <> list (map liftOtasId sids) <> "."

This definition is okay, but it is helpful to separate the format of the generated text from the underlying data as much as possible. Think about C-style format strings, which completely separate the format from the arguments to the format. Our code base contains hundreds of text sentences, so the code is more readable with this distinction in place. Additionally, there’s value in being able to treat formats as first class, allowing the structure of a sentence to be built up from sub-formats before being rendered to a generated text value. Enter formatting.

The formatting package is a fantastic type-safe equivalent to printf, allowing construction of typed format strings which can be rendered to strict of lazy text values. The type and number of expected arguments are encoded in the type of the formatter, so everything is checked at compile time: there are no runtime exceptions with formatting. The underlying logic is built upon the HoleyMonoid package, and many formatters are provided to render, for example, numbers with varying precision and in different bases. Formatting uses a text Builder under the hood, but the types will work with any monoid, such as our GeneratedText type, although we do need to reimplement the underlying functionality and formatters. However, this gives us a very clean way of expressing the format of our text:

volumeFormat :: Format r ([OtasSecurityId] -> r)
volumeFormat
  = "High volume was observed in " % list sid % "."

volumeText :: [OtasSecurityId] -> GeneratedText
volumeText 
  = format volumeFormat

The Format type expresses the type and number of arguments expected by the format string, in the correct order. volumeFormat above expects a list of OtasSecurityIds, so has type Format r ([OtasSecurityId] -> r), whereas a format string expecting an OtasSecurityId and a Double would have type Format r (OtasSecurityId -> Double -> r). The r parameter represents the result type, which, in our case, will always be a GeneratedText. Formatters are composed with (%), which keeps track of the expected argument, and a helpful IsString instance exists to avoid lifting text values. Finally, format will convert a format string to a function expecting the correct arguments, and yielding a GeneratedText value.

The format string style is especially beneficial as the text becomes longer, as the format is described without any noise from the data being textualised. We use a range of formatters for displaying dates, times, numeric values and security IDs, but we can also encode more complex behaviours as higher-order formatters, such as list above. Users can also define custom formatters to abstract more specialised representations. In practice, we specify all our generated text in Lingo in this way: it’s simply a clearer style of programming.