I have spent a significant amount of time now working on my final project for the natural language processing course I have been taking: a research paper! The topic I chose evolved from a general survey of applications of functional programming to natural language processing into a more specific look at how we might formulate finite state transducers in terms of the transducers Rich Hickey has been working on. Below is the final result of my efforts. If you are interested in Clojure's transducers, and have any constructive feedback or ideas about the things I've been thinking about, I'd love to hear it! You can reach me on Twitter!
Functional programming is quickly becoming more prominent in the software engineering world. Several predominantly object-oriented languages like Java and C++ have in the past few years adopted features that improve their capacity for programming in a functional style, and functional languages like Clojure, Scala, and, in particular, Haskell are finding a place both in the hearts of many and in industry. Functional programming techniques give us powerful tools to express ourselves concisely and to build efficient and elegant programs. In this paper I will explore one particular application of functional programming to natural language processing. I will investigate transducers as formulated by Rich Hickey, the creator of the Clojure programming language, and look at how they might be used to implement the finite state transducers traditionally used in natural language processing. While there are a number of subjects I could reflect upon to provide a sense of how functional programming is applied to this field, I feel that it is this emerging subject, transducers, that holds the most interesting potential for benefit and innovation in both the functional programming and natural language processing communities. Throughout this paper, I will use the Haskell programming language to demonstrate some core ideas that we will want to explore to understand Hickey's transducers.
To begin our discussion, let us first address the question of what functional programming is. While factions of the functional programming community have not reached total consensus on exactly which attributes uniquely define the field, there are a few key traits that are widely accepted. First is the focus on and support for higher-order and first-class functions, which accept other functions as parameters and can return functions. Second is the focus on purity. The phrase referential transparency is often used here to describe the property that functions have no side effects and always return the same result when called on the same input, making them much more like functions of the mathematical variety. A third attribute of the functional style is that recursion is preferred to iteration. We will see that recursive functions can embody all of the functionality of iterative ones and that a particular variety of recursive functions- namely, folds- has some very powerful properties. Functional programs also tend to be written in a declarative style rather than an imperative one. This means that, rather than writing code that tells the computer exactly how to compute something, our functions describe what it is that is to be computed. Functional programming also encourages the separation of functionality from data- quite the opposite of object-oriented languages, which encourage the notion that the two should be combined. As a result of this property, it is often said that everything in a functional programming language is a value. This statement is critical to our understanding of the role functions play in programs, as functions are themselves values subject to operations such as composition. Functional programming languages also often feature powerful type systems, such as that of Haskell, but can be dynamically typed, as Clojure is.
In our first case study, we will have a look at the incredibly important fold function, which embodies many of the features of a functional programming language outlined above.
fold :: (b -> a -> b) -> b -> [a] -> b
fold fn acc [] = acc
fold fn acc ls = fold fn (fn acc next) rest
  where next = head ls
        rest = tail ls
It is critical that we understand this function for further discussion, so I will take my time here to make sure it is explained thoroughly. The fold function is used to reduce a list of values into a single value using another function that defines how to accumulate the value to compute. The first line of this snippet is the type signature. Haskell supports the use of type variables which, in this case, are much like the generics feature supported by other languages. The first argument to fold is itself a function, taking a parameter of some type b and a parameter of another type a and returning a value of the first type b. We will refer to this as our reducing function. The first parameter it accepts is the currently accumulated value and the second is the next value from the sequence being reduced; it returns the next accumulated value. The type of the second argument to fold is b, the same type that is being accumulated. This parameter is the initial value for the reduction upon the first invocation of fold. Finally, the type of the third parameter is [a], a list of input elements to reduce, and the output type is b, the type of the accumulated value.
The implementation of the fold function here makes use of pattern matching in the first equation to state that, when the function is invoked on an empty input sequence, the only value appropriate to produce is the initial value; this is also the base case of our recursive definition. The second equation defines the recursive case, where we evaluate to a recursive call to fold with the same reducing function, but with the accumulator updated by a call to the reducing function on the currently accumulated value and the head- the first element- of the remaining input sequence. Naturally, the last value provided to our recursive invocation is the tail- all but the head- of the sequence. We observe three particularly significant aspects of a functional programming language at work here. First is the use of pattern matching to elegantly define the control structure necessary to handle the base case where the input sequence is empty. Second is the use of recursion as opposed to iteration to operate on successive elements of the input sequence. Finally, we observe that, rather than directly mutating the value of a variable, we take advantage of the fact that recursive calls provide function invocations with new scopes and simply update the accumulated value in the recursive call, thereby achieving purity.
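To make the recursion concrete, here is a small trace of my own illustrating how the definition above unwinds when combining a list with addition:

```haskell
-- fold as defined above, with the evaluation of fold (+) 0 [1,2,3]
-- traced step by step in the comments:
--   fold (+) 0 [1,2,3]
--   = fold (+) (0 + 1) [2,3]
--   = fold (+) (1 + 2) [3]
--   = fold (+) (3 + 3) []
--   = 6
fold :: (b -> a -> b) -> b -> [a] -> b
fold fn acc []     = acc
fold fn acc (x:xs) = fold fn (fn acc x) xs
```

Each recursive call carries the updated accumulator forward, so no variable is ever mutated.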
Let us now look at a couple of applications of the fold function.
sum :: [Integer] -> Integer
sum ls = fold (+) 0 ls

reverse :: [a] -> [a]
reverse xs = fold (\acc x -> [x] ++ acc) [] xs
These two functions, sum and reverse, are quite naturally both implemented as folds. The sum function reduces a sequence of integers with addition. The reverse function reduces a list by flipping successive elements onto the front of the reversed list it accumulates. We could, of course, have written both of these functions using the same explicitly recursive structure that we used to define fold itself. The fact that we can replace such definitions with a simple call to fold is what is referred to as the universal property of fold, formally stated with the following equivalence:
g []     = v
g (x:xs) = f x (g xs)
  <=>
g = fold f v
Here, we use Haskell's pattern matching to deconstruct the list argument passed to g into the head element, x, and the tail of the list, xs. The list argument to fold is left implicit in this relation. (Strictly speaking, this statement characterizes the right fold, whose recursive case combines elements as f x (g xs); the left fold we defined earlier satisfies an analogous property.) This equivalence is of particular interest to us because it makes it easy to reason precisely about how fold operates on a sequence. It means that we do not have to rely on inductive proofs to establish the equivalence of recursive functions on lists, but can do so directly using the universal property of fold.
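As a concrete instance (my own illustration), the explicitly recursive definition of sum fits the template with v = 0 and f = (+), so the universal property lets us collapse it to a fold; since addition is associative and commutative, the left fold defined earlier computes the same results here:

```haskell
-- fold repeated from earlier so this snippet stands alone.
fold :: (b -> a -> b) -> b -> [a] -> b
fold fn acc []     = acc
fold fn acc (x:xs) = fold fn (fn acc x) xs

-- The template: g [] = v and g (x:xs) = f x (g xs), with v = 0 and f = (+).
sumRec :: [Integer] -> Integer
sumRec []     = 0
sumRec (x:xs) = x + sumRec xs

-- By the universal property, g = fold f v.
sumFold :: [Integer] -> Integer
sumFold = fold (+) 0
```

Both definitions agree on every input, and the property spares us an inductive proof of that fact.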
With this foundational understanding of how folding functions work, we can now move on to the subject of reducers. Briefly, we will first abstract the notion of a reducing function into something more general. The goal here is to produce something that extracts the core essence of functions like map, filter, and flatmap, among others, and that is input independent. Let us redefine these functions.
mapping f = \glue -> reducer glue
  where reducer glue res i = glue res (f i)

filtering p = \glue -> reducer glue
  where reducer glue res i = if p i then glue res i else res

flatmapping f = \glue -> reducer glue
  where reducer glue res i = fold glue res (f i)
Each of these accomplishes exactly what we had hoped for. Both mapping and flatmapping accept a function as their first parameter, which they will apply to inputs, and filtering accepts a predicate function that it uses to decide whether the inputs it receives will be included in the reduced output. Each of these functions evaluates to a new function that accepts a glue parameter, which is itself a function that defines how successive results are to be combined. It is the reducer function produced from each of these functions that applies no more than the core logic of each of the functions we defined here. It is also important to note that none of these functions is defined in terms of any particular kind of collection. The reducer functions here are exactly the kind of functions we would supply to fold.
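To see these abstractions in action, here is a sketch of my own (with explicit type signatures added, and reusing fold from earlier): a filtering glued onto a mapping yields a reducer we can hand straight to fold, with no intermediate lists and no collection-specific logic.

```haskell
fold :: (b -> a -> b) -> b -> [a] -> b
fold fn acc []     = acc
fold fn acc (x:xs) = fold fn (fn acc x) xs

mapping :: (a -> b) -> (r -> b -> r) -> (r -> a -> r)
mapping f glue = \res i -> glue res (f i)

filtering :: (a -> Bool) -> (r -> a -> r) -> (r -> a -> r)
filtering p glue = \res i -> if p i then glue res i else res

-- Sum the squares of the even numbers in [1..6]: the evens 2, 4, 6
-- square to 4, 16, 36, which sum to 56.
sumSquaredEvens :: Integer
sumSquaredEvens = fold (filtering even (mapping (\x -> x * x) (+))) 0 [1 .. 6]
```

The (+) at the core plays the role of glue; everything wrapped around it only decides which inputs reach it and in what form.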
We now have everything we need to begin discussing the subject of transducers. A transducer is, in the simplest sense, a function that operates on a reducer to produce a new reducer. In Hickey's talk at Strange Loop this year, he presented some early Clojure code that implements some of the core functionality transducers will support. In the talk, he acknowledged that producing accurate and complete types for transducers may prove somewhat challenging. Although there is a great deal of discussion, and some work being done, on implementing transducers in Haskell with types, I will not include any of the code written by others so far in this paper. Prior to his talk, Hickey had written about transducers and provided a sort of "pseudotype" that captures the essence of what a transducer is.
(whatever, input -> whatever) -> (whatever, input -> whatever)
Transducers appear to have strong fundamental properties associated with them. That is, when we look at their characteristics, they certainly seem an obvious thing to want to be able to work with. Of particular interest to us are the facts that transducers do not need to be made aware of what the reducers they manipulate do, the context they are used in, where the input they are to operate on will come from, or when- indeed, if- the reducers actually perform any computation as input arrives. Hickey has invested in transducers as a way to work with the many functions that can be implemented as folds in a manner that is agnostic to input: they will work on channels in concurrent settings, with GUI event sequences, with regular lists, and in many other situations. Another interesting property of transducers, which results from the way they are constructed, is that they eliminate the need to produce intermediate sequences when operating on lists. This means that the following two programs are equivalent, but the first will be more efficient.
transform = (tmap square) . (tfilter even) . (tmap increment)
transduce transform [1..10]
  <=>
function = (map increment) . (filter even) . (map square)
function [1..10]
The second program will create a new list after squaring the numbers in the initial sequence, another after filtering it, and another after incrementing the elements. The first, however, will apply each reducer in succession as it builds only one new list for the result. It is worth noting that a composed transducer applies its leftmost stage to each input first, the opposite of ordinary function composition, so care is needed when writing out the pipeline. As previously stated, the first program will also be able to accept input from such sources as channels, whereas the second will only work on lists.
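As a runnable sketch of what tmap, tfilter, and transduce might look like in Haskell (these definitions are my own speculation under the names used above, not an official implementation):

```haskell
-- A transducer here is a function from reducer to reducer.
tmap :: (a -> b) -> (r -> b -> r) -> (r -> a -> r)
tmap f glue res i = glue res (f i)

tfilter :: (a -> Bool) -> (r -> a -> r) -> (r -> a -> r)
tfilter p glue res i = if p i then glue res i else res

-- transduce feeds every input through the transformed reducer in a
-- single pass, conj-ing survivors onto one output list.
transduce :: (([b] -> b -> [b]) -> ([b] -> a -> [b])) -> [a] -> [b]
transduce xform = foldl (xform conj) []
  where conj acc x = acc ++ [x]

-- Square, keep the evens, then increment, all in one pass.
result :: [Integer]
result = transduce (tmap (\x -> x * x) . tfilter even . tmap (+ 1)) [1 .. 10]
-- result is [5, 17, 37, 65, 101]
```

Only one output list is ever built; the leftmost stage in the composition (squaring) is the first to touch each input element.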
Some of the properties of transducers that we have already seen hint at how we might formulate finite state transducers in terms of Hickey's transducers, but transducers allow for a couple more features that we can make use of. They are described as supporting two other powerful operations. The first is early termination: it is possible to send a signal to a transducer telling it to reject any further input. The second is completion: a transducer can be told that it has successfully completed its operation and that it can ignore any future input. These two features in particular strongly suggest that the finite state transducers we are used to working with could be implemented as transducers.
Let us now consider finite state transducers, which we are already familiar with from the field of natural language processing. Formally, a finite state transducer is defined as a six-tuple containing a set of states, an input alphabet, an output alphabet, a set of starting states, a set of final states, and a set of transition relations, each mapping a state and a member of the input alphabet to pairs of a member of the output alphabet and a state. In Haskell, we might directly model such a structure with some type synonyms and a data type.
type State = [Integer]
type Alphabet = String
type Transition = State -> Alphabet -> [(Alphabet, State)]

data FST = FST [State] Alphabet Alphabet [State] [State] [Transition]
While this representation is fairly simple and true to the formal definition of a finite state transducer, it would be unpleasant to work with in our code, as we would have to exhaustively search the list of transition functions to find a transition from a given state when provided some input. There are a number of operations that can be performed on FSTs, such as union, intersection, and composition. Any model for FSTs that we might come up with would have to allow for these operations to be defined, ideally in an efficient and elegant manner.
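To illustrate the awkwardness, here is a hypothetical helper of my own (not part of the formal definition): stepping such a machine means trying every transition function in the list and collecting whatever each one yields.

```haskell
type State      = [Integer]
type Alphabet   = String
type Transition = State -> Alphabet -> [(Alphabet, State)]

data FST = FST [State] Alphabet Alphabet [State] [State] [Transition]

-- Exhaustively consult every transition function for moves out of the
-- current state on the given input symbol.
step :: FST -> State -> Alphabet -> [(Alphabet, State)]
step (FST _ _ _ _ _ ts) q input = concatMap (\t -> t q input) ts
```

Every call to step scans the entire list of transitions, regardless of how few of them are relevant to the current state.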
As a matter of discussion, I would like to consider what I think are some possible correspondences between finite state transducers and the kinds of transducers Rich Hickey is creating for the Clojure programming language. It appears to me that the transition relations in FSTs are themselves essentially folding functions that accumulate a state and an output string. Let us recall (with slightly nicer naming) the type of a reducing function, alongside the type we produced for a transition relation.
type ReducingFn acc input = acc -> input -> acc
type Transition = State -> Alphabet -> [(Alphabet, State)]

Indeed, with very little tweaking, we could change the transition type to accept a single-element list containing a pair whose first entry is an accumulated output string and whose second entry is the current state. Admittedly, this will diminish the purity of our implementation, but it would allow us to take advantage of all the machinery provided by fold. The type of a transition becomes the following.
type Transition = [(Alphabet, State)] -> Alphabet -> [(Alphabet, State)]
As a subject for further research, one might investigate the potential to frame finite state transducers in the framework of the kind of transducer we have explored in this paper. It is my suspicion that one could implement transition relations in terms of reducing functions quite naturally. In particular, a simple transition such as (q1, a:b, q2) could be implemented as a mapping that accepts the value [(a, q1)] and returns [(b, q2)]. Similarly, a transition such as (q1, a:E, q2) that deletes the character 'a' could be implemented as a filtering. A non-deterministic FST that allows multiple transitions on the same upper or lower string would simply be a flatmapping, or a flatmapping composed with a filtering if epsilon transitions are allowed. It may even be possible to implement FSTs simply by returning to the formal definition we saw earlier, implementing the set of transition relations as one composed transducer. This approach, however, would produce a single entity- the composed transducer- that, based on what I have read so far, I do not believe could be decomposed for FST operations such as composition. It may nevertheless be possible to implement operations such as intersection and union very simply using transducer composition. The exact details of how these operations on finite state transducers could be implemented remain to be explored but, because we are working with an abstraction of the fold function we saw much earlier, we would have a powerful foundation to reason formally about our constructions and to both investigate and prove their properties.
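Here is a speculative sketch of the first of these correspondences (all names and details are my own illustration, not an established construction): the rewriting transition (q1, a:b, q2) expressed as a mapping over (output, state) pairs, folded over a stream of such pairs.

```haskell
type State    = [Integer]
type Alphabet = String

fold :: (b -> a -> b) -> b -> [a] -> b
fold fn acc []     = acc
fold fn acc (x:xs) = fold fn (fn acc x) xs

mapping :: (a -> b) -> (r -> b -> r) -> (r -> a -> r)
mapping f glue res i = glue res (f i)

-- The transition (q1, a:b, q2): reading "a" in state [1] emits "b"
-- and moves to state [2]; anything else passes through untouched.
rewriteAB :: (Alphabet, State) -> (Alphabet, State)
rewriteAB ("a", [1]) = ("b", [2])
rewriteAB pair       = pair

conj :: [a] -> a -> [a]
conj acc x = acc ++ [x]

applied :: [(Alphabet, State)]
applied = fold (mapping rewriteAB conj) [] [("a", [1]), ("c", [1])]
-- applied is [("b", [2]), ("c", [1])]
```

A deleting transition would replace the mapping with a filtering, and a non-deterministic one with a flatmapping, exactly as suggested above.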
Transducers provide us with a new way to think about problems that can be formalized as a fold. Interestingly, a finite state transducer expressed as a transducer would require no more space to store than the transducer itself, which is simply a function. We would not be required to create a graph data structure to represent the FST, which could mean saving a substantial amount of memory. An interesting observation is that, due to our choice of the [Integer] type to represent states, the composition of two simple FSTs, each with states [1] and [2], would have states [1, 1], [1, 2], [2, 1], and [2, 2]. This means that the composed transition relations would have to be constructed to perform no operations on states with only one identifier, and to transition only on states containing two identifiers. In this sort of construction, the complexity of our transducer would grow, but it would be rather simple to create a composed FST by composing a transducer that performs these composed operations onto either of the original FSTs- or, perhaps, their composition. Clearly, there is a great deal to consider in working to formulate FSTs as transducers; however, these thoughts lead me to believe that there may in fact be a number of interesting properties to be discovered with regard to FSTs represented as transducers.
We have learned about some core attributes of functional programming, and taken a deep look at the particularly powerful fold function, which laid the foundation for our appreciation of the kinds of benefits functional programming might offer us when we consider implementing algorithms and techniques from natural language processing. The fold function in particular offers us a powerful and concise way to define many recursive functions such that we abstract the logic for combining results from a sequence of inputs out into a function that we provide to fold. It also comes with special properties- such as universality- that we can exploit to write simple proofs for our algorithms. We then moved on to discuss reducers, which we abstracted into the mapping, filtering, and flatmapping functions to isolate the core logic of each of these functions and make them input independent.
Abstracting again over reducers to produce transducers, which we defined as functions that transform reducers, we briefly discussed some of the properties transducers possess as well as some of the operations they support. We began to outline a model for thinking of finite state transducers as transducers: we took a look at the formal definition of a finite state transducer and reformulated the notion of a transition as a reducing function. Having made this comparison, we discussed how we might use some of the features of transducers to define finite state transducers in a particularly efficient fashion. As not much work has yet been done to investigate transducers, and their official implementation in the Clojure programming language is not yet available, we have to speculate about how we might work with and implement them. This, of course, does not prevent us from being imaginative, and we saw that there at least appears to be potential for a very powerful representation of finite state transducers as transducers. It is my hope that transducers will receive more attention in the future, that sufficient implementations- ideally including complete type signatures- will be developed, and that others might become as convinced as I am of the potential for transducers, and functional programming in general, to have a positive impact on how we reason about and develop natural language processing systems.