Encoding of a Relational Database into Full-Text Search Database

Posted on 16 Jan 2012

Introduction

Performing full-text search on a relational database can be an expensive operation. This is typically a linear operation. In order to allow for nearly instantaneous search speeds this data must be encoded into a full-text search database. In the case of this document Lucene will be used.

Crawling the Relational Database

While the database schema can be inferred from the database itself, allowing manual configuration allows for maximum control. You may ignore some columns, for instance.

Entity Configuration

An entity configuration entry is made up of the following components:

• Type - A type can be either an :value, :entity, or :group
• Class - The class represents the kind of entity it is
• SQL - The SQL query required to produce the entities (currently implemented with ClojureQL)
• ID - The attribute column (or columns) that uniquely identify each entity
• Attrs - A listing of attributes for each entity
• Values - Defines which columns to be added to the values index from each entity class

For example,

Crawling Procedure

Each entity configuration is capable of crawling the database for its own entities. In order to do this, the point of entry to the application must pass along the database connection, full-text search database connection, and a writer for the full-text database.

Since it knows the SQL required to extract all of its entities, it takes it and executes a query against the database. For every result returned, each row is encoded into a document and added to the FTDB.

Handling Values

One SQL query would be enough to handle extract all of the values from the database, but it would not extract all unique values. In order to do this, the query is modified for every value column specified in the entity configuration file. The column values are projected out and grouped together. By doing this all unique values for each column are found. These are then encoded into a document as well and added to the FTDB.

Encoding Values, Entities, and Entity Groups

Each row returned by the database query is in the form of a map. For example,

Essentially a row contains all attributes specified in the entity configuration for that particular entity. This is enough in a relational database as additional information can be stored in places such as the table name and primary keys, but not in a full-text database. In a full-text database there is no concept of a fixed schema so this additional information must be stored in another form.

Data Structure Representation

Internally in Clojure each row is stored as a special map. Each map consists of the requested attributes, as well as additional meta-data. This meta-data contains additional information such as the type, class, and identifier for the particular entity.

An example of this would be as follows,

By encoding entities in this manner, we are able to preserve important information without polluting the data structure itself with what should be considered meta-data.

The in-memory data structure entity representation is used internally to allow for flexibility. Both rows and documents can be encoded or decoded using this format. This allows internal code to be able to handle entities from either a relational database or search results from the full-text search database. By utilizing an in-memory format we reduce code complexity as well as create more flexibility.

Encoding Data Structure

In order to support the encoding and decoding from different sources, we utilize multi-methods. This allows us to dispatch the correct encoding or decoding function based on the type given. This works as Lucene Documents are of type Document and database rows are of type IPersistentMap.

From Database Row

Each database row returned contains a superset of the attribute columns specified by the entity configuration. In addition, it also includes a unique identifier used by the database.

For the internal data structure, this gives the attributes as well as part of the :__id__ field. The problem is that some information specified in the database is missing. This information comes from the entity definition.

The entity definition provides the missing :type, :class, and other part of the unique identifier. It also specifies which attributes are of interest in the case that the database schema holds additional information that is not to be indexed. This is easily accomplished with the select-keys function.

For example,

Thus each entity data structure instance is composed of data from the following sources:

• Entity schema definition
• The database row itself

By composing together the two sources of information, a complete entity can be created.

From Existing Document

When converting back to the intermediate representation from a Lucene document, we can make certain assumptions. Each document is stored with meta-data prefixed and suffixed with two underscores. In addition, only attributes of interest are stored in each document. Thus encoding a document is as simple as pulling out required meta-data and disassociating the meta-data from the attributes.

Appendix

Internal Clojure Representation

Entity

(with-meta attrs
{:T :class :id})


Value

(with-meta attr
{:T :class})


Entity Group

(with-meta attrs
{:T :class :entities})


Lucene Representation

Entity

{:T     :entity
:class table-name
:id    table-name|id-val
attr_1 val_1
attr_2 val_2
.      .
.      .
.      .
:all   val_1 val_2 ...}


Value

{:T     :value
:class table-name|attr-name
attr   val
:all   n-gram(val)}


Entity Group

{:T        :group
:class    relation-desc
:entities [table-name|id-val table-name|id-val ...]
attr_1    val_1
attr_2    val_2
.         .
.         .
.         .
:all      val_1 val_2 ...}