You use the EXPORT TO TARGET or MIGRATE TO TARGET clauses to identify the sources of export data and start queuing that data. To enable the actual transmission of data to an export target at runtime, you include the export and configuration properties in the configuration. You can configure the export targets when you initialize the database root directory, or you can add or modify the export configuration while the database is running using the voltadmin update or set commands.
In the configuration file, the configuration list elements specify the target you are configuring and which export connector to use (with the type property). To export to multiple destinations, you include multiple elements in the configuration property, each specifying the target it is configuring. For example:
deployment:
  export:
    configuration:
      - target: log
        enabled: true
        type: file
      - target: archive
        enabled: true
        type: file
You configure each export connector by specifying properties as one or more list elements of the property property. For example, the following YAML code enables export to comma-separated (CSV) text files using the file prefix "MyExport".
deployment:
  export:
    configuration:
      - target: log
        enabled: true
        type: file
        property:
          - name: type
            value: csv
          - name: nonce
            value: MyExport
The properties that are allowed and/or required depend on the export connector you select. VoltDB comes with five export connectors:
Export to file (type "file")
Export to HTTP, including Hadoop (type "http")
Export to JDBC (type "jdbc")
Export to Kafka (type "kafka")
Export to Elasticsearch (type "elasticsearch")
In addition to the connectors shipped as part of the VoltDB software kit, an export connector for Amazon Kinesis is available from the VoltDB public Github repository (https://github.com/VoltDB/export-kinesis).
Two important points about export to keep in mind are:
Data is queued for export as soon as you declare a stream or table with the EXPORT TO TARGET clause and write to it, even if the export target has not been configured yet. Be careful not to declare export sources and then forget to configure their targets, or the export queues could grow and cause disk space issues. Similarly, when you drop the stream or table, its export queue is deleted, even if there is data waiting to be delivered to the configured export target.
VoltDB will send at least one copy of every export record to the target. It is possible, when recovering command logs or rejoining nodes, that certain export records are resent. It is up to the downstream target to handle these duplicate records, for example by using unique indexes or including a unique record ID in the export stream.
All nodes in a cluster queue export data, but only one actually writes to the external target. If one or more nodes fail, responsibility for writing to the export targets is transferred to another currently active server. It is possible for gaps to appear in the export queues while servers are offline. Normally if a gap is found, it is not a problem because another node can take over responsibility for writing (and queuing) export data.
However, in unusual cases where export falls behind and nodes fail and rejoin consecutively, it is possible for gaps to occur in all the available queues. When this happens, VoltDB issues a warning to the console (and via SNMP) and waits for the missing data to be resolved. You can also use the @Statistics system procedure with the EXPORT selector to determine exactly what records are and are not present in the queues. If the gap cannot be resolved (usually by rejoining a failed server), you must use the voltadmin export release command to free the queue and resume export at the next available record.
VoltDB uses persistent files on disk to queue export data waiting to be written to its specified target. If for any reason the export target cannot keep up with the connector, VoltDB writes the excess data in the export buffer from memory to disk. This protects your database in several ways:
If the destination target is not configured, is unreachable, or cannot keep up with the data flow, writing to disk helps VoltDB avoid consuming too much memory while waiting for the destination to accept the data.
If the database stops, the export data is retained across sessions. When the database restarts, the connector will retrieve the overflow data and reinsert it in the export queue.
Even when the target does keep up with the flow, some amount of data is written to the overflow directory to ensure durability across database sessions. You can specify where VoltDB writes the overflow export data using the paths and exportoverflow properties in the configuration. For example:
deployment:
  paths:
    exportoverflow:
      path: /tmp/export/
If you do not specify a path for export overflow, VoltDB creates a subfolder in the database root directory. See Section 3.7.2, “Configuring Paths for Runtime Features” for more information about configuring paths in the configuration.
It is important to note that VoltDB uses disk storage only for overflow data. However, you can force VoltDB to write all queued export data to disk using any of the following methods:
Calling the @Quiesce system procedure
Requesting a blocking snapshot (using voltadmin save --blocking)
Performing an orderly shutdown (using voltadmin shutdown)
This means that if you perform an orderly shutdown with the voltadmin shutdown command, you can recover the database — and any pending export queue data — by simply restarting the database cluster in the same root directories.
Note that when you initialize or re-initialize a root directory, any subdirectories of the root are purged.[5] So if your configuration did not specify a different location for the export overflow, and you re-initialize the root directories and then restore the database from a snapshot, the database is restored but the export overflow will be lost. If both your original and new configuration use the same, explicit directory outside the root directory for export overflow, you can start a new database and restore a snapshot without losing the overflow data.
The file connector receives the serialized data from the export source and writes it out as text files (either comma or tab separated) to disk. The file connector writes the data out one file per source table or stream, "rolling" over to new files periodically. The filenames of the exported data are constructed from:
A unique prefix (specified with the nonce property)
A unique value identifying the current version of the database schema
The stream or table name
A timestamp identifying when the file was started
Optionally, the ID of the host server writing the file
While the file is being written, the file name also contains the prefix "active-". Once the file is complete and a new file started, the "active-" prefix is removed. Therefore, any export files without the prefix are complete and can be copied, moved, deleted, or post-processed as desired.
There is only one required property that must be set when using the file connector. The nonce property specifies a unique prefix to identify all files that the connector writes out for this database instance. All other properties are optional and have default values.
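For example, the following sketch builds on the CSV configuration above and adds optional properties. The nonce and type properties are shown earlier in this section; the outdir and period property names are assumptions about optional file-connector settings and should be confirmed against Table 15.1 for your version of VoltDB.

deployment:
  export:
    configuration:
      - target: log
        enabled: true
        type: file
        property:
          - name: nonce
            value: MyExport
          - name: type
            value: csv
          # The following property names are assumptions; confirm them in Table 15.1.
          - name: outdir           # assumed: directory where the export files are written
            value: /tmp/export/log
          - name: period           # assumed: minutes before rolling over to a new file
            value: 60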
Table 15.1, “File Export Properties” describes the supported properties for the file connector.
Table 15.1. File Export Properties
Whatever properties you choose, the order and representation of the content within the output files is the same. The export connector writes a separate line of data for every INSERT it receives, including the following information:
Six columns of metadata generated by the export connector.
The remaining columns are the columns of the database source, in the same order as they are listed in the database definition (DDL) file.
Table 15.2, “Export Metadata” describes the six columns of metadata generated by the export connector and the meaning of each column.
Table 15.2. Export Metadata
Column | Datatype | Description |
---|---|---|
Transaction ID | BIGINT | Identifier uniquely identifying the transaction that generated the export record. |
Timestamp | TIMESTAMP | The time when the export record was generated. |
Sequence Number | BIGINT | For internal use. |
Partition ID | BIGINT | Identifies the partition that sent the record to the export target. |
Site ID | BIGINT | Identifies the site that sent the record to the export target. |
Export Operation | TINYINT | A single byte value identifying the type of transaction that initiated the export. |
The HTTP connector receives the serialized data from the export streams and writes it out via HTTP requests. The connector is designed to be flexible enough to accommodate most potential targets. For example, the connector can be configured to send out individual records using a GET request or batch multiple records using POST and PUT requests. The connector also contains optimizations to support export to Hadoop via WebHDFS.
The HTTP connector is a general purpose export utility that can export to any number of destinations from simple messaging services to more complex REST APIs. The properties work together to create a consistent export process. However, it is important to understand how the properties interact to configure your export correctly. The four key properties you need to consider are:
batch.mode — whether data is exported in batches or one record at a time
method — the HTTP request method used to transmit the data
type — the format of the output
endpoint — the target HTTP URL to which export is written
The properties are described in detail in Table 15.3, “HTTP Export Properties”. This section explains the relationship between the properties.
There are essentially two types of HTTP export: batch mode and one record at a time. Batch mode is appropriate for exporting large volumes of data to targets such as Hadoop. Exporting one record at a time is less efficient for large volumes but can be very useful for writing intermittent messages to other services.
In batch mode, the data is exported using a POST or PUT method, where multiple records are combined in either comma-separated value (CSV) or Avro format in the body of the request. When writing one record at a time, you can choose whether to submit the HTTP request as a POST, PUT or GET (that is, as a querystring attached to the URL). When exporting in batch mode, the method must be either POST or PUT and the type must be either csv or avro. When exporting one record at a time, you can use the GET, POST, or PUT method, but the output type must be form.
Finally, the endpoint property specifies the target URL where data is being sent, using either the http: or https: protocol. Again, the endpoint must be compatible with the possible settings for the other properties. In particular, if the endpoint is a WebHDFS URL, batch mode must be enabled.
The URL can also contain placeholders that are filled in at runtime with metadata associated with the export data. Each placeholder consists of a percent sign (%) and a single ASCII character. The following are the valid placeholders for the HTTP endpoint property:
Placeholder | Description |
---|---|
%t | The name of the VoltDB export source table or stream. The source name is inserted into the endpoint in all uppercase. |
%p | The VoltDB partition ID for the partition where the INSERT query to the export source is executing. The partition ID is an integer value assigned by VoltDB internally and can be used to randomly partition data. For example, when exporting to WebHDFS, the partition ID can be used to direct data to different HDFS files or directories. |
%g | The export generation. The generation is an identifier assigned by VoltDB. The generation increments each time the database starts or the database schema is modified in any way. |
%d | The date and hour of the current export period. Applicable to WebHDFS export only. This placeholder identifies the start of each period and the replacement value remains the same until the period ends, at which point the date and hour is reset for the new period. You can use this placeholder to "roll over" WebHDFS export destination files on a regular basis, as defined by the period property. |
When exporting in batch mode, the endpoint must contain at least one instance each of the %t, %p, and %g placeholders. However, beyond that requirement, it can contain as many placeholders as desired and in any order. When not in batch mode, use of the placeholders is optional.
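For example, the following sketch sends each export record individually as a form-encoded GET request to a monitoring service. The batch.mode, method, type, and endpoint properties are described in Table 15.3; the target name and URL are hypothetical placeholders.

deployment:
  export:
    configuration:
      - target: alerts                 # hypothetical target name
        enabled: true
        type: http
        property:
          - name: endpoint             # hypothetical service URL
            value: "http://monitor.example.com/api/alert"
          - name: batch.mode           # send one record per request
            value: false
          - name: method
            value: get
          - name: type                 # form is required when batch.mode is false
            value: form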
Table 15.3, “HTTP Export Properties” describes the supported properties for the HTTP connector.
Table 15.3. HTTP Export Properties
Property | Allowable Values | Description |
---|---|---|
endpoint* | string | Specifies the target URL. The endpoint can contain placeholders for inserting the source name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g). |
avro.compress | true, false | Specifies whether Avro output is compressed or not. The default is false and this property is ignored if the type is not Avro. |
avro.schema.location | string | Specifies the location where the Avro schema will be written. The schema location can be either an absolute path name on the local database server or a WebHDFS URL and must include at least one instance of the placeholder for the source name (%t). Optionally it can contain other instances of both %t and %g. The default location for the Avro schema is the file path export/avro/%t_avro_schema.json on the database server under the voltdbroot directory. This property is ignored if the type is not Avro. |
batch.mode | true, false | Specifies whether to send multiple rows as a single request or send each export row separately. The default is true. Batch mode must be enabled for WebHDFS export. |
httpfs.enable | true, false | Specifies that the target of WebHDFS export is an Apache HttpFS (Hadoop HDFS over HTTP) server. This property must be set to true when exporting via WebHDFS to HttpFS targets. |
kerberos.enable | true, false | Specifies whether Kerberos authentication is used when connecting to a WebHDFS endpoint. This property is only valid when connecting to WebHDFS servers and is false by default. |
method | get, post, put | Specifies the HTTP method for transmitting the export data. The default method is POST. For WebHDFS export, this property is ignored. |
period | integer | Specifies the frequency, in hours, for "rolling" the WebHDFS output date and time. The default frequency is every hour (1). For WebHDFS export only. |
timezone | string | The time zone to use when formatting the timestamp. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is the local time zone. |
type | csv, avro, form | Specifies the output format. If batch.mode is true, the default type is csv. If batch.mode is false, the default and only allowable value for type is form. Avro format is supported for WebHDFS export only (see Section 15.3.3.2, "Exporting to Hadoop via WebHDFS" for details). |
*Required |
As mentioned earlier, the HTTP connector contains special optimizations to support exporting data to Hadoop via the WebHDFS protocol. If the endpoint property contains a WebHDFS URL (identified by the URL path component starting with the string "/webhdfs/v1/"), special rules apply.
First, for WebHDFS URLs, the batch.mode property must be enabled. Also, the endpoint must have at least one instance each of the source name (%t), the partition ID (%p), and the export generation (%g) placeholders and those placeholders must be part of the URL path, not the domain or querystring.
Next, the method property is ignored. For WebHDFS, the HTTP connector uses a combination of POST, PUT, and GET requests to perform the necessary operations using the WebHDFS REST API.
For example, the following configuration exports stream data to WebHDFS using the HTTP connector, writing each stream to a separate directory with separate files based on the partition ID, generation, and period timestamp, rolling over every two hours:
deployment:
  export:
    configuration:
      - target: hadoop
        enabled: true
        type: http
        property:
          - name: endpoint
            value: "http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv"
          - name: batch.mode
            value: true
          - name: period
            value: 2
Note that the HTTP connector will create any directories or files in the WebHDFS endpoint path that do not currently exist and then append the data to those files, using the POST or PUT method as appropriate for the WebHDFS REST API.
You also have a choice between two formats for the export data when using WebHDFS: comma-separated values (CSV) and Apache Avro™ format. By default, data is written as CSV data with each record on a separate line and batches of records attached as the contents of the HTTP request. However, you can choose to set the output format to Avro by setting the type property, as in the following example:
deployment:
  export:
    configuration:
      - target: hadoop
        enabled: true
        type: http
        property:
          - name: endpoint
            value: "http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.avro"
          - name: type
            value: avro
          - name: avro.compress
            value: true
          - name: avro.schema.location
            value: "http://myhadoopsvr/webhdfs/v1/%t/schema.json"
Avro is a data serialization system that includes a binary format used natively by Hadoop utilities such as Pig and Hive. Because it is a binary format, Avro data takes up less network bandwidth than text-based formats such as CSV. In addition, you can choose to compress the data even further by setting the avro.compress property to true, as in the previous example.
When you select Avro as the output format, VoltDB writes out an accompanying schema definition as a JSON document. For compatibility purposes, the source name and column names are converted, removing underscores and changing the resulting words to lowercase with initial capital letters (sometimes called "camelcase"). The source name is given an initial capital letter, while column names start with a lowercase letter. For example, the stream EMPLOYEE_DATA and its column named EMPLOYEE_ID would be converted to EmployeeData and employeeId in the Avro schema.
By default, the Avro schema is written to a local file on the VoltDB database server. However, you can specify an alternate location, including a WebHDFS URL. So, for example, you can store the schema in the same HDFS repository as the data by setting the avro.schema.location property, as shown in the preceding example.
See the Apache Avro web site for more details on the Avro format.
If the WebHDFS service to which you are exporting data is configured to use Kerberos security, the VoltDB servers must be able to authenticate using Kerberos as well. To do this, you must perform the following two extra steps:
Configure Kerberos security for the VoltDB cluster itself
Enable Kerberos authentication in the export configuration
The first step is to configure the VoltDB servers to use Kerberos as described in Section 12.9, "Integrating Kerberos Security with VoltDB". Because use of Kerberos authentication for VoltDB security changes how the clients connect to the database cluster, it is best to set up, enable, and test Kerberos authentication first to ensure your client applications work properly in this environment before trying to enable Kerberos export as well.
Once you have Kerberos authentication working for the VoltDB cluster, you can enable Kerberos authentication in the configuration of the WebHDFS export target as well. Enabling Kerberos authentication in the HTTP connector requires only one additional property, kerberos.enable. To use Kerberos authentication, set the property to "true". For example:
deployment:
  export:
    configuration:
      - target: hadoop
        enabled: true
        type: http
        property:
          - name: endpoint
            value: "http://myhadoopsvr/webhdfs/v1/%t/data%p-%g.%d.csv"
          - name: type
            value: csv
          - name: kerberos.enable
            value: true
Note that Kerberos authentication is only supported for WebHDFS endpoints.
The JDBC connector receives the serialized data from the export source and writes it, in batches, to another database through the standard JDBC (Java Database Connectivity) protocol.
By default, when the JDBC connector opens the connection to the remote database, it first attempts to create tables in the remote database to match the VoltDB export source by executing CREATE TABLE statements through JDBC. This is important to note because it ensures there are suitable tables to receive the exported data. The tables are created using either the names from the VoltDB schema or (if you do not enable the ignoregenerations property) the names prefixed by the database generation ID.
If the target database has existing tables that match the VoltDB export sources in both name and structure (that is, the number, order, and datatype of the columns), be sure to enable the ignoregenerations property in the export configuration to ensure that VoltDB uses those tables as the export target.
It is also important to note that the JDBC connector exports data through JDBC in batches. That is, multiple INSERT instructions are passed to the target database at a time, in approximately two megabyte batches. There are two consequences of the batching of export data:
For many databases, such as Netezza, where there is a cost for individual invocations, batching reduces the performance impact on the receiving database and avoids unnecessary latency in the export processing.
On the other hand, no matter what the target database, if a query fails for any reason the entire batch fails.
To avoid errors causing batch inserts to fail, it is strongly recommended that the target database not use unique indexes on the receiving tables that might cause constraint violations.
If any errors do occur when the JDBC connector attempts to submit data to the remote database, VoltDB disconnects and then retries the connection. This process is repeated until the connection succeeds. If the connection does not succeed, VoltDB eventually reduces the retry rate to approximately every eight seconds.
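For example, the following sketch exports to a PostgreSQL database. The jdbcurl, jdbcuser, jdbcpassword, and lowercase properties are listed in Table 15.4; the connection string, credentials, and target name are hypothetical.

deployment:
  export:
    configuration:
      - target: warehouse              # hypothetical target name
        enabled: true
        type: jdbc
        property:
          - name: jdbcurl              # hypothetical connection string
            value: "jdbc:postgresql://dbsvr.example.com:5432/warehouse"
          - name: jdbcuser
            value: voltexport
          - name: jdbcpassword
            value: changeme
          - name: lowercase            # PostgreSQL is case sensitive
            value: true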
Table 15.4, “JDBC Export Properties” describes the supported properties for the JDBC connector.
Table 15.4. JDBC Export Properties
Property | Allowable Values | Description |
---|---|---|
jdbcurl* | connection string | The JDBC connection string, also known as the URL. |
jdbcuser* | string | The username for accessing the target database. |
jdbcpassword | string | The password for accessing the target database. |
jdbcdriver | string | The class name of the JDBC driver. The JDBC driver class must be accessible to the VoltDB process for the JDBC export process to work, so make sure the driver JAR files are available to the server. You do not need to specify the driver as a property value for several popular databases, including MySQL, Netezza, Oracle, PostgreSQL, and Vertica. However, you still must provide the driver JAR file. |
schema | string | The schema name for the target database. The use of the schema name is database specific. In some cases you must specify the database name as the schema. In other cases, the schema name is not needed and the connection string contains all the information necessary. See the documentation for the JDBC driver you are using for more information. |
minpoolsize | integer | The minimum number of connections in the pool of connections to the target database. The default value is 10. |
maxpoolsize | integer | The maximum number of connections in the pool. The default value is 100. |
maxidletime | integer | The number of milliseconds a connection can be idle before it is removed from the pool. The default value is 60000 (one minute). |
maxstatementcached | integer | The maximum number of statements cached by the connection pool. The default value is 50. |
createtable | true, false | Specifies whether VoltDB should create the corresponding table in the remote database. By default, VoltDB creates the table(s) to receive the exported data. (That is, the default is true.) If you set this property to false, you must create tables with names matching the VoltDB export sources before starting the export connector. |
lowercase | true, false | Specifies whether VoltDB uses lowercase table and column names or not. By default, VoltDB issues SQL statements using uppercase names. However, some databases (such as PostgreSQL) are case sensitive. When this property is set to true, VoltDB uses all lowercase names rather than uppercase. The default is false. |
ignoregenerations | true, false | Specifies whether a unique ID for the generation of the database is included as part of the output table name(s). The generation ID changes each time a database restarts or the database schema is updated. The default is false. |
skipinternals | true, false | Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported stream data. The default is false. |
*Required |
The Kafka connector receives serialized data from the export sources and writes it to a message queue using the Apache Kafka version 0.10.2 protocols. Apache Kafka is a distributed messaging service that lets you set up message queues which are written to and read from by "producers" and "consumers", respectively. In the Apache Kafka model, VoltDB export acts as a "producer" capable of writing to any Kafka service using version 0.10.2 or later.
Before using the Kafka connector, we strongly recommend reading the Kafka documentation and becoming familiar with the software, since you will need to set up a Kafka service and appropriate "consumer" clients to make use of VoltDB's Kafka export functionality. The instructions in this section assume a working knowledge of Kafka and the Kafka operational model.
When the Kafka connector receives data from the VoltDB export sources, it establishes a connection to the Kafka messaging service as a Kafka producer. It then writes records to Kafka topics based on the VoltDB stream or table name and certain export connector properties.
The majority of the Kafka export properties are identical in both name and content to the Kafka producer properties listed in the Kafka documentation. All but one of these properties are optional for the Kafka connector and will use the standard Kafka default value. For example, if you do not specify the queue.buffering.max.ms property it defaults to 5000 milliseconds.
The only required property is bootstrap.servers, which lists the Kafka servers that the VoltDB export connector should connect to. You must include this property so VoltDB knows where to send the export data. Specify each server by its IP address (or hostname) and port; for example, myserver:7777. If there are multiple servers in the list, separate them with commas.
In addition to the standard Kafka producer properties, there are several custom properties specific to VoltDB. The properties binaryencoding, skipinternals, and timezone affect the format of the data. The topic.prefix and topic.key properties affect how the data is written to Kafka.
The topic.prefix property specifies the text that precedes the stream or table name when constructing the Kafka topic. If you do not specify a prefix, it defaults to "voltdbexport". Alternately, you can map individual sources to topics using the topic.key property. In the topic.key property you associate a VoltDB export source name with the corresponding Kafka topic as a named pair separated by a period (.). Multiple named pairs are separated by commas (,). For example:
Employee.EmpTopic,Company.CoTopic,Enterprise.EntTopic
Any mappings in the topic.key property override the automated topic name specified by topic.prefix.
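Expressed as export connector properties, that mapping might look like the following sketch (the source and topic names come from the example above):

property:
  - name: topic.key
    value: "Employee.EmpTopic,Company.CoTopic,Enterprise.EntTopic"
  - name: topic.prefix         # used for any source not mapped by topic.key
    value: voltdbexport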
Note that unless you configure the Kafka brokers with the auto.create.topics.enable property set to true, you must create the topics for every export source manually before starting the export process. Enabling auto-creation of topics when setting up the Kafka brokers is recommended.
When configuring the Kafka export connector, it is important to understand the relationship between synchronous versus asynchronous processing and its effect on database latency. If the export data is sent asynchronously, the impact of export on the database is reduced, since the export connector does not wait for the Kafka infrastructure to respond. However, with asynchronous processing, VoltDB is not able to resend the data if the message fails after it is sent.
If export to Kafka is done synchronously, the export connector waits for acknowledgement of each message sent to Kafka before processing the next packet. This allows the connector to resend any packets that fail. The drawback to synchronous processing is that on a heavily loaded database, the latency it introduces means export may not be able to keep up with the influx of export data and may have to write to overflow.
You specify the level of synchronicity and durability of the connection using the Kafka acks property. Set acks to "0" for asynchronous processing, "1" for synchronous delivery to the Kafka broker, or "all" to ensure durability on the Kafka broker. See the Kafka documentation for more information.
VoltDB guarantees that at least one copy of all export data is sent by the export connector. But when operating in asynchronous mode, the Kafka connector cannot guarantee that the packet is actually received and accepted by the Kafka broker. By operating in synchronous mode, VoltDB can catch errors returned by the Kafka broker and resend any failed packets. However, you pay the penalty of additional latency and possible export overflow.
Finally, the actual export data is sent to Kafka as a comma-separated values (CSV) formatted string. The message includes six columns of metadata (such as the transaction ID and timestamp) followed by the column values of the export stream.
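For example, the following sketch sends export data to a three-broker Kafka cluster with fully synchronous delivery and a custom topic prefix. The bootstrap.servers, acks, and topic.prefix properties are described above and in Table 15.5; the broker addresses, prefix, and target name are hypothetical.

deployment:
  export:
    configuration:
      - target: eventlog               # hypothetical target name
        enabled: true
        type: kafka
        property:
          - name: bootstrap.servers    # hypothetical broker addresses
            value: "kafka1:9092,kafka2:9092,kafka3:9092"
          - name: acks                 # "all" ensures durability on the broker
            value: all
          - name: topic.prefix         # hypothetical prefix; the default is "voltdbexport"
            value: voltdb_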
Table 15.5, “Kafka Export Properties” lists the supported properties for the Kafka connector, including the standard Kafka producer properties and the VoltDB unique properties.
Table 15.5. Kafka Export Properties
Property | Allowable Values | Description |
---|---|---|
bootstrap.servers* | string | A comma-separated list of Kafka brokers (specified as IP-address:port-number). You can use metadata.broker.list as a synonym for bootstrap.servers. |
acks | 0, 1, all | Specifies whether export is synchronous (1 or all) or asynchronous (0) and to what extent it ensures delivery. The default is all, which is recommended to avoid possibly losing messages when a Kafka server becomes unavailable during export. See the Kafka documentation of the producer properties for details. |
acks.retry.timeout | integer | Specifies how long, in milliseconds, the connector will wait for acknowledgment from Kafka for each packet. The retry timeout only applies if acknowledgements are enabled; that is, if the acks property is set greater than zero. The default timeout is 5,000 milliseconds. When the timeout is reached, the connector will resend the packet of messages. |
binaryencoding | hex, base64 | Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal. |
skipinternals | true, false | Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported stream data. The default is false. |
timezone | string | The time zone to use when formatting the timestamp. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is GMT. |
topic.key | string | A set of named pairs each identifying a VoltDB source name and the corresponding Kafka topic name to which the data is written. Separate the names with a period (.) and the name pairs with a comma (,). The specific source/topic mappings declared by topic.key override the automated topic names specified by topic.prefix. |
topic.prefix | string | The prefix to use when constructing the topic name. Each row is sent to a topic identified by {prefix}{source-name}. The default prefix is "voltdbexport". |
Kafka producer properties | various | You can specify standard Kafka producer properties as export connector properties and their values will be passed to the Kafka interface. However, you cannot modify properties that the connector itself manages. |
*Required |
The Elasticsearch connector receives serialized data from the export source and inserts it into an Elasticsearch server or cluster. Elasticsearch is an open-source full-text search engine built on top of Apache Lucene™. By exporting selected tables and streams from your VoltDB database to Elasticsearch you can perform extensive full-text searches on the data not possible with VoltDB alone.
Before using the Elasticsearch connector, we recommend reading the Elasticsearch documentation and becoming familiar with the software. The instructions in this section assume a working knowledge of Elasticsearch, its configuration and its capabilities.
The only required property when configuring Elasticsearch is the endpoint, which identifies the location of the Elasticsearch server and what index to use when inserting records into the target system. The structure of the Elasticsearch endpoint is the following:
<protocol>://<server>:<port>/<index-name>/<type-name>
For example, if the target Elasticsearch service is on the server esearch.lan using the default port (9200) and the exported records are being inserted into the employees index as documents of type person, the endpoint declaration would look like this:
property:
  - name: endpoint
    value: http://esearch.lan:9200/employees/person
You can use placeholders in the endpoint that are replaced at runtime with information from the export data, such as the source name (%t), the partition ID (%p), the export generation (%g), and the date and hour (%d). For example, to use the source name as the index name, the endpoint might look like the following:
property:
  - name: endpoint
    value: "http://esearch.lan:9200/%t/person"
When you export to Elasticsearch, the export connector creates the necessary index names and types in Elasticsearch (if they do not already exist) and inserts each record as a separate document with the appropriate metadata. Table 15.6, “Elasticsearch Export Properties” lists the supported properties for the Elasticsearch connector.
Table 15.6. Elasticsearch Export Properties
Property | Allowable Values | Description |
---|---|---|
endpoint* | string | Specifies the root URL of the RESTful interface for the Elasticsearch cluster to which you want to export the data. The endpoint should include the protocol, host name or IP address, port, and path. The path is assumed to include an index name and a type identifier. The export connector will use the Elasticsearch RESTful API to communicate with the server and insert records into the specified locations. You can use placeholders to replace portions of the endpoint with data from the exported records at runtime, including the source name (%t), the partition ID (%p), the date and hour (%d), and the export generation (%g). |
batch.mode | true, false | Specifies whether to send multiple rows as a single request or send each export row separately. The default is true. |
timezone | string | The time zone to use when formatting timestamps. Specify the time zone as a Java TimeZone identifier. For example, you can specify a continent and region ("America/New_York") or a time zone acronym ("EST"). The default is the local time zone. |
*Required |
[5] Initializing a root directory deletes any files in the command log and overflow directories. The snapshots directory is archived to a named subdirectory.