What’s New in Kognitio 8.2.2
This blog will take you through the new features in the latest Kognitio release, version 8.2.2, which is now available for download. Version 8.2.2 was designed around feedback from Kognitio users to resolve some key challenges they were experiencing, with the following goals in mind:
- Make it easier to connect to Hive tables
Our users wanted us to simplify the process of bringing Hive tables into Kognitio, reducing the number of steps involved and reducing the management overhead.
- Improve the running of queries directly against external data
Kognitio’s primary use case is building memory images from external data and running queries against the images, but some of our users were also using the product to directly query data in an external data lake.
- Make it easier to write results externally
Users wanted writeable external tables which they could put results into with standard SQL (insert-select, etc).
Metadata connectors
Metadata connectors are a new class of connector which we added to the product to make it easier to connect to Hive (and other similar products) which store their data in a data lake. With this sort of product, Kognitio wants to go directly to the lake to access the data itself for performance, but the user doesn’t want to have to define external tables at that level, preferring instead to name the higher level object (e.g. the Hive table) and have Kognitio work out where the underlying data is and how to access it.
Metadata connectors in Kognitio handle this by collecting metadata for the remote object (column names, types, file locations, file format, etc) and returning it to the server. The existing connectors have been standardized for compatibility and given a way to register themselves as providers for particular protocols and file formats. Kognitio uses the information from the metadata connector to generate a new connector definition under the hood and invokes another connector to provide access to the actual data. All of this is transparent: the user just interacts with the metadata connector and doesn’t need to know anything about file formats, transports, etc.
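The flow described above can be sketched roughly as follows. This is a simplified illustration only; all names and structures here are hypothetical and do not reflect Kognitio's real connector API:

```python
# Hypothetical sketch of the metadata-connector flow: a metadata connector
# describes the remote object, and the server turns that description into a
# definition for the underlying data connector. Illustrative names only.

def hive_metadata_connector(target):
    """Pretend metadata connector: returns metadata for a Hive table.

    In reality this would query the Hive metastore for the named table."""
    return {
        "columns": [("id", "int"), ("name", "varchar(100)")],
        "location": "hdfs://cluster/warehouse/db.db/tablename",
        "format": "parquet",
    }

# Registry mapping file formats to the data connectors registered for them.
DATA_CONNECTORS = {"parquet": "parquet_connector", "orc": "orc_connector"}

def generate_connector_definition(meta):
    """Generate a definition for the connector that will read the data."""
    return {
        "connector": DATA_CONNECTORS[meta["format"]],
        "target": meta["location"],
        "columns": meta["columns"],
    }

meta = hive_metadata_connector("table db.tablename")
definition = generate_connector_definition(meta)
print(definition["connector"])  # parquet_connector
```

The key point is that the user only ever names the high-level object; the format lookup and connector selection happen behind the scenes.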
The new Hive connector is the first to use this functionality but we have more metadata connectors planned for future releases, including a connector for AWS Glue.
The Hive connector is very much the centerpiece of the 8.2.2 release, providing easy and seamless access to data via the Hive metastore. A Hive connector instance is automatically created on new Kognitio on Hadoop/MapR clusters and is ready to go. Accessing a Hive table through the new connector looks like this:
select ht.* from (external table from hive target 'table database.tablename') ht;
Or the user can create a permanent external table within Kognitio to reference the Hive table like this:
create external table kog_tablename from hive target 'table database.tablename';
Access to these external tables will be via the appropriate connector for the files Hive is using to store the table data. Kognitio is fully aware of any Hive partitioning, allowing the user to use Hive partition columns as regular table columns within Kognitio. Partition and column pruning will be automatically performed during queries on these external tables.
Kognitio supports connectivity to Hive tables stored using ORC, Parquet, Avro and CSV (including use of OpenCSVSerde).
Improved partition filtering
We improved partition filtering in 8.2.2 to simplify the way Kognitio accesses partitioned data. The goal was to make partition filtering happen automatically, as simply as possible, and invisibly to the user wherever possible.
Using the new partition filtering mechanism, the server identifies sets of possible values for external columns at query compilation time. This can be done from constraints stated explicitly in the query (e.g. ‘where datecol = date '2018-03-04'’) or by extracting values from RAM-based lookup tables, for example pulling the values out of a sub-select in a query like:
select * from fact_table where date_key in (select key from date_dim where date_val between XXX and YYY)
The extracted set of values for each column is then passed into the connector, which is able to use these to perform whatever partition elimination is possible. For data using the standard Hive partitioning scheme, the connector is already aware of partitioning columns so it can turn parts of the filename being used into column values. This works regardless of whether the files are being accessed through Hive or directly with an underlying connector. The existing connectors have all been extended to automatically use the column value sets supplied for this sort of partition column in order to filter filenames before opening them.
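For the standard Hive partitioning scheme, partition values are encoded in the file path as key=value directory names, so pruning amounts to parsing those path segments and discarding files whose values are excluded. The idea can be sketched like this (a simplified illustration, not Kognitio's actual implementation):

```python
# Simplified sketch of Hive-style partition pruning: parse key=value path
# segments into column values, then keep only the files whose values fall
# in the sets the server extracted at query compilation time.

def partition_values(path):
    """Parse Hive-style partition columns out of a file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

def prune(paths, allowed):
    """Filter filenames before opening them.

    `allowed` maps column name -> set of permitted values; columns not
    present in a path are ignored rather than excluding the file."""
    kept = []
    for path in paths:
        values = partition_values(path)
        if all(values[col] in vals
               for col, vals in allowed.items() if col in values):
            kept.append(path)
    return kept

files = [
    "/warehouse/sales/datecol=2018-03-04/part-0.parquet",
    "/warehouse/sales/datecol=2018-03-05/part-0.parquet",
]
print(prune(files, {"datecol": {"2018-03-04"}}))  # keeps only the 2018-03-04 file
```

Because the partition values live in the path itself, this filtering can happen before any file is opened, which is what makes the elimination cheap.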
Writeable external tables
Writeable external tables provide Kognitio users with a very easy way of exporting results using standard SQL. We currently only support insert operations on this type of table; for delete and update, the user needs to use an internal table. Creating a writeable external table via a compatible connector is very simple, for example:
create external table tablename (c1 int, c2 varchar(100)) for insert from connector_name target 'filename /path/to/file/location';
The column names and types can be omitted if the target points at an existing external object with metadata which the connector can extract and use. Data can then be sent externally via inserts into the external table. The writeable external API provides support for locking and transactions, so multi-statement operations, rollbacks, etc can all be done as long as the connector supports them.
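To illustrate what the write side of a connector ultimately has to do, here is a minimal sketch that serializes inserted rows as CSV lines into the target file. This is only a conceptual sketch: the real connector API, and its locking and transaction hooks, are not shown, and the function and path names are hypothetical:

```python
# Hypothetical sketch of the data-writing core of a write-capable connector:
# rows arriving from an insert into the external table are appended to the
# target file as CSV. The surrounding connector protocol is omitted.
import csv

def write_rows(rows, target_path):
    """Append the given rows to the target file as CSV lines."""
    with open(target_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(rows)

# Example: rows produced by an insert-select landing in the external target.
write_rows([(1, "alpha"), (2, "beta")], "/tmp/kog_export.csv")
```

A real connector would additionally honour the locking and transaction calls mentioned above so that a rollback can discard data written during the statement.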
The standard Kognitio connectors do not yet support writing back, so this feature currently needs to be used with a custom connector. These are simple to create; you can see an example connector with write support here. Full write support will be coming to the standard connectors later this year; please get in touch if you’d like to try a pre-release version of write support in the standard connectors before then.
Long lived Java processes
The core Kognitio product is a C/C++ engine, but Java is increasingly being used to build connectors in order to take advantage of existing libraries for accessing things like HDFS, Parquet, ORC and Hive. In 8.2.2 we added long lived Java processes and a whole new Java connector API to make Java based connectors faster and less resource intensive.
The long lived Java process is a daemon which runs alongside the rest of the Kognitio software. Java objects are created within this process on behalf of other parts of the software, which then communicate with those objects via RPC calls. This removes the need to create multiple Java virtual machine instances and allows JIT compilation, etc to be done once for each object rather than once per operation.
The existing Java-based Kognitio connectors have all been converted to use the new API and the Hive connector uses it as well. These connectors are now faster and have a much lower memory overhead as a result.
Get Kognitio today
Kognitio 8.2.2 is available now on our website. It has a much quicker and more seamless setup process, so try it out today. Visit our Getting Started page to get the latest version.