Spatial Indexes¶
Quick Example¶
create table IF NOT EXISTS carbonTable
(
COLUMN1 BIGINT,
LONGITUDE BIGINT,
LATITUDE BIGINT,
COLUMN2 BIGINT,
COLUMN3 BIGINT
)
STORED AS carbondata
TBLPROPERTIES ('SPATIAL_INDEX.mygeohash.type'='geohash','SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude','SPATIAL_INDEX.mygeohash.originLatitude'='39.850713','SPATIAL_INDEX.mygeohash.gridSize'='50','SPATIAL_INDEX.mygeohash.minLongitude'='115.828503','SPATIAL_INDEX.mygeohash.maxLongitude'='720.000000','SPATIAL_INDEX.mygeohash.minLatitude'='39.850713','SPATIAL_INDEX.mygeohash.maxLatitude'='720.000000','SPATIAL_INDEX'='mygeohash','SPATIAL_INDEX.mygeohash.conversionRatio'='1000000','SORT_COLUMNS'='column1,column2,column3,latitude,longitude');
Introduction to Spatial Indexes¶
Spatial data includes multidimensional points, lines, rectangles, cubes, polygons, and other geometric objects. A spatial data object occupies a certain region of space, called spatial scope, characterized by its location and boundary. The spatial data can be either point data or region data.
Point data: A point has a spatial extent characterized completely by its location. It does not occupy space and has no associated boundary. Point data consists of a collection of points in a two-dimensional space. Points can be stored as a pair of longitude and latitude.
Region data: A region has a spatial extent with a location, and boundary. The location can be considered as the position of a fixed point in the region, such as its centroid. In two dimensions, the boundary can be visualized as a line (for finite regions, a closed loop). Region data contains a collection of regions.
Currently, only point data is supported, and it can be stored.
Longitude and latitude can be encoded as a unique GeoID. Geohash is a public-domain geocoding system invented by Gustavo Niemeyer. It encodes geographical locations into a short string of letters and digits. It is a hierarchical spatial data structure which subdivides the space into buckets of grid shape, which is one of the many applications of what is known as the Z-order curve, and generally the space-filling curve.
The Z value of a point in multiple dimensions is calculated by interleaving the binary representation of its coordinate value, as shown in the following figure. When Geohash is used to create a GeoID, data is sorted by GeoID instead of longitude and latitude. Data is stored by spatial proximity.
Creating a Table¶
GeoHash encoding:
create table IF NOT EXISTS carbonTable
(
...
`LONGITUDE` BIGINT,
`LATITUDE` BIGINT,
...
)
STORED AS carbondata
TBLPROPERTIES ('SPATIAL_INDEX.mygeohash.type'='geohash','SPATIAL_INDEX.mygeohash.sourcecolumns'='longitude, latitude','SPATIAL_INDEX.mygeohash.originLatitude'='xx.xxxxxx','SPATIAL_INDEX.mygeohash.gridSize'='xx','SPATIAL_INDEX.mygeohash.minLongitude'='xxx.xxxxxx','SPATIAL_INDEX.mygeohash.maxLongitude'='xxx.xxxxxx','SPATIAL_INDEX.mygeohash.minLatitude'='xx.xxxxxx','SPATIAL_INDEX.mygeohash.maxLatitude'='xxx.xxxxxx','SPATIAL_INDEX'='mygeohash','SPATIAL_INDEX.mygeohash.conversionRatio'='1000000','SORT_COLUMNS'='column1,column2,column3,latitude,longitude');
SPATIAL_INDEX is a user-defined index handler. This handler allows users to create new columns from the table-structure column set. The new column name is the same as that of the handler name. The type and sourcecolumns properties of the handler are mandatory. Currently, the value of type supports only geohash. Carbon provides a default implementation class that can be easily used. You can extend the default implementation class to mount the customized implementation class of geohash. The default handler also needs to provide the following table properties:
SPATIAL_INDEX.xxx.originLatitude: specifies the origin latitude. (Double type.)
SPATIAL_INDEX.xxx.gridSize: specifies the grid length in meters. (Int type.)
SPATIAL_INDEX.xxx.minLongitude: specifies the minimum longitude. (Double type.)
SPATIAL_INDEX.xxx.maxLongitude: specifies the maximum longitude. (Double type.)
SPATIAL_INDEX.xxx.minLatitude: specifies the minimum latitude. (Double type.)
SPATIAL_INDEX.xxx.maxLatitude: specifies the maximum latitude. (Double type.)
SPATIAL_INDEX.xxx.conversionRatio: used to convert the small value of the longitude and latitude to an integer. (Int type.)
You can add your own table properties to the handlers in the above format and access them in your custom implementation class. originLatitude, gridSize, and conversionRatio are mandatory. Other parameters are optional in Carbon. You can use the SPATIAL_INDEX.xxx.class property to specify their implementation classes.
The default implementation class can generate handler column values for sourcecolumns in each row and support query based on the sourcecolumns filter criteria. The generated handler column is invisible to users. Except the SORT_COLUMNS table properties, no DDL commands or properties are allowed to contain the handler column.
Note
By default, the generated handler column is regarded as the sorting column. If SORT_COLUMNS does not contain any sourcecolumns, add the handler column to the end of the existing SORT_COLUMNS. If the handler column has been specified in SORT_COLUMNS, its order in SORT_COLUMNS remains unchanged.
If SORT_COLUMNS contains any sourcecolumns but does not contain the handler column, the handler column is automatically inserted before sourcecolumns in SORT_COLUMNS.
If SORT_COLUMNS needs to contain any sourcecolumns, ensure that the handler column is listed before the sourcecolumns so that the handler column can take effect during sorting.
GeoSOT encoding:
CREATE TABLE carbontable(
...
longitude DOUBLE,
latitude DOUBLE,
...)
STORED AS carbondata
TBLPROPERTIES ('SPATIAL_INDEX'='xxx',
'SPATIAL_INDEX.xxx.type'='geosot',
'SPATIAL_INDEX.xxx.sourcecolumns'='longitude, latitude',
'SPATIAL_INDEX.xxx.level'='21',
'SPATIAL_INDEX.xxx.class'='org.apache.carbondata.geo.GeoSOTIndex')
Parameter | Description |
---|---|
SPATIAL_INDEX | Specifies the spatial index. Its value is the same as the column name. |
SPATIAL_INDEX.xxx.type | (Mandatory) The value is set to geosot. |
SPATIAL_INDEX.xxx.sourcecolumns | (Mandatory) Specifies the source columns for calculating the spatial index. The value must be two existing columns of the double type. |
SPATIAL_INDEX.xxx.level | (Optional) Specifies the columns for calculating the spatial index. The default value is 17, through which you can obtain an accurate result and improve the computing performance. |
SPATIAL_INDEX.xxx.class | (Optional) Specifies the implementation class of GeoSOT. The default value is org.apache.carbondata.geo.GeoSOTIndex. |
Example:
create table geosot(
timevalue bigint,
longitude double,
latitude double)
stored as carbondata
TBLPROPERTIES ('SPATIAL_INDEX'='mygeosot',
'SPATIAL_INDEX.mygeosot.type'='geosot',
'SPATIAL_INDEX.mygeosot.level'='21', 'SPATIAL_INDEX.mygeosot.sourcecolumns'='longitude, latitude');
Preparing Data¶
Data file 1: geosotdata.csv
timevalue,longitude,latitude 1575428400000,116.285807,40.084087 1575428400000,116.372142,40.129503 1575428400000,116.187332,39.979316 1575428400000,116.337069,39.951887 1575428400000,116.359102,40.154684 1575428400000,116.736367,39.970323 1575428400000,116.720179,40.009893 1575428400000,116.346961,40.13355 1575428400000,116.302895,39.930753 1575428400000,116.288955,39.999101 1575428400000,116.17609,40.129953 1575428400000,116.725575,39.981115 1575428400000,116.266922,40.179415 1575428400000,116.353706,40.156483 1575428400000,116.362699,39.942444 1575428400000,116.325378,39.963129
Data file 2: geosotdata2.csv
timevalue,longitude,latitude 1575428400000,120.17708,30.326882 1575428400000,120.180685,30.326327 1575428400000,120.184976,30.327105 1575428400000,120.189311,30.327549 1575428400000,120.19446,30.329698 1575428400000,120.186965,30.329133 1575428400000,120.177481,30.328911 1575428400000,120.169713,30.325614 1575428400000,120.164563,30.322243 1575428400000,120.171558,30.319613 1575428400000,120.176365,30.320687 1575428400000,120.179669,30.323688 1575428400000,120.181001,30.320761 1575428400000,120.187094,30.32354 1575428400000,120.193574,30.323651 1575428400000,120.186192,30.320132 1575428400000,120.190055,30.317464 1575428400000,120.195376,30.318094 1575428400000,120.160786,30.317094 1575428400000,120.168211,30.318057 1575428400000,120.173618,30.316612 1575428400000,120.181001,30.317316 1575428400000,120.185162,30.315908 1575428400000,120.192415,30.315871 1575428400000,120.161902,30.325614 1575428400000,120.164306,30.328096 1575428400000,120.197093,30.325985 1575428400000,120.19602,30.321651 1575428400000,120.198638,30.32354 1575428400000,120.165421,30.314834
Importing Data¶
The GeoHash default implementation class extends the customized index abstract class. If the handler property is not set to a customized implementation class, the default implementation class is used. You can extend the default implementation class to mount the customized implementation class of geohash. The methods of the customized index abstract class are as follows:
Init method: Used to extract, verify, and store the handler property. If the operation fails, the system throws an exception and displays the error information.
Generate method: Used to generate indexes. It generates an index for each row of data.
Query method: Used to generate an index value range list for given input.
The commands for importing data are the same as those for importing common Carbon tables.
LOAD DATA inpath '/tmp/geosotdata.csv' INTO TABLE geosot OPTIONS ('DELIMITER'= ',');
LOAD DATA inpath '/tmp/geosotdata2.csv' INTO TABLE geosot OPTIONS ('DELIMITER'= ',');
Note
For details about geosotdata.csv and geosotdata2.csv, see Preparing Data.
Aggregate Query of Irregular Spatial Sets¶
Query statements and filter UDFs
Filtering data based on polygon
IN_POLYGON(pointList)
UDF input parameter
Parameter
Type
Description
pointList
String
Enter multiple points as a string. Each point is presented as longitude latitude. Longitude and latitude are separated by a space. Each pair of longitude and latitude is separated by a comma (,). The longitude and latitude values at the start and end of the string must be the same.
UDF output parameter
Parameter
Type
Description
inOrNot
Boolean
Checks whether data is in the specified polygon_list.
Example:
select longitude, latitude from geosot where IN_POLYGON('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 116.321011 40.123503');
Filtering data based on the polygon list
IN_POLYGON_LIST(polygonList, opType)
UDF input parameters
Parameter
Type
Description
polygonList
String
Inputs multiple polygons as a string. Each polygon is presented as POLYGON ((longitude1 latitude1, longitude2 latitude2, …)). Note that there is a space after POLYGON. Longitudes and latitudes are separated by spaces. Each pair of longitude and latitude is separated by a comma (,). The longitudes and latitudes at the start and end of a polygon must be the same. IN_POLYGON_LIST requires at least two polygons.
Example:
POLYGON ((116.137676 40.163503, 116.137676 39.935276, 116.560993 39.935276, 116.137676 40.163503))
opType
String
Performs union, intersection, and subtraction on multiple polygons.
Currently, the following operation types are supported:
OR: A U B U C (Assume that three polygons A, B, and C are input.)
AND: A ∩ B ∩ C
UDF output parameter
Parameter
Type
Description
inOrNot
Boolean
Checks whether data is in the specified polygon_list.
Example:
select longitude, latitude from geosot where IN_POLYGON_LIST('POLYGON ((120.176433 30.327431,120.171283 30.322245,120.181411 30.314540, 120.190509 30.321653,120.185188 30.329358,120.176433 30.327431)), POLYGON ((120.191603 30.328946,120.184179 30.327465,120.181819 30.321464, 120.190359 30.315388,120.199242 30.324464,120.191603 30.328946))', 'OR');
Filtering data based on the polyline list
IN_POLYLINE_LIST(polylineList, bufferInMeter)
UDF input parameters
Parameter
Type
Description
polylineList
String
Inputs multiple polylines as a string. Each polyline is presented as LINESTRING (longitude1 latitude1, longitude2 latitude2, …). Note that there is a space after LINESTRING. Longitudes and latitudes are separated by spaces. Each pair of longitude and latitude is separated by a comma (,).
A union will be output based on the data in multiple polylines.
Example:
LINESTRING (116.137676 40.163503, 116.137676 39.935276, 116.260993 39.935276)
bufferInMeter
Float
Polyline buffer distance, in meters. Right angles are used at the end to create a buffer.
UDF output parameter
Parameter
Type
Description
inOrNot
Boolean
Checks whether data is in the specified polyline_list.
Example:
select longitude, latitude from geosot where IN_POLYLINE_LIST('LINESTRING (120.184179 30.327465, 120.191603 30.328946, 120.199242 30.324464, 120.190359 30.315388)', 65);
Filtering data based on the GeoID range list
IN_POLYGON_RANGE_LIST(polygonRangeList, opType)
UDF input parameters
Parameter
Type
Description
polygonRangeList
String
Inputs multiple rangeLists as a string. Each rangeList is presented as RANGELIST (startGeoId1 endGeoId1, startGeoId2 endGeoId2, …). Note that there is a space after RANGELIST. Start GeoIDs and end GeoIDs are separated by spaces. Each group of GeoID ranges is separated by a comma (,).
Example:
RANGELIST (855279368848 855279368850, 855280799610 855280799612, 855282156300 855282157400)
opType
String
Performs union, intersection, and subtraction on multiple rangeLists.
Currently, the following operation types are supported:
OR: A U B U C (Assume that three rangeLists A, B, and C are input.)
AND: A ∩ B ∩ C
UDF output parameter
Parameter
Type
Description
inOrNot
Boolean
Checks whether data is in the specified polyRange_list.
Example:
select mygeosot, longitude, latitude from geosot where IN_POLYGON_RANGE_LIST('RANGELIST (526549722865860608 526549722865860618, 532555655580483584 532555655580483594)', 'OR');
Performing polygon query
IN_POLYGON_JOIN(GEO_HASH_INDEX_COLUMN, POLYGON_COLUMN)
Perform join query on two tables. One is a spatial data table containing the longitude, latitude, and GeoHashIndex columns, and the other is a dimension table that saves polygon data.
During query, IN_POLYGON_JOIN UDF, GEO_HASH_INDEX_COLUMN, and POLYGON_COLUMN of the polygon table are used. Polygon_column specifies the column containing multiple points (longitude and latitude pairs). The first and last points in each row of the Polygon table must be the same. All points in each row form a closed geometric shape.
UDF input parameters
Parameter
Type
Description
GEO_HASH_INDEX_COLUMN
Long
GeoHashIndex column of the spatial data table.
POLYGON_COLUMN
String
Polygon column of the polygon table, the value of which is represented by the string of polygon, for example, POLYGON (( longitude1 latitude1, longitude2 latitude2, ...)).
Example:
CREATE TABLE polygonTable( polygon string, poiType string, poiId String) STORED AS carbondata; insert into polygonTable select 'POLYGON ((120.176433 30.327431,120.171283 30.322245, 120.181411 30.314540,120.190509 30.321653,120.185188 30.329358,120.176433 30.327431))','abc','1'; insert into polygonTable select 'POLYGON ((120.191603 30.328946,120.184179 30.327465, 120.181819 30.321464,120.190359 30.315388,120.199242 30.324464,120.191603 30.328946))','abc','2'; select t1.longitude,t1.latitude from geosot t1 inner join (select polygon,poiId from polygonTable where poitype='abc') t2 on in_polygon_join(t1.mygeosot,t2.polygon) group by t1.longitude,t1.latitude;
Performing range_list query
IN_POLYGON_JOIN_RANGE_LIST(GEO_HASH_INDEX_COLUMN, POLYGON_COLUMN)
Use the IN_POLYGON_JOIN_RANGE_LIST UDF to associate the spatial data table with the polygon dimension table based on Polygon_RangeList. By using a range list, you can skip the conversion between a polygon and a range list.
UDF input parameters
Parameter
Type
Description
GEO_HASH_INDEX_COLUMN
Long
GeoHashIndex column of the spatial data table.
POLYGON_COLUMN
String
Rangelist column of the Polygon table, the value of which is represented by the string of rangeList, for example, RANGELIST (startGeoId1 endGeoId1, startGeoId2 endGeoId2, ...).
Example:
CREATE TABLE polygonTable( polygon string, poiType string, poiId String) STORED AS carbondata; insert into polygonTable select 'RANGELIST (526546455897309184 526546455897309284, 526549831217315840 526549831217315850, 532555655580483534 532555655580483584)','xyz','2'; select t1.* from geosot t1 inner join (select polygon,poiId from polygonTable where poitype='xyz') t2 on in_polygon_join_range_list(t1.mygeosot,t2.polygon);
UDFs of spacial index tools
Obtaining row number and column number of a grid converted from GeoID
GeoIdToGridXy(geoId)
UDF input parameter
Parameter
Type
Description
geoId
Long
Calculates the row number and column number of the grid based on GeoID.
UDF output parameter
Parameter
Type
Description
gridArray
Array[Int]
Returns the grid row and column numbers contained in GeoID in array. The first digit indicates the row number, and the second digit indicates the column number.
Example:
select longitude, latitude, mygeohash, GeoIdToGridXy(mygeohash) as GridXY from geoTable;
Converting longitude and latitude to GeoID
LatLngToGeoId(latitude, longitude oriLatitude, gridSize)
UDF input parameters
Parameter
Type
Description
longitude
Long
Longitude. Note: The value is an integer after conversion.
latitude
Long
Latitude. Note: The value is an integer after conversion.
oriLatitude
Double
Origin latitude, required for calculating GeoID.
gridSize
Int
Grid size, required for calculating GeoID.
UDF output parameter
Parameter
Type
Description
geoId
Long
Returns a number that indicates the longitude and latitude after coding.
Example:
select longitude, latitude, mygeohash, LatLngToGeoId(latitude, longitude, 39.832277, 50) as geoId from geoTable;
Converting GeoID to longitude and latitude
GeoIdToLatLng(geoId, oriLatitude, gridSize)
UDF input parameters
Parameter
Type
Description
geoId
Long
Calculates the longitude and latitude based on GeoID.
oriLatitude
Double
Origin latitude, required for calculating the longitude and latitude.
gridSize
Int
Grid size, required for calculating the longitude and latitude.
Note
GeoID is generated based on the grid coordinates, which are the grid center. Therefore, the calculated longitude and latitude are the longitude and latitude of the grid center. There may be an error ranging from 0 degree to half of the grid size between the calculated longitude and latitude and the longitude and latitude of the generated GeoID.
UDF output parameter
Parameter
Type
Description
latitudeAndLongitude
Array[Double]
Returns the longitude and latitude coordinates of the grid center that represent the GeoID in array. The first digit indicates the latitude, and the second digit indicates the longitude.
Example:
select longitude, latitude, mygeohash, GeoIdToLatLng(mygeohash, 39.832277, 50) as LatitudeAndLongitude from geoTable;
Calculating the upper-layer GeoID of the pyramid model
ToUpperLayerGeoId(geoId)
UDF input parameter
Parameter
Type
Description
geoId
Long
Calculates the upper-layer GeoID of the pyramid model based on the input GeoID.
UDF output parameter
Parameter
Type
Description
geoId
Long
Returns the upper-layer GeoID of the pyramid model.
Example:
select longitude, latitude, mygeohash, ToUpperLayerGeoId(mygeohash) as upperLayerGeoId from geoTable;
Obtaining the GeoID range list using the input polygon
ToRangeList(polygon, oriLatitude, gridSize)
UDF input parameters
Parameter
Type
Description
polygon
String
Input polygon string, which is a pair of longitude and latitude.
Longitude and latitude are separated by a space. Each pair of longitude and latitude is separated by a comma (,). The longitude and latitude at the start and end must be the same.
oriLatitude
Double
Origin latitude, required for calculating GeoID.
gridSize
Int
Grid size, required for calculating GeoID.
UDF output parameter
Parameter
Type
Description
geoIdList
Buffer[Array[Long]]
Converts polygons into GeoID range lists.
Example:
select ToRangeList('116.321011 40.123503, 116.137676 39.947911, 116.560993 39.935276, 116.321011 40.123503', 39.832277, 50) as rangeList from geoTable;
Calculating the upper-layer longitude of the pyramid model
ToUpperLongitude (longitude, gridSize, oriLat)
UDF input parameters
Parameter
Type
Description
longitude
Long
Input longitude, which is a long integer.
gridSize
Int
Grid size, required for calculating longitude.
oriLatitude
Double
Origin latitude, required for calculating longitude.
UDF output parameter
Parameter
Type
Description
longitude
Long
Returns the upper-layer longitude.
Example:
select ToUpperLongitude (-23575161504L, 50, 39.832277) as upperLongitude from geoTable;
Calculating the upper-layer latitude of the pyramid model
ToUpperLatitude(Latitude, gridSize, oriLat)
UDF input parameters
Parameter
Type
Description
latitude
Long
Input latitude, which is a long integer.
gridSize
Int
Grid size, required for calculating latitude.
oriLatitude
Double
Origin latitude, required for calculating latitude.
UDF output parameter
Parameter
Type
Description
Latitude
Long
Returns the upper-layer latitude.
Example:
select ToUpperLatitude (-23575161504L, 50, 39.832277) as upperLatitude from geoTable;
Converting longitude and latitude to GeoSOT
LatLngToGridCode(latitude, longitude, level)
UDF input parameters
Parameter
Type
Description
latitude
Double
Latitude.
longitude
Double
Longitude.
level
Int
Level. The value range is [0, 32].
UDF output parameter
Parameter
Type
Description
geoId
Long
A number that indicates the longitude and latitude after GeoSOT encoding.
Example:
select LatLngToGridCode(39.930753, 116.302895, 21) as geoId;