• Data Warehouse Service

dws
  1. Help Center
  2. Data Warehouse Service
  3. Developer Guide
  4. SQL Reference
  5. SQL Syntax
  6. CREATE FOREIGN TABLE (SQL on Hadoop or OBS)

CREATE FOREIGN TABLE (SQL on Hadoop or OBS)

Function

In the current database, an HDFS or OBS foreign table is created so that you can access Hadoop structured data stored on HDFS or OBS. A foreign table is read-only. It can only be queried using SELECT.

Syntax

Create a Hadoop foreign table.

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name 
( [ { column_name type_name 
    [ { [CONSTRAINT constraint_name] NULL |
    [CONSTRAINT constraint_name] NOT NULL |
      column_constraint [...]} ] |
      table_constraint [, ...]} [, ...] ] ) 
    SERVER server_name 
    OPTIONS ( { option_name ' value ' } [, ...] ) 
    DISTRIBUTE BY {ROUNDROBIN | REPLICATION}
    [ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;
  • column_constraint is as follows:
    [CONSTRAINT constraint_name]
    {PRIMARY KEY | UNIQUE}
    [NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]
  • The table_constraint is as follows:
    [CONSTRAINT constraint_name]
    {PRIMARY KEY | UNIQUE} (column_name)
    [NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]

Parameter Description

  • IF NOT EXISTS

    Does not throw an error if a table with the same name exists. A notice is issued in this case.

  • table_name

    Specifies the name of the foreign table to be created.

    Value range: a string compliant with the naming convention.

  • column_name

    Specifies the name of a column in the foreign table. Columns are separated by commas (,).

    Value range: a string compliant with the naming convention.

  • type_name

    Specifies the data type of the column.

    Supported data types include: SMALLINT, INTEGER, and BIGINT, FLOAT4 (REAL), FLOAT8(DOUBLE PRECISION), DECIMAL[p (, s)] (maximum precision: 38 decimal points), DATE, TIMESTAMP, BOOLEAN, CHAR(n), VARCHAR(n), TEXT(CLOB).

  • constraint_name

    Specifies the name of a constraint for the foreign table.

  • { NULL | NOT NULL }

    Specifies whether the column allows NULL.

    When you create a table, whether the data in HDFS is NULL or NOT NULL cannot be guaranteed. The consistency of data is guaranteed by users. Users must decide whether the column is NULL or NOT NULL. (The optimizer optimizes the NULL/NOT NULL and generates a better plan.)

  • SERVER server_name

    Specifies the server name of the foreign table. Users can customize its name.

    Value range: a string. It must comply with the naming convention rule, and the server must exist.

  • OPTIONS ( { option_name ' value ' } [, ...] )

    Specifies parameters associated with the foreign table. The types of parameter are as follows:

    • header

      Specifies whether the data file contains a table header. header is available only for CSV files.

      If header is on, the first row of the data file will be identified as the header and ignored during export. If header is off, the first row is identified as data.

      Value range: true, on, false, and off. The default value is false or off.

    • quote

      Specifies the quotation mark for the CSV format. The default value is a double quotation mark (").

      NOTE:

      The quote value cannot be the same as the delimiter or null value.

      The quote value must be a single-byte character.

      Invisible characters are recommended as quote values, such as 0x07, 0x08, and 0x1b.

    • escape

      Specifies escape characters when the CSV format is used. The value must be a single-byte character.

      The default value is double quotation marks (""). If it is the same as the quote value, it will be replaced with \0.

    • location

      File path on OBS. This is an OBS foreign table parameter. The data sources of multiple buckets are separated by vertical bars (|), for example, LOCATION 'obs://bucket1/folder/ | obs://bucket2/'. The database scans all objects in the specified folders.

    • format: The format of the source data file in the foreign table. TEXT, CSV, and ORC formats are supported.
    • foldername: directory of the source data file in the foreign table in HDFS.
    • encoding: The encoding format of source data files in foreign tables. Its default value is utf8. The HDFS foreign table is optional, and the OBS foreign table is mandatory.
    • totalrows: (Optional) estimated number of rows in a table. This parameter is used only for OBS tables. Because OBS may store many files, it is slow to analyze data. This parameter allows you to set an estimated value so that the optimizer can estimate the table size according to the value. Generally, query efficiency is high when the estimated value is close to the actual value.
    • filenames: The source data files for the foreign table. Separate multiple files with commas (,).
      NOTE:
      • You are advised to use foldername to specify the locations of data sources.
      • An absolute path in foldername should be enclosed with slashes (/). Multiple paths are separated by commas (,).
      • When you query a partitioned table, data is pruned based on partition information, and data files that meet the requirement are queried. Pruning involves scanning HDFS directory contents many times. Therefore, do not use columns with low repetition as partition column.
      • An OBS foreign table is not supported.
    • delimiter

      Specifies the column delimiter of data, and uses the default delimiter if it is not set. The default delimiter of TEXT is a tab.

      NOTE:
      • A delimiter cannot be \r or \n.
      • A delimiter cannot be the same as the null parameter.
      • A delimiter cannot contain the following characters: \.abcdefghijklmnopqrstuvwxyz0123456789
      • The data length of a single row should be less than 1 GB. If the delimiters are too long and there are too many rows, the length of valid data will be affected.
      • You are advised to use a multi-character, such as the combination of the dollar sign ($), caret (^), the ampersand (&), or invisible characters, such as 0x07, 0x08, and 0x1b as the delimiter.
      • delimiter is available only for the TEXT and CSV formats.

      Value range:

      The value of delimiter can be a multi-character delimiter whose length is less than or equal to 10 bytes.

    • null
      Specifies the string that represents a null value.
      NOTE:
      • The null value cannot be \r or \n. The maximum length is 100 characters.
      • The null parameter cannot be the same as the delimiter.
      • null is available only for the TEXT and CSV formats.

      Value range:

      • The default value is \N for the TEXT format.
    • noescaping

      Specifies in TEXT format, whether to escape the backslash (\) and its following characters.

      NOTE:

      noescaping is available only for the TEXT format.

      Value range: true, on, false, and off. The default value is false or off.

    • fill_missing_fields

      Specifies whether to generate an error message when the last column in a row in the source file is lost during data loading.

      Value range: true, on, false, and off. The default value is false or off.

      • If this parameter is set to true or on and the last column of a data row in a source data file is lost, the column is replaced with NULL and no error message will be generated.
      • If this parameter is set to false or off, and the last column is missing, the following error information will be displayed:
        missing data for column "tt"
      NOTE:
      • Because SELECT COUNT(*) does not parse columns in TEXT format, it does not report missing columns.
      • fill_missing_fields is available only for the TEXT and CSV formats.
    • ignore_extra_data

      Specifies whether to ignore excessive columns when the number of source data files exceeds the number of foreign table columns. This parameter is available only for data import.

      Value range: true, on, false, and off. The default value is false or off.

      • If this parameter is set to true or on and the number of source data files exceeds the number of foreign table columns, excessive columns will be ignored.
      • If this parameter is set to false or off and the number of source data files exceeds the number of foreign table columns, the following error information will be displayed:
        extra data after last expected column
      • If the newline character at the end of a row is lost and this parameter is set to true, data in the next row will be ignored.
      • Because SELECT COUNT(*) does not parse columns in TEXT format, it does not report missing columns.
      • ignore_extra_data is available only for the TEXT and CSV formats.
    • date_format

      Imports data of the DATE type. This syntax is available only for READ ONLY foreign tables.

      Value range: Any valid DATE value. For details, see Date and Time Processing Functions and Operators.

      NOTE:
      • If ORACLE is specified as the compatible database, the DATE format is TIMESTAMP. For details, see timestamp_format below.
      • date_format is available only for the TEXT and CSV formats.
    • time_format

      Imports data of the TIME type. This syntax is available only for READ ONLY foreign tables.

      Value range: any valid TIME value. Time zones cannot be used. For details, see Date and Time Processing Functions and Operators.

      NOTE:

      time_format is available only for the TEXT and CSV formats.

    • timestamp_format

      Imports data of the TIMESTAMP type. This syntax is available only for READ ONLY foreign tables.

      Value range: any valid TIMESTAMP value. Time zones are not supported. For details, see Date and Time Processing Functions and Operators.

      NOTE:

      timestamp_format is available only for the TEXT and CSV formats.

    • smalldatetime_format

      Imports data of the SMALLDATETIME type. This syntax is available only for READ ONLY foreign tables.

      Value range: any valid SMALLDATETIME value. For details, see Date and Time Processing Functions and Operators.

      NOTE:

      smalldatetime_format is available only for the TEXT and CSV formats.

    • checkencoding

      Specifies whether to check the character encoding.

      Valid value: low, high The default value is low.
      NOTE:

      In TEXT format, the rule of error tolerance for invalid characters imported is as follows:

      \0 is converted to a space.

      Other invalid characters are converted to question marks.

      3) Setting checkencoding to low enables invalid characters toleration. If NULL and DELIMITER are set to spaces or question marks (?), errors like "illegal chars conversion may confuse null 0x20" will be displayed, prompting you to modify parameters that may cause confusion and preventing importing errors.

      In ORC format, the rule of error tolerance for invalid characters imported is as follows:

      - If checkencoding is low, an imported field containing invalid characters will be replaced with a quotation mark string of the same length.

      - If checkencoding is high, data import stops when an invalid character is detected.

    Table 1 Support for TEXT, CSV, and ORC formats

      

    OBS

    HDFS

    Parameter

    text

    csv

    orc

    text

    csv

    orc

    location

    Supported

    Supported

    Supported

    Not supported

    Not supported

    Not supported

    format

    Supported

    Supported

    Supported

    Supported

    Supported

    Supported

    header

    Not supported

    Supported

    Not supported

    Not supported

    Supported

    Not supported

    delimiter

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    quote

    Not supported

    Supported

    Not supported

    Not supported

    Supported

    Not supported

    escape

    Not supported

    Supported

    Not supported

    Not supported

    Supported

    Not supported

    null

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    noescaping

    Supported

    Not supported

    Not supported

    Supported

    Not supported

    Not supported

    encoding

    Supported

    Supported

    Supported

    Supported

    Supported

    Supported

    fill_missing_fields

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    ignore_extra_data

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    date_format

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    time_format

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    timestamp_format

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    smalldatetime_format

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    chunksize

    Supported

    Supported

    Not supported

    Supported

    Supported

    Not supported

    filenames

    Not supported

    Not supported

    Not supported

    Supported

    Supported

    Supported

    foldername

    Supported

    Supported

    Supported

    Supported

    Supported

    Supported

    check_encoding

    Supported

    Supported

    Supported

    Supported

    Supported

    Supported

    totalrows

    Supported

    Supported

    Supported

    Not supported

    Not supported

    Not supported

  • DISTRIBUTE BY ROUNDROBIN

    Specifies ROUNDROBIN as the distribution mode for an HDFS or OBS foreign table.

  • DISTRIBUTE BY REPLICATION

    Specifies REPLICATION as the distribution mode for the HDFS foreign table.

  • PARTITION BY ( column_name ) AUTOMAPPED
    column_name specifies the partitioned table. AUTOMAPPED means the partition column specified by the HDFS partitioned foreign table is automatically mapped with the partition directory information in HDFS. The prerequisite is that the sequences of partition columns specified in the HDFS foreign table and in the directory are the same.
    NOTE:

    Partitioned tables can be used as foreign tables for HDFS, but not for OBS.

  • CONSTRAINT constraint_name

    Specifies the name of informational constraint of the foreign table.

    Value range: a string compliant with the naming convention.

  • PRIMARY KEY

    The primary key constraint specifies that one or more columns of a table must contain unique (non-duplicate) and non-null values. Only one primary key can be specified for a table.

  • UNIQUE

    Specifies that a group of one or more columns of a table must contain unique values. For the purpose of a unique constraint, NULL is not considered equal.

  • NOT ENFORCED

    Specifies the constraint to be an informational constraint. This constraint is guaranteed by the user instead of the database.

  • ENFORCED

    The default value is ENFORCED. ENFORCED is a reserved parameter and is currently not supported.

  • PRIMARY KEY (column_name)

    Specifies the informational constraint on column_name.

    Value range: a string. It must comply with the naming convention, and the value of column_name must exist.

  • ENABLE QUERY OPTIMIZATION

    Optimizes an execution plan using an informational constraint.

  • DISABLE QUERY OPTIMIZATION

    Disables the optimization of an execution plan using an informational constraint.

Informational Constraint

Context

The database system on data constraints is classified into five types, including: Not Null constraint, Unique constraint, Primary Key constraint, Foreign Key constraint, and Check constraint. When the database inserts or updates data, five constraints forcibly executed by the database may generate a large amount of system overhead and affects performance of data import or change.

If the data has in advance completed some form of constraints, and the database system also complies with those constraints, which are called informational constraints. An informational constraint is not a constraint on data from the database system. The compiler in the database can use the constraint to improve efficiency when accessing data and performing operator operations. The informational constraint is not forcibly implemented when the data is being inserted or modified, and is not used to verify data. It is used to improve query performance.

In DWS, data is stored in HDFS. DWS does not support writing data to HDFS. It is the user's responsibility to ensure enforcement of constraints. If the source data is compliant with certain informational constraint requirements, the query of such data can achieve higher efficiency.

Unique and Primary Key constraints are created with Index. Use Index Scan to scan operators during optimizer phase to improve the efficiency of data query. Because indexes cannot be used in HDFS foreign tables, informational constraints are used to optimize execution plans.

The constraints of creating informational constraints for an HDFS foreign table are as follows:

  • You can create an informational constraint only if the values in a NOT NULL column in your table are unique. Otherwise, the query result will be different from expected.
  • Currently, the informational constraint supports only PRIMARY KEY and UNIQUE constraints.
  • Informational constraints currently support only the NOT ENFORCED attribute.
  • Both an HDFS foreign table and an HDFS partitioned foreign table supports informational constraint, which is also established in a partitioned column).
  • UNIQUE informational constraints can be created for multiple columns in a table, but only one PRIMARY KEY constraint can be created in a table.
  • Multiple informational constraints can be established in a column of a table (because the function that establishing a column or multiple constraints in a column is the same.) Therefore, you are not advised to set up multiple informational constraints in a column, and only one Primary Key type can be set up.
  • Multi-column combination constraints are not supported.

Example

Example 1: In HDFS, import the TPC-H benchmark test tables part and region using Hive. The path of the part table is /user/hive/warehouse/partition.db/part_4, and that of the region table is /user/hive/warehouse/mppdb.db/region_orc11_64stripe/.

  1. Create HDFS_Server, with HDFS_FDW as the foreign data wrapper.
    -- Create HDFS_Server:
    CREATE SERVER hdfs_server FOREIGN DATA WRAPPER HDFS_FDW OPTIONS (address '10.10.0.100:25000,10.10.0.101:25000',hdfscfgpath '/opt/hadoop_client/HDFS/hadoop/etc/hadoop');
    NOTE:

    The IP addresses and port numbers of NameNodes corresponding to the HDFS cluster are written into options. There are two groups of NameNode addresses and port numbers in '10.10.0.100:25000,10.10.0.101:25000', indicating the primary and standby NameNodes of HDFS, respectively. You are advised to provide primary and standby node information in this format. Two groups of parameter values are separated by commas (,). Take '10.10.0.100:25000' as an example. In this example, the IP address is 10.10.0.100:25000, and the port number is 25000.

  2. Create an HDFS foreign table.
    -- Create an HDFS foreign table that does not contain any partition columns. The HDFS server associated with the table is hdfs_server, the corresponding file format of region on the HDFS server is 'orc', and the file directory in the HDFS file system is '/user/hive/warehouse/mppdb. db/region_orc11_64stripe/'.
    CREATE FOREIGN TABLE region
    (
        R_REGIONKEY INT4,
        R_NAME TEXT,
        R_COMMENT TEXT
    )
    SERVER
        hdfs_server
    OPTIONS
    (
        FORMAT 'orc',
        encoding 'utf8',
        FOLDERNAME '/user/hive/warehouse/mppdb.db/region_orc11_64stripe/'
    )
    DISTRIBUTE BY 
         roundrobin;
    
    -- Create an HDFS foreign table that contains partition columns.
    CREATE FOREIGN TABLE part 
    (
         p_partkey int, 
         p_name text, 
         p_mfgr text, 
         p_brand text, 
         p_type text, 
         p_size int, 
         p_container text, 
         p_retailprice float8, 
         p_comment text
    )
    SERVER
         hdfs_server
    OPTIONS
    (
         FORMAT 'orc',
         encoding 'utf8',
         FOLDERNAME '/user/hive/warehouse/partition.db/part_4'
    )
    DISTRIBUTE BY 
         roundrobin
    PARTITION BY 
         (p_mfgr) AUTOMAPPED;
    NOTE:

    DWS allows you to specify files using the keyword filenames or foldername. foldername is recommended. The key word distribute specifies the storage distribution mode of the region table.

  3. Browse the created foreign table.
    -- Browse the foreign table:
    SELECT * FROM pg_foreign_table WHERE ftrelid='region'::regclass;
     ftrelid | ftserver | ftwriteonly |                                  ftoptions
    ---------+----------+-------------+------------------------------------------------------------------------------
       16510 |    16509 | f           | {format=orc,foldername=/user/hive/warehouse/mppdb.db/region_orc11_64stripe/}
    (1 row)
    
    select * from pg_foreign_table where ftrelid='part'::regclass;
     ftrelid | ftserver | ftwriteonly |                            ftoptions
    ---------+----------+-------------+------------------------------------------------------------------
       16513 |    16509 | f           | {format=orc,foldername=/user/hive/warehouse/partition.db/part_4}
    (1 row)
  4. Modify and delete the foreign table.
    -- Modify a foreign table:
    ALTER FOREIGN TABLE region ALTER r_name TYPE TEXT;
    ALTER FOREIGN TABLE
    LTER FOREIGN TABLE region ALTER r_name SET NOT NULL;
    ALTER FOREIGN TABLE
    -- Delete the foreign table:
    DROP FOREIGN TABLE region;
    DROP FOREIGN TABLE 

  

Example 2: Operations on an HDFS foreign table that includes informational constraints

-- Create an HDFS foreign table with informational constraints
CREATE FOREIGN TABLE region  (
 R_REGIONKEY  int,
 R_NAME TEXT,
 R_COMMENT TEXT
  , primary key (R_REGIONKEY) not enforced)
SERVER hdfs_server
OPTIONS(format 'orc',
    encoding 'utf8',
 foldername '/user/hive/warehouse/mppdb.db/region_orc11_64stripe')
DISTRIBUTE BY roundrobin;

-- Check whether the region table has an informational constraint index:
SELECT relname,relhasindex FROM pg_class WHERE oid='region'::regclass;
        relname         | relhasindex 
------------------------+-------------
        region          | f
(1 row)

SELECT conname, contype, consoft, conopt, conindid, conkey FROM pg_constraint WHERE conname ='region_pkey';
   conname   | contype | consoft | conopt | conindid | conkey
-------------+---------+---------+--------+----------+--------
 region_pkey | p       | t       | t      |        0 | {1}
(1 row)

-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT region_pkey RESTRICT;

SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='region_pkey';
 conname | contype | consoft | conindid | conkey 
---------+---------+---------+----------+--------
(0 rows)

-- Add a unique informational constraint for the foreign table:
ALTER FOREIGN TABLE region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED;

-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT constr_unique RESTRICT;

SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='constr_unique';
 conname | contype | consoft | conindid | conkey 
---------+---------+---------+----------+--------
(0 rows)

-- Add a unique informational constraint for the foreign table:
ALTER FOREIGN TABLE region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED disable query optimization;

SELECT relname,relhasindex FROM pg_class WHERE oid='region'::regclass;
        relname         | relhasindex 
------------------------+-------------
        region          | f
(1 row)

-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT constr_unique CASCADE;

-- Delete the region table:
DROP FOREIGN TABLE region;

Example 3: Read data in OBS using a foreign table.

  1. Create obs_server, with DFS_FDW as the foreign data wrapper.
    -- Create obs_server:
    CREATE SERVER obs_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS ( 
      ADDRESS 'obs.abc.com', 
       ACCESS_KEY 'xxxxxxxxx', 
      SECRET_ACCESS_KEY 'yyyyyyyyyyyyy', 
      TYPE 'OBS'
    );
    NOTE:
    • ADDRESS indicates the IP address or domain name of OBS. Replace its value as required. You can find the domain name by searching for the value of regionCode in the region_map file.
    • ACCESS_KEY and SECRET_ACCESS_KEY are access keys for the cloud account system. Replace their values as required.
    • TYPE indicates the server type. Retain the value OBS.
  2. Create an OBS foreign table.
    --- Create the customer_address foreign table that does not contain partition columns. obs_server is the associated OBS server. Files on this server are in .orc format and stored in the user/hive/warehouse/mppdb.db/region_orc11_64stripe1/ directory.
    CREATE FOREIGN TABLE customer_address
    (
        ca_address_sk             integer               not null,
        ca_address_id             char(16)              not null,
        ca_street_number          char(10)                      ,   
        ca_street_name            varchar(60)                   ,   
        ca_street_type            char(15)                      ,   
        ca_suite_number           char(10)                      ,   
        ca_city                   varchar(60)                   ,   
        ca_county                 varchar(30)                   ,   
        ca_state                  char(2)                       ,   
        ca_zip                    char(10)                      ,   
        ca_country                varchar(20)                   ,   
        ca_gmt_offset             decimal(36,33)                  ,   
        ca_location_type          char(20)    
    ) 
    SERVER obs_server OPTIONS (
        FOLDERNAME '/user/hive/warehouse/mppdb.db/region_orc11_64stripe1/',
        FORMAT 'ORC',
        ENCODING 'utf8',
        TOTALROWS  '20'
    )
    DISTRIBUTE BY roundrobin;
  3. Query data stored in OBS using a foreign table.
    -- Browse the foreign table:
    SELECT COUNT(*) FROM customer_address;
     count 
    -------
        20
    (1 row)
  4. Delete the foreign table.
    -- Delete the foreign table:
    DROP FOREIGN TABLE customer_address;
    DROP FOREIGN TABLE