Sqoop 之常用命令与防坑指南

发表于 2018-11-13 分类于技术一路走到黑阅读次数：本文字数： 4.6k 阅读时长 ≈ 4 分钟

Sqoop 的功能十分强大，可以帮助你完成不同数据库或数据仓库之间的数据同步任务。

总所周知的是 Sqoop 有两个版本

1.4.x
1.99.x

它们一个代表着 sqoop1，一个代表着sqoop2，它们功能性上的异同可简单归纳为以下几点，其余差异不在本文做过多的赘述。

功能	Sqoop 1	Sqoop 2
用于所有主要 RDBMS 的连接器	支持	不支持解决办法：使用已在以下数据库上执行测试的通用 JDBC 连接器： Microsoft SQL Server 、 PostgreSQL 、 MySQL 和 Oracle 。
Kerberos 安全集成	支持	不支持
数据从 RDBMS 传输至 Hive 或 HBase	支持	不支持解决办法：按照此两步方法操作。将数据从 RDBMS 导入 HDFS 在 Hive 中使用相应的工具和命令（例如 LOAD DATA 语句），手动将数据载入 Hive 或 HBase
数据从 Hive 或 HBase 传输至 RDBMS	不支持解决办法：按照此两步方法操作。从 Hive 或 HBase 将数据提取至 HDFS （作为文本或 Avro 文件）使用 Sqoop 将上一步的输出导出至 RDBMS	不支持按照与 Sqoop 1 相同的解决方法操作

关于两者在其他方面的异同可以参考此文：Sqoop1和Sqoop2的刨析对比

出发

本文的背景环境为 sqoop 1.4.6

sqoop 的命令格式十分简单，只需要往上累加需要的参数即可

$ sqoop help
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import mainframe datasets to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.

这篇文章叙述的主要对象为 create hive table 、sqoop import ，它们不仅仅是我最为常用的命令，也是 sqoop 家族里使用频率比较高的命令。

通用命令参数

该表格记录了一些命令的通用操作选项：

Argument	Description
`--connect <jdbc-uri>`	Specify JDBC connect string
`--connection-manager <class-name>`	Specify connection manager class to use
`--driver <class-name>`	Manually specify JDBC driver class to use
`--hadoop-mapred-home <dir>`	Override $HADOOP_MAPRED_HOME
`--help`	Print usage instructions
`--password-file`	Set path for a file containing the authentication password
`-P`	Read password from console
`--password <password>`	Set authentication password
`--username <username>`	Set authentication username
`--verbose`	Print more information while working
`--connection-param-file <filename>`	Optional properties file that provides connection parameters
`--relaxed-isolation`	Set connection transaction isolation to read uncommitted for the mappers.

从参数名也可以看出，都是访问过程中必要的连接信息。

Sqoop Create Hive Table

如果你想复制源数据库的表结构至目标库，那就试试这条指令：

1	$ sqoop create-hive-table (generic-args) (create-hive-table-args)

Hive 参数

Argument	Description
`--hive-home <dir>`	Override `$HIVE_HOME`
`--hive-overwrite`	Overwrite existing data in the Hive table.
`--create-hive-table`	If set, then the job will fail if the target hive
	table exits. By default this property is false.
`--hive-table <table-name>`	Sets the table name to use when importing to Hive.
`--table`	The database table to read the definition from.

输出格式定义参数

Argument	Description
`--enclosed-by <char>`	Sets a required field enclosing character
`--escaped-by <char>`	Sets the escape character
`--fields-terminated-by <char>`	Sets the field separator character
`--lines-terminated-by <char>`	Sets the end-of-line character
`--mysql-delimiters`	Uses MySQL’s default delimiter set: fields: `,` lines: `\n` escaped-by: `\` optionally-enclosed-by: `'`
`--optionally-enclosed-by <char>`	Sets a field enclosing character

示例

其实也只需要指定相关数据库库的库名表名也就可以了

$ sqoop create-hive-table 
    --connect jdbc:mysql://127.0.0.1:3306/tmp \
    --table employees \
    --hive-table emps \

Sqoop import

一个完整的 Sqoop 命令需要由多个参数拼接而成，包括 jdbc-uri、输入输出的数据存储地址、用户名、密码以及更为复杂的可选参数等等。

Import 参数

数据导入是个相对复杂的操作，所以参数也比单纯建表多一些：

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of statement.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>` option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

说说几个比较关键的参数解读

–split-by

若源数据表中没有制定主键，将会导致执行失败，若是指定 --split-by 参数，程序便可根据所指定的字段进行 map reduce 切分。当然，由于 --split-by 指定的字段可能是无序甚至重复或不连续的，所以做分片的时间会非常久。

–as-parquetfile

textfile
avrodatafile
sequencefile
parquetfile

这个参数和其他几个参数一样，主要是选择存储文件的格式

关于 Hive 文件格式的区别，可以参考这篇文章

Apache Hive Different File Formats:TextFile,SequenceFile,RCFile,AVRO,ORC,Parquet

–compression-codec

选择压缩解码器，若是选用 Snappy Compression 可使用这个参数

1	--compression-codec org.apache.hadoop.io.compress.SnappyCodec

可参考 CDH 的官方文档

Using Snappy Compression in Sqoop 1 and Sqoop 2 Imports

–fields-terminated-by

字段分隔符，数据传输难免会遇到字符串类型，这便会造成很多令你感到奇怪的事情

比如说字符串中自带换行符或其他字符，这些字符可能你看的见，也可能看不见

导致的后果是行数多了几行或者翻了几倍，这种情况很大可能就是换行符搞的鬼

这时你可以尝试使用下面的参数，就有很大可能可以解决你的燃眉之急

1 2	--fields-terminated-by '\001' --fields-terminated-by '\0001'

关于换行符对 Hive 造成影响的原因和解决方案强烈推荐参阅这篇文章

How a newline can ruin your Hive

–lines-terminated-by

有列分隔符便有行分隔符，直接用 \n 就行了，能 hold 住大部分场景，这也是 Sqoop 默认的选项

1	--lines-terminated-by '\n'

–hive-overwrite

有了这个参数，就可以覆盖同步了，免去了手动 drop 表的操作

示例

可以做为模板使用

$ sqoop import --connect jdbc:mysql://127.0.0.1:3306/tmp   \
  --username amos   \
  --password amos_tmp   \
  --table user_account_statement   \
  --hive-import  --hive-overwrite   \
  --hive-database tmp   \
  --hive-table user_account_statement   \
  --hive-drop-import-delims --fields-terminated-by '\001' --lines-terminated-by '\n'  \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec   \
  --as-parquetfile

自由格式查询导入

相比上面的传统做法，还有一种更猛的，可以实现数据的高度定制。

就是使用 SQL 来实现结果集导入，其实上面部分参数的存在意义也就是拼凑出一段 SQL 而已。

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE a.create_time>1542097292' \
  --split-by a.id \
  --target-dir /user/amos/joinresults