TensorFlow Text Explained - 蒲公英云

Official site: https://github.com/tensorflow/text

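Introduction

TensorFlow Text provides a collection of text-related classes and ops for use with TensorFlow 2.0. The library can perform the preprocessing that text-based models regularly require, and it includes other features useful for sequence modeling that core TensorFlow does not provide.

The benefit of using these ops for text preprocessing is that they run inside the TensorFlow graph. You do not need to worry about tokenization at training time differing from tokenization at inference time, or about managing separate preprocessing scripts.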

Installation

pip install -U tensorflow-text

Eager Execution

TensorFlow Text is compatible with both TensorFlow eager mode and graph mode.

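import tensorflow as tf
import tensorflow_text as text
# Eager execution is enabled by default in TensorFlow 2.x, so no call is needed.
# tf.enable_eager_execution() is only required (and only available) on TensorFlow 1.x.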

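Unicode

Most ops expect strings to be encoded in UTF-8. If your input uses a different encoding, you can use the core TensorFlow transcode op to convert it to UTF-8. The same op can also coerce a string into structurally valid UTF-8 if the input may be invalid.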

docs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),
                    u'Sad☹'.encode('UTF-16-BE')])
utf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',
                                         output_encoding='UTF-8')
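
For intuition, the same transcoding can be sketched with Python's built-in codecs, without TensorFlow. This only illustrates what happens to the bytes; it is not how tf.strings.unicode_transcode is implemented. The helper name to_utf8 and the choice to replace invalid input with U+FFFD are assumptions made for this example.

```python
def to_utf8(raw: bytes, input_encoding: str) -> bytes:
    """Decode raw bytes from input_encoding and re-encode them as UTF-8.

    Invalid sequences become U+FFFD (the Unicode replacement character),
    so the result is always structurally valid UTF-8.
    """
    return raw.decode(input_encoding, errors="replace").encode("utf-8")


docs = [
    "Everything not saved will be lost.".encode("utf-16-be"),
    "Sad\u2639".encode("utf-16-be"),
]
utf8_docs = [to_utf8(d, "utf-16-be") for d in docs]
print(utf8_docs)

# Coerce structurally invalid UTF-8 (a stray 0xFF byte) into valid UTF-8.
repaired = to_utf8(b"Sad\xff", "utf-8")
print(repaired)
```

The second call shows the "force to valid UTF-8" case: the stray byte is swapped for the replacement character instead of raising an error.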

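Normalization

When dealing with different sources of text, it is important that the same words are recognized as identical. A common technique for case-insensitive matching in Unicode is case folding (similar to lower-casing). (Note that case folding internally applies NFKC normalization.)
We also provide Unicode normalization ops for transforming strings into a canonical representation of characters, with Normalization Form KC (NFKC) as the default.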

print(text.case_fold_utf8(['Everything not saved will be lost.']))
print(text.normalize_utf8(['Äffin']))
print(text.normalize_utf8(['Äffin'], 'nfkd'))
tf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)
tf.Tensor(['\xc3\x84ffin'], shape=(1,), dtype=string)
tf.Tensor(['A\xcc\x88ffin'], shape=(1,), dtype=string)
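
The difference between the two normalization forms can be checked with Python's standard unicodedata module. This is a sketch for intuition only, not the TF Text implementation. The input word uses the single-character "ffi" ligature (U+FB03), an assumption chosen so that both forms visibly change the input, and the case_fold helper is a hypothetical approximation of case folding.

```python
import unicodedata

# "A with diaeresis" followed by the one-character "ffi" ligature (U+FB03) and "n".
word = "\u00c4\ufb03n"

nfkc = unicodedata.normalize("NFKC", word)  # composed letter kept, ligature expanded
nfkd = unicodedata.normalize("NFKD", word)  # base letter + combining mark, ligature expanded

print(nfkc.encode("utf-8"))
print(nfkd.encode("utf-8"))


def case_fold(s: str) -> str:
    """Approximate case folding: NFKC-normalize, then apply str.casefold()."""
    return unicodedata.normalize("NFKC", s).casefold()


print(case_fold("Everything not saved will be lost."))
```

NFKC keeps the accented letter as one code point (two UTF-8 bytes), while NFKD splits it into a plain "A" plus a combining diaeresis, which is why the two byte strings differ.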

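Tokenization

Tokenization is the process of breaking a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation.

The main interfaces are Tokenizer and TokenizerWithOffsets, which have a single method each: tokenize and tokenize_with_offsets, respectively.

Multiple tokenizers are available. Each of them implements TokenizerWithOffsets (which extends Tokenizer), which includes an option for getting byte offsets into the original string.

This lets the caller know which bytes of the original string a token was created from.

All of the tokenizers return RaggedTensors whose innermost dimension of tokens maps back to the original individual strings. As a result, the rank of the resulting shape is increased by one.

If you are unfamiliar with them, see the ragged tensor guide: https://www.tensorflow.org/guide/ragged_tensors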
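
To make the "rank increased by one" point concrete, here is a plain-Python sketch that uses nested lists in place of a RaggedTensor. It relies on str.split, which is only a stand-in for a real tokenizer.

```python
# Rank 1: a flat list of documents.
docs = ["everything not saved will be lost.", "Sad\u2639"]

# Rank 2: one row of tokens per document. Rows have different lengths,
# which is exactly what a ragged dimension represents.
tokens = [doc.split() for doc in docs]

row_lengths = [len(row) for row in tokens]
print(row_lengths)
```

The outer dimension still has one entry per input string; the new inner dimension holds that string's tokens.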

WhitespaceTokenizer

This is a basic tokenizer that splits UTF-8 strings on ICU-defined whitespace characters (such as space, tab, and newline).

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\xe2\x98\xb9']]

UnicodeScriptTokenizer

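This tokenizer splits UTF-8 strings at Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer. The most apparent difference is that it splits punctuation (USCRIPT_COMMON) from language text (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC), and it also separates text in different scripts from each other.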

tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]

Unicode split

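When tokenizing languages that do not use whitespace to segment words, it is common to simply split by character, which can be done with the unicode_split op found in core TensorFlow.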

tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())
[['\xe4\xbb\x85', '\xe4\xbb\x8a', '\xe5\xb9\xb4', '\xe5\x89\x8d']]
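
The byte strings in that output are simply the UTF-8 encodings of the individual characters. A plain-Python sketch of the same split follows; the helper name split_chars_utf8 is hypothetical and this is not the TensorFlow implementation.

```python
def split_chars_utf8(doc: bytes) -> list:
    """Split a UTF-8 byte string into one UTF-8 byte string per code point."""
    return [ch.encode("utf-8") for ch in doc.decode("utf-8")]


# The same four characters as in the example above, written as escapes.
chars = split_chars_utf8("\u4ec5\u4eca\u5e74\u524d".encode("utf-8"))
print(chars)
```

Each of these characters takes three bytes in UTF-8, so a four-character string becomes four 3-byte tokens.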

Offsets

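When tokenizing strings, it is often desirable to know where in the original string each token originated. For this reason, each tokenizer that implements TokenizerWithOffsets has a tokenize_with_offsets method that returns byte offsets along with the tokens. offset_starts lists the byte in the original string at which each token starts, and offset_limits lists the byte at which each token ends (one past its last byte).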

tokenizer = text.UnicodeScriptTokenizer()
(tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(
    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())
print(offset_starts.to_list())
print(offset_limits.to_list())
[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],
 ['Sad', '\xe2\x98\xb9']]
[[0, 11, 15, 21, 26, 29, 33], [0, 3]]
[[10, 14, 20, 25, 28, 33, 34], [3, 6]]
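
The relationship between tokens and offsets can be sketched in plain Python. The function below is a hypothetical stand-in that splits only on ASCII whitespace (a simplification compared with the ICU-based tokenizers), but it follows the same convention: slicing the original bytes from a start offset to the matching limit offset recovers the token.

```python
import re


def whitespace_tokenize_with_offsets(doc: bytes):
    """Return (tokens, starts, limits) for runs of non-whitespace bytes.

    starts[i] is the byte index where token i begins; limits[i] is the
    index one past its last byte, so doc[starts[i]:limits[i]] == tokens[i].
    """
    tokens, starts, limits = [], [], []
    for match in re.finditer(rb"\S+", doc):
        tokens.append(match.group())
        starts.append(match.start())
        limits.append(match.end())
    return tokens, starts, limits


doc = "Sad\u2639 but true".encode("utf-8")
tokens, starts, limits = whitespace_tokenize_with_offsets(doc)
print(tokens, starts, limits)

# Offsets count bytes, not characters: the first token has 4 characters
# but ends at byte 6 because the emoticon takes 3 bytes in UTF-8.
recovered = [doc[s:e] for s, e in zip(starts, limits)]
print(recovered == tokens)
```

Because the offsets are byte positions, apply them to the encoded bytes rather than to a decoded Python str when the text contains non-ASCII characters.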

TF.Data Example

Tokenizers work as expected with the tf.data API. A simple example is provided below.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],
                                           ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
# Dataset.make_one_shot_iterator() is not available in TensorFlow 2.x;
# in eager mode a dataset can be iterated directly.
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[['Never', 'tell', 'me', 'the', 'odds.']]
[["It's", 'a', 'trap!']]

Other Text Ops

TF.Text packages other useful preprocessing ops. We will review a couple below.

Wordshape

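A common feature used in some natural language understanding models is whether a text string has a certain property. For example, a sentence-breaking model might contain features that check for word capitalization or whether a punctuation character is at the end of a string.

Wordshape defines a variety of useful regular-expression-based helper functions for matching relevant patterns in your input text. Here are a few examples.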

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
# Is capitalized?
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
# Are all letters uppercased?
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
# Does the token contain punctuation?
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
# Is the token a number?
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)
print(f1.to_list())
print(f2.to_list())
print(f3.to_list())
print(f4.to_list())
[[True, False, False, False, False, False], [True]]
[[False, False, False, False, False, False], [False]]
[[False, False, False, False, False, True], [True]]
[[False, False, False, False, False, False], [False]]
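
To show the kind of check each shape performs, here is a rough plain-Python sketch. The patterns and function names are hypothetical simplifications written for this example (ASCII-oriented, and using Unicode general categories for punctuation and symbols); they are not the expressions TF Text uses internally.

```python
import re
import unicodedata

# Hypothetical patterns approximating a few word shapes for simple tokens.
TITLE_CASE = re.compile(r"^[A-Z][^A-Z]*$")          # leading capital, no other capitals
UPPERCASE = re.compile(r"^[^a-z]*[A-Z][^a-z]*$")    # at least one capital, no lowercase
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")         # optional sign, digits, optional fraction


def has_title_case(token: str) -> bool:
    return TITLE_CASE.match(token) is not None


def is_uppercase(token: str) -> bool:
    return UPPERCASE.match(token) is not None


def has_punct_or_symbol(token: str) -> bool:
    # General categories starting with "P" are punctuation, "S" are symbols.
    return any(unicodedata.category(ch)[0] in "PS" for ch in token)


def is_numeric(token: str) -> bool:
    return NUMERIC.match(token) is not None


tokens = [["Everything", "not", "saved", "will", "be", "lost."], ["Sad\u2639"]]
f1 = [[has_title_case(t) for t in row] for row in tokens]
f2 = [[is_uppercase(t) for t in row] for row in tokens]
f3 = [[has_punct_or_symbol(t) for t in row] for row in tokens]
f4 = [[is_numeric(t) for t in row] for row in tokens]
print(f1, f2, f3, f4, sep="\n")
```

Each function maps a token to a boolean, so applying it over the nested token lists yields a feature array with the same ragged shape as the tokens.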

N-grams & Sliding Window

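N-grams are sequences of consecutive words taken with a sliding window of size n. When combining the tokens, three reduction mechanisms are supported. For text, you would want to use Reduction.STRING_JOIN, which appends the strings to each other. The default separator character is a space, but this can be changed with the string_separator argument.

The other two reduction methods are most often used with numerical values: Reduction.SUM and Reduction.MEAN.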

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.',
                             u'Sad☹'.encode('UTF-8')])
# Ngrams, in this case bi-gram (n = 2)
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
print(bigrams.to_list())
[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]
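
The three reductions can be illustrated with a small plain-Python sliding window. The function names here (ngrams, string_join, mean) are hypothetical helpers written for this sketch, not the TF Text API; they only mirror the behavior described above.

```python
def ngrams(values, width, reduce):
    """Slide a window of `width` items over `values` and reduce each window."""
    return [reduce(values[i:i + width]) for i in range(len(values) - width + 1)]


def string_join(window, separator=" "):
    """Join the strings in a window, like a STRING_JOIN reduction."""
    return separator.join(window)


def mean(window):
    """Average the numbers in a window, like a MEAN reduction."""
    return sum(window) / len(window)


tokens = ["Everything", "not", "saved", "will", "be", "lost."]
bigrams = ngrams(tokens, 2, string_join)
print(bigrams)

# A row with fewer than `width` tokens produces no n-grams.
too_short = ngrams(["Sad\u2639"], 2, string_join)
print(too_short)

# Numeric reductions over the same kind of sliding window.
sums = ngrams([1, 2, 3, 4], 2, sum)
means = ngrams([1, 2, 3, 4], 2, mean)
print(sums, means)

# A custom separator in place of the default space.
joined = ngrams(tokens, 2, lambda w: string_join(w, "_"))
print(joined[0])
```

A sequence of k items yields k - n + 1 windows, which is why the six-token sentence gives five bigrams and the single-token row gives none.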