tokenizer.batch_encode_plus

淡淡的烟草味﹌ · 2022-09-05 12:59

The inline comments show each statement's output.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('C:\\Users\\lgy\\Desktop\\fsdownload\\bert-base-uncased')

print(tokenizer.mask_token)                   # [MASK]
print(tokenizer.convert_tokens_to_ids('a'))   # 1037
print(tokenizer.convert_ids_to_tokens(1037))  # a

string = "test batch encode plus"
strings = [string, string]

tokens = tokenizer.tokenize(string)
print(tokens)  # ['test', 'batch', 'en', '##code', 'plus']

# Truncate sequences longer than max_length, pad shorter ones
out = tokenizer.batch_encode_plus(strings, max_length=10,
                                  padding='max_length',
                                  truncation='longest_first')
print(out)
# {'input_ids': [[101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0],
#                [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]],
#  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
#  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
#                     [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
```
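To make the padding/truncation behavior concrete, here is a minimal toy sketch (not the real Hugging Face implementation) of what `padding='max_length'` with truncation does to a batch of token-id lists: long sequences are cut to `max_length`, short ones are padded with the pad id, and the attention mask marks real tokens with 1 and padding with 0. The function name `toy_batch_encode` is invented for illustration.

```python
def toy_batch_encode(batch_ids, max_length, pad_id=0):
    """Toy sketch of batch_encode_plus padding/truncation.

    Real BERT tokenization also inserts the [CLS]/[SEP] special tokens
    (ids 101/102 in the output above); here we assume the ids already
    include them.
    """
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                  # truncate the long ones
        pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * pad)  # pad the short ones
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Same ids as in the output above: [CLS] test batch en ##code plus [SEP]
batch = [[101, 3231, 14108, 4372, 16044, 4606, 102]] * 2
out = toy_batch_encode(batch, max_length=10)
print(out['input_ids'][0])       # [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]
print(out['attention_mask'][0])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```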