Transformers - transformers 3.0.0 documentation
pip install transformers
https://huggingface.co/transformers/glossary.html#model-inputs
Example input_ids: [101, 232, 444, 8787, 102, 11112, 2031, 999]
Special token IDs:
CLS: 101
SEP: 102
PAD: 0
Example attention_mask: [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
(The trailing positions are PAD and should not be attended to.)
Example token_type_ids: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
(0 marks tokens from the first sentence, 1 marks the second.)
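These IDs need not be hard-coded: the tokenizer exposes them as attributes. A minimal sketch to verify them:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.cls_token_id)  # 101
print(tokenizer.sep_token_id)  # 102
print(tokenizer.pad_token_id)  # 0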
https://huggingface.co/transformers/model_doc/bert.html#berttokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# strings -> token IDs (plus masks), in one call
encoded_dict = tokenizer.encode_plus(
    "the dog meow",              # First sentence to encode.
    "the cat meow",              # Second sentence.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=64,               # Pad & truncate all sentences.
    pad_to_max_length=True,      # Padding.
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
)
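encoded_dict behaves like a dict holding the three tensors discussed above; a quick sketch of inspecting them (shapes follow from max_length=64 and return_tensors='pt'):

print(encoded_dict['input_ids'].shape)  # torch.Size([1, 64])
print(encoded_dict['attention_mask'])   # 1 for real tokens, 0 for PAD
print(encoded_dict['token_type_ids'])   # 0 for the first sentence, 1 for the second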
Configuring BERT: https://huggingface.co/transformers/model_doc/bert.html#bertconfig
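BertConfig can also build a randomly initialized BERT with a custom architecture instead of loading pretrained weights; a minimal sketch (the sizes below are arbitrary examples):

from transformers import BertConfig, BertModel

config = BertConfig(hidden_size=256,        # default 768
                    num_hidden_layers=4,    # default 12
                    num_attention_heads=4)  # default 12; must divide hidden_size
model = BertModel(config)  # random weights with the custom shape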
The BERT model: https://huggingface.co/transformers/model_doc/bert.html#bertmodel
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
last_hidden_state, pooler_output = bert(input_ids,
                                        token_type_ids=token_type_ids,
                                        attention_mask=attention_mask)
# last_hidden_state[:, 0, :] is the hidden state of the [CLS] token
# last_hidden_state has shape (batch, sequence length, hidden dim)
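A minimal sketch wiring the two steps together, reusing encoded_dict from the encode_plus call above (shapes assume max_length=64 and bert-base's hidden size of 768):

import torch

with torch.no_grad():  # inference only, no gradients needed
    last_hidden_state, pooler_output = bert(
        encoded_dict['input_ids'],
        token_type_ids=encoded_dict['token_type_ids'],
        attention_mask=encoded_dict['attention_mask'])

cls_embedding = last_hidden_state[:, 0, :]  # the [CLS] vector
print(last_hidden_state.shape)  # torch.Size([1, 64, 768])
print(cls_embedding.shape)      # torch.Size([1, 768])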
>>> import torch
>>> from transformers import BertTokenizer, BertModel
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel.from_pretrained('bert-base-uncased')
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple
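Continuing the same example, the second element of the output tuple is the pooled output (the [CLS] hidden state passed through a trained dense layer with tanh):

>>> pooler_output = outputs[1]  # pooled [CLS] representation
>>> pooler_output.shape
torch.Size([1, 768])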