Transformers - transformers 3.0.0 documentation

pip install transformers

Model inputs

https://huggingface.co/transformers/glossary.html#model-inputs

Tokenizer

https://huggingface.co/transformers/model_doc/bert.html#berttokenizer

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Encode a sentence pair -> input ids, token type ids, attention mask
encoded_dict = tokenizer.encode_plus(
                        "the dog meow",    # First sentence to encode.
                        "the cat meow",    # Second sentence.
                        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'.
                        max_length = 64,                # Target length for all sentences.
                        truncation = True,              # Truncate to max_length.
                        pad_to_max_length = True,       # Pad to max_length.
                        return_attention_mask = True,   # Construct attention masks.
                        return_tensors = 'pt',          # Return PyTorch tensors.
                )
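For intuition about what `encode_plus` returns, here is a minimal pure-Python sketch (no transformers dependency; the vocabulary ids below are made up for illustration and are not the real WordPiece ids) of how a sentence pair becomes `input_ids`, `token_type_ids`, and an `attention_mask`:

```python
# Hypothetical vocabulary; real BERT ids come from the WordPiece vocab file.
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "the": 1996, "dog": 3899, "cat": 4937, "meow": 9004}

def encode_pair_sketch(sent_a, sent_b, max_length=12):
    tokens_a = sent_a.split()
    tokens_b = sent_b.split()
    # [CLS] A [SEP] B [SEP], as add_special_tokens=True would do
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment (token type) ids: 0 for the first sentence, 1 for the second
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[t] for t in tokens]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    type_ids += [0] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids,
            "token_type_ids": type_ids,
            "attention_mask": attention_mask}

enc = encode_pair_sketch("the dog meow", "the cat meow")
print(enc["input_ids"])       # 9 real ids followed by 3 [PAD] ids
print(enc["attention_mask"])  # 1s for real tokens, 0s for padding
```

The real tokenizer additionally handles subword splitting and truncation; this sketch only shows the special-token, segment-id, and padding layout.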

BERT

Configuring BERT (BertConfig): https://huggingface.co/transformers/model_doc/bert.html#bertconfig

BertModel: https://huggingface.co/transformers/model_doc/bert.html#bertmodel

from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
last_hidden_state, pooler_output = bert(input_ids,
                                        token_type_ids=token_type_ids,
                                        attention_mask=attention_mask)

# last_hidden_state has shape (batch, sequence_len, hidden_dim);
# last_hidden_state[:, 0, :] is the [CLS] token's hidden vector.
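A minimal sketch of that indexing, using nested lists in place of a real tensor: `last_hidden_state` has shape (batch, sequence_len, hidden_dim), and `[:, 0, :]` selects the hidden vector of the first token (the [CLS] position) for every sentence in the batch.

```python
batch, seq_len, hidden = 2, 4, 3
# Fake "last_hidden_state" with shape (batch, seq_len, hidden)
last_hidden_state = [[[float(b * 100 + t * 10 + h) for h in range(hidden)]
                      for t in range(seq_len)]
                     for b in range(batch)]

# Pure-Python equivalent of last_hidden_state[:, 0, :]:
# one [CLS] vector per sentence in the batch
cls_vectors = [sentence[0] for sentence in last_hidden_state]
print(cls_vectors)  # shape (batch, hidden)
```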

Example

>>> import torch
>>> from transformers import BertModel, BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> model = BertModel.from_pretrained('bert-base-uncased')

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
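The [CLS] vector is one way to get a sentence representation from `last_hidden_states`; another common choice is a mean of the token vectors weighted by the attention mask, so padding positions are ignored. A pure-Python sketch of the idea (a real implementation would use torch tensor operations):

```python
def masked_mean(last_hidden_state, attention_mask):
    """Average token vectors per sentence, skipping padding (mask == 0)."""
    pooled = []
    for vectors, mask in zip(last_hidden_state, attention_mask):
        kept = [v for v, m in zip(vectors, mask) if m == 1]
        n = len(kept)
        # Component-wise mean over the kept token vectors
        pooled.append([sum(vals) / n for vals in zip(*kept)])
    return pooled

hidden = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]  # (batch=1, seq=3, dim=2)
mask = [[1, 1, 0]]                               # last position is padding
print(masked_mean(hidden, mask))  # [[2.0, 3.0]]
```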