Submitted by inFamous_16 t3_11hq1ga in deeplearning
inFamous_16 OP t1_jav6112 wrote
Reply to comment by Jaffa6 in [R] Variable size input to pre-trained BERT model by inFamous_16
Ahhh... thank you! I wasn't aware of the concept of an attention mask. I also had one more doubt: since I already have tweet features of variable size after concatenation, is there a way to skip the tokenization step, given that I don't need it? I only need padding and the attention mask.
Jaffa6 t1_javl6ef wrote
No problem.
I believe that if you're using a BERT-esque model, you do indeed need to do "full" tokenisation (part of which is creating the attention mask and padding) because BERT expects its input to be a list of token indices. E.g. given the token mapping {"a": 1, "cow": 2, "cat": 3, "dog": 4}, tokenisation would turn "a cat" into [1, 3], which is the form BERT expects.
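Just to make that concrete, here's a toy version of that mapping in plain Python (the vocabulary is the made-up one above, not BERT's real ~30k-entry one):

```python
# Toy vocabulary from the example above -- BERT's real vocab is much larger
vocab = {"a": 1, "cow": 2, "cat": 3, "dog": 4}

def tokenise(text):
    """Split on whitespace and map each token to its index."""
    return [vocab[token] for token in text.split()]

print(tokenise("a cat"))  # [1, 3]
```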
And since BERT comes with a token mapping (due to pre-training), if you're just putting in your own features (say, number of likes and number of retweets), they'll quite possibly just get interpreted as random tokens if their numbers match up with known token indices.
If your features are already the right kind (tokenised text, with the resultant indices matching the correct BERT token indices), I suppose you could do truncation/padding yourself and feed that input directly to BERT.
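If you do go that route, the padding and attention mask are simple to build yourself. A rough sketch, assuming `sequences` already holds valid BERT token IDs and that 0 is the pad ID:

```python
import torch

def pad_and_mask(sequences, max_len, pad_id=0):
    """Truncate/pad each list of token IDs and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_len]                       # truncate to max_len
        mask = [1] * len(seq)                     # 1 = real token
        padding = [pad_id] * (max_len - len(seq))
        input_ids.append(seq + padding)
        attention_mask.append(mask + [0] * len(padding))  # 0 = padding
    return torch.tensor(input_ids), torch.tensor(attention_mask)

# e.g. [CLS] a cat [SEP] as IDs, padded out to length 8
ids, mask = pad_and_mask([[101, 1037, 4937, 102]], max_len=8)
```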
But it'll probably end up simpler and less error-prone to let BERT's tokeniser do it for you (e.g. via HuggingFace's `AutoTokenizer.from_pretrained('bert-base-uncased')`).
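For reference, that looks roughly like this (the tweet strings are just illustrative; padding and the attention mask come back for free):

```python
from transformers import AutoTokenizer

# Any BERT checkpoint works; "bert-base-uncased" is the standard English one
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["first tweet text", "a second, slightly longer tweet text"],  # illustrative inputs
    padding=True,       # pad to the longest sequence in the batch
    truncation=True,    # cut anything over the model's max length
    return_tensors="pt",
)
print(batch["input_ids"])       # token indices, padded to equal length
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding
```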
inFamous_16 OP t1_javmu8a wrote
Ohh ok... super clear. Thanks for your time! I will check this out.
Jaffa6 t1_javzwj6 wrote
No worries, shoot me a message if you need a hand!