Submitted by inFamous_16 t3_11hq1ga in deeplearning
I have tweets and the task is to perform text classification. I have already learned token embeddings for the tokens in each tweet via a graph-based NN model. Now I want to use those token embeddings to represent each tweet, but the issue is that every tweet ends up with a different-size representation if I just concatenate them. Is there any way to input variable-length embeddings to a pre-trained BioBERT (or, if not, any other BERT) model and still perform the classification task?
I_will_delete_myself t1_jauuhhi wrote
You add padding: pad every sequence in a batch to the same length with zero vectors, and pass an attention mask so the model ignores the padded positions.
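A minimal sketch of that idea, using plain Python lists (no framework). `pad_batch` is a hypothetical helper name; each tweet is assumed to be a list of per-token embedding vectors, possibly of different lengths. We pad to the batch maximum with zero vectors and build a 1/0 attention mask, which is the format BERT-style models expect:

```python
def pad_batch(batch):
    """Pad a batch of variable-length embedding sequences.

    batch: list of tweets, each a list of embedding vectors (lists of floats).
    Returns (padded, mask): padded has shape [B, max_len, dim],
    mask has shape [B, max_len] with 1 = real token, 0 = padding.
    """
    max_len = max(len(tweet) for tweet in batch)
    dim = len(batch[0][0])
    pad_vec = [0.0] * dim
    padded, mask = [], []
    for tweet in batch:
        n = len(tweet)
        padded.append(tweet + [pad_vec] * (max_len - n))
        mask.append([1] * n + [0] * (max_len - n))
    return padded, mask

# Example: two tweets with 3 and 1 tokens, embedding dim 2.
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8]],
]
padded, mask = pad_batch(batch)
# padded[1] == [[0.7, 0.8], [0.0, 0.0], [0.0, 0.0]]
# mask == [[1, 1, 1], [1, 0, 0]]
```

From there you would convert `padded` and `mask` to tensors and pass them to a Hugging Face BERT via the `inputs_embeds=` and `attention_mask=` keyword arguments (BERT models accept `inputs_embeds` in place of `input_ids`), assuming your graph embeddings match the model's hidden size or go through a projection layer first.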