= "microsoft/deberta-v3-small"
chosen_pretrained_model
from transformers import AutoModelForSequenceClassification,AutoTokenizer
= AutoTokenizer.from_pretrained(chosen_pretrained_model) debv3_tokenizer
1. Import a Pretrained Language Model
1.1 Look Inside the Language Model
print(debv3_tokenizer)
2. Test out Tokenizer
= ("Hey all! What's going on? It's Tony from Sydney!")
test_string debv3_tokenizer.tokenize(test_string)
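The output is just a list of sub-word strings. DeBERTa v3 uses a SentencePiece vocabulary, so word-initial pieces carry a leading '▁'. A minimal sketch of mapping the pieces to the ids the model actually consumes (the exact pieces depend on the vocabulary):

```python
tokens = debv3_tokenizer.tokenize(test_string)        # sub-word pieces, e.g. ['▁Hey', '▁all', '!', ...]
ids = debv3_tokenizer.convert_tokens_to_ids(tokens)   # integer ids the model consumes
print(tokens[:5], ids[:5])
```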
3. Import Competition Data
Next, add the relevant competition data to your Kaggle “Input” folder.
The “Input” folder persists when you submit to the competition; all other folders created before submitting are disregarded.
3.1 Via GUI:
- On Kaggle, Go to [Add Data]
- Filter for “Competition Datasets”
- Search “US Patents”
- Click [Add Competition]
3.2 Programmatically:
Note: You’ll need your own GPU for this route; I don’t have one, so the rest of the notebook is run on the Kaggle website.
1. Have your Kaggle login + API keys ready locally, as explained in this post.
2. Run the code below to download the data locally.
from pathlib import Path
path = Path('us-patent-phrase-to-phrase-matching')
if not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)
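One gotcha with this route is credentials: by default the kaggle package reads an API token from ~/.kaggle/kaggle.json (that path is the Kaggle API’s documented default, not something specific to this notebook), so a quick check that it exists saves a confusing error:

```python
from pathlib import Path
creds = Path.home()/'.kaggle'/'kaggle.json'   # default location the kaggle package reads credentials from
assert creds.exists(), "Download your Kaggle API token and place it here first"
```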
3.3 Look Inside the Competition Data
path = Path('/kaggle/input/us-patent-phrase-to-phrase-matching')  # using the GUI places the competition data into the 'kaggle/input' folder

import pandas as pd
df = pd.read_csv(path/'train.csv')
df.describe(include='object')
4. Data Preparation
4.1 Create Input Column
Create a concatenated column from the important columns context, target and anchor.
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df['input']
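To make the concatenation concrete, printing the first row should give a single string along the lines of the comment below (the exact anchor, target and context values come from the competition data):

```python
print(df['input'][0])
# e.g. 'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement'
```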
4.2 Convert Pandas Dataframe to HuggingFace Dataset
from datasets import Dataset,DatasetDict
hf_datasets = Dataset.from_pandas(df)
hf_datasets
4.3 Tokenize our HuggingFace Dataset
Using the tokenizer, we can apply the pre-trained model’s tokenization to our new concatenated column.
A Hugging Face dataset behaves like a dictionary, so we can index into a column with dict['column'].
We can apply the tokenization with batching, which adds a few extra columns: input_ids, token_type_ids and attention_mask. It only took about 2 seconds!
def tok_func(x): return debv3_tokenizer(x["input"])
tok_ds = hf_datasets.map(tok_func, batched=True)
tok_ds
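A quick way to confirm the new columns is to inspect a single tokenized row, a sketch:

```python
row = tok_ds[0]
print(row.keys())   # original columns plus input_ids, token_type_ids and attention_mask
print(debv3_tokenizer.convert_ids_to_tokens(row['input_ids'])[:10])   # first few sub-word pieces of the 'input' text
```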
4.4 Rename the Columns to What HF Expects
tok_ds = tok_ds.rename_columns({'score':'labels'})
4.5 Training and Validation Sets
Split the tokenized Hugging Face dataset into training and validation sets, stored in a DatasetDict.
Note: the validation set here is called test, not validate.
tok_ds_dicts = tok_ds.train_test_split(0.25, seed=42)
tok_ds_dicts
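A quick check on where the rows ended up (with a 0.25 test fraction, roughly a quarter of the rows land in the 'test' split):

```python
print(len(tok_ds_dicts['train']), len(tok_ds_dicts['test']))   # ~75% / ~25% of the tokenized rows
```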
5. Data Modelling
5.1 Import libraries and set parameters
Import modules:
- TrainingArguments: takes in all the hyperparameters
- Trainer class: combines the TrainingArguments and the pre-trained model

Set the main hyper-parameters:
- Batch size: so batches fit on the GPU
- Number of epochs: for each ‘experiment’
- Learning rate: so training doesn’t fail
[“Future Iteration”]: More descriptions on these and other parameters in future posts.
from transformers import TrainingArguments,Trainer
bs = 128
epochs = 4
lr = 8e-5
5.2 Setup Training Arguments
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
5.3 Create Model
model = AutoModelForSequenceClassification.from_pretrained(chosen_pretrained_model,
    num_labels=1,
    ignore_mismatched_sizes=True)
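Setting num_labels=1 gives the model a single-output head, which the Trainer treats as a regression target; that matches the continuous similarity score we are predicting here.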
5.4 Create Metrics Functions
The metric is the Pearson correlation coefficient, computed with numpy.
import numpy as np
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
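As a quick sanity check on the metric itself (toy numbers, not competition data), well-aligned predictions should score close to 1:

```python
corr(np.array([0.0, 0.5, 1.0]), np.array([0.1, 0.4, 0.9]))   # roughly 0.99
```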
5.5 Create Trainer
trainer = Trainer(model, args,
    train_dataset=tok_ds_dicts['train'],
    eval_dataset=tok_ds_dicts['test'],
    tokenizer=debv3_tokenizer,
    compute_metrics=corr_d)
5.6 Do the Training
trainer.train()
6. Predictions
Now that we have a Trainer (similar to a Learner in fastai), we can use it on an unseen set of data, such as the test set, to make predictions.
6.1 Import Test Dataset
eval_df = pd.read_csv(path/'test.csv')
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)
6.2 Make Predictions
preds = trainer.predict(eval_ds).predictions.astype(float)
6.3 Clip Predictions
Some of the raw predictions fall outside the valid [0, 1] score range, so we clip them.
preds = np.clip(preds, 0, 1)
7. Submission
import datasets
submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)
8. Part 2
Actually, this submission won’t work as-is, because it is a Notebook competition where the internet is turned off.
What needs to be done is to convert this version into one that works without installing anything from the internet.
That would be in Part 2.