Hi!
I’m experiencing some issues training a custom SquiggleNet model. Would it be possible to check whether I’m doing something wrong?
Classification using your pretrained models, on human and E. coli R9 reads
First I tried running SquiggleNet inference using your pretrained models, on human and E. coli nanopore R9 data. This gave very nice results: an accuracy of 83-86% (depending on the model).
I did assume that human = 1 and bacterial = 0 in these models. Is this correct?
Classification using a custom model, on human and SARS-CoV2 R9 reads
Then I tried training a model using human and SARS-CoV2 data. Classifying the test data with this custom model resulted in an accuracy of only 3%.
I am not sure whether I executed your scripts correctly. It would be very much appreciated if you could check whether I ran them as intended.
Splitting up the data into training, validation and test datasets
- I used an equal number of target (= SARS-CoV2) and non-target (= human) reads (269507 reads)
- 80% of the reads were randomly selected and allocated to the training dataset, another 10% were allocated to the validation dataset, and the remaining 10% were allocated to the test dataset
- the remaining reads (not included in the 269507 reads I started with) were also allocated to the test dataset
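The split described above can be sketched roughly as follows. This is only an illustration, not the exact code I ran: the function name, the seed and the toy read IDs are made up for the example.

```python
import random


def split_read_ids(read_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle read IDs and split them into train / validation / test lists.

    Hypothetical helper for illustration; the fractions mirror the
    80% / 10% / 10% split described above.
    """
    ids = list(read_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test


if __name__ == "__main__":
    toy_ids = [f"read_{i}" for i in range(1000)]  # toy IDs, not real reads
    train, val, test = split_read_ids(toy_ids)
    print(len(train), len(val), len(test))
```

The same function would be applied separately to the target and the non-target read IDs, so that each subset keeps an equal number of reads per class.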
Preprocessing the training data
- `python ./SquiggleNet/preprocess.py -gp sarscov2_train_readids.txt -gn human_train_readids.txt -i fast5_human_and_sarscov2 -o outfolder_train`
- this resulted in 21 `neg_*.pt` and 15 `pos_*.pt` data batches
Preprocessing the validation data
- `python ./SquiggleNet/preprocess.py -gp sarscov2_val_readids.txt -gn human_val_readids.txt -i fast5_human_and_sarscov2 -o outfolder_val`
- this resulted in 2 `neg_*.pt` and 1 `pos_*.pt` data batches
Training a custom model
- I was confused about how the `trainer.py` script should be executed. The preprocessing script produced multiple PyTorch tensor files, but only one file can be specified at a time when running the trainer script, I think?
- I used 15 batches (one target and one non-target batch per iteration) to train the model. After the first iteration, I used the `--intermediate` option to fine-tune the model from the previous iteration. Is this how the trainer script is intended to be used?
- `python ./SquiggleNet/trainer.py -tt outfolder_train/pos_10000.pt -nt outfolder_train/neg_10000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -o trainedModel_b1.ckpt -e 3`
- `python ./SquiggleNet/trainer.py -tt outfolder_train/pos_20000.pt -nt outfolder_train/neg_20000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -i ./trainedModel_b1.ckpt -e 3 -o trainedModel_b2.ckpt`
- […]
- `python ./SquiggleNet/trainer.py -tt outfolder_train/pos_150000.pt -nt outfolder_train/neg_150000.pt -tv outfolder_val/pos_10000.pt -nv outfolder_val/neg_10000.pt -i ./trainedModel_b14.ckpt -e 3 -o trainedModel_b15.ckpt`
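The chained calls above follow a pattern that can be written as a small loop. The sketch below only builds the command lines and does not run anything. It uses only the flags that appear in the commands above (`-tt`, `-nt`, `-tv`, `-nv`, `-i`, `-e`, `-o`), and it assumes that every batch file is named `pos_<k*10000>.pt` / `neg_<k*10000>.pt`, as in the first, second and last command. The function name and its parameters are hypothetical.

```python
def build_trainer_commands(n_batches, train_dir="outfolder_train",
                           val_dir="outfolder_val", batch_step=10000, epochs=3):
    """Return one trainer.py command (as an argument list) per training batch.

    Assumption: batch k is stored as pos_<k*batch_step>.pt / neg_<k*batch_step>.pt.
    From the second batch on, the previous checkpoint is passed via -i.
    """
    commands = []
    for k in range(1, n_batches + 1):
        suffix = k * batch_step
        cmd = [
            "python", "./SquiggleNet/trainer.py",
            "-tt", f"{train_dir}/pos_{suffix}.pt",
            "-nt", f"{train_dir}/neg_{suffix}.pt",
            # the same validation batch is reused in every iteration
            "-tv", f"{val_dir}/pos_{batch_step}.pt",
            "-nv", f"{val_dir}/neg_{batch_step}.pt",
        ]
        if k > 1:
            # continue training from the checkpoint of the previous batch
            cmd += ["-i", f"./trainedModel_b{k - 1}.ckpt"]
        cmd += ["-e", str(epochs), "-o", f"trainedModel_b{k}.ckpt"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_trainer_commands(15):
        print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` to execute the iterations one after another.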
Classifying the test data using the custom model
- `python ./SquiggleNet/inference.py -m trainedModel_b16.ckpt -i fast5_human_and_sarscov2_testdata/ -o classification_results_trainedModel_b16/`
- only 3% of the reads were allocated to the correct class
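Because I am unsure which class the scripts encode as 1 and which as 0, one sanity check is to compute the accuracy both as-is and with the predicted labels flipped. For a two-class problem the two values always sum to 1, so 3% as-is would correspond to 97% with flipped labels. The sketch below uses made-up toy labels and hypothetical function names. It is not part of SquiggleNet and does not parse the real `inference.py` output.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of reads whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def accuracy_both_conventions(true_labels, predicted_labels):
    """Accuracy with the predictions as-is and with 0/1 swapped."""
    as_is = accuracy(true_labels, predicted_labels)
    flipped = accuracy(true_labels, [1 - p for p in predicted_labels])
    return as_is, flipped


if __name__ == "__main__":
    # toy labels only; 1 and 0 stand for the two classes in either order
    true_labels = [1, 1, 0, 0]
    predicted = [0, 0, 1, 0]
    as_is, flipped = accuracy_both_conventions(true_labels, predicted)
    print(f"as-is: {as_is:.2f}, flipped: {flipped:.2f}")
```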
Any help would be greatly appreciated!