Machine learning is a powerful technique for analysing large-scale data and learning patterns, and it can deliver high accuracy with short processing times. In this work, a machine learning algorithm (multinomial logistic regression) is used to predict gene families from human DNA sequences. A total of 4380 sequences were converted into overlapping k-mers of length 6, producing 232 414 k-mers. The data set was split 80/20 into training and test sets, and the multinomial logistic regression model achieved an accuracy of 93.9% in predicting 6 gene families within 0.24 seconds. The model achieved a precision of 94.8%, a sensitivity of 93.9%, and an F1-score of 94%. The model developed in this study offers an alternative approach for medical professionals to gain insights into the genetic information carried within DNA segments. Accurate and efficient prediction of gene families can aid in understanding genetic characteristics and contribute to advancements in personalised medicine, diagnostics and genetic research.
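
The conversion of a sequence into overlapping k-mers of length 6 can be illustrated with a minimal sketch. The function name `kmers` and the example sequence are hypothetical and are not taken from the study; the sketch only assumes that a window of length k is moved along the sequence one base at a time, so that a sequence of length n yields n - k + 1 k-mers.

```python
def kmers(sequence: str, k: int = 6) -> list[str]:
    """Return all overlapping k-mers of a sequence (window step of one base)."""
    # A sequence shorter than k yields an empty list, since the range is empty.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 17-base sequence, used for illustration only.
example = "ATGCCCCAACTAAATAC"
words = kmers(example)

print(len(words))   # 17 - 6 + 1 = 12 k-mers
print(words[:3])    # ['ATGCCC', 'TGCCCC', 'GCCCCA']
```

The resulting k-mers of each sequence could then be treated as words and counted to form a feature vector for a classifier such as multinomial logistic regression; that step is not shown here.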