User:Brime1977

Input features for prediction

A classification algorithm requires a fixed-length set of input features for training, so a strategy is needed to encapsulate the global information about peptides of variable length in a fixed-length format. This format was obtained from the variable-length peptide sequences using amino acid composition (vector size 20), dipeptide composition (vector size 400) and binary profile of pattern (vector size 20 for each residue). The formulae to calculate these features are described elsewhere [19,33].
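The three encodings can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, compositions are expressed as percentages (an assumption, since the formulae are only cited here), and the binary profile is shown for a whole sequence, so a fixed-length window would still have to be chosen to obtain a fixed-length vector.

```python
from itertools import product

# 20 standard residues in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Percentage of each of the 20 residues in seq (vector of length 20)."""
    n = len(seq)
    return [100.0 * seq.count(aa) / n for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each of the 400 overlapping dipeptides (vector of length 400)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [100.0 * pairs.count(dp) / total for dp in DIPEPTIDES]


def binary_profile(seq):
    """One-hot encode each residue: 20 values per position, concatenated."""
    profile = []
    for residue in seq:
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile
```

For a peptide of any length, the first two functions always return vectors of length 20 and 400, whereas `binary_profile` returns 20 values per residue it is given.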

Two sample logos

The two sample logos were generated using the online Two Sample Logo software [34]. These logos show the position-specific frequency of amino acids in a peptide. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of a stack indicates the sequence conservation at that position, while the height of each symbol within the stack indicates the relative frequency of that amino acid at that position.

Quantitative Matrix

Positional preferences of residues have previously been represented in the literature in the form of quantitative matrices (QMs) [15]. QMs show the propensity of each amino acid, dipeptide or property at each position in the positive (hemolytic) and negative (non-hemolytic) datasets. Three types of QMs were generated:

Single residue-based

For each position, the fraction of each amino acid is calculated for both the positive and negative datasets. The final propensity value in each cell is the fraction of a particular residue in the positive dataset minus the fraction of that residue in the negative dataset. This yields a matrix of dimensions N × M, where N is the number of rows (one per residue type) and M is the number of columns (positions; only the first 30 positions are taken into account). In HemoPI, the single amino acid QM therefore has dimensions 20 × 30.
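The calculation can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the fraction at each position is taken over the peptides that are long enough to have that position, a normalisation the text does not specify.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_POSITIONS = 30  # only the first 30 positions are considered


def position_fractions(peptides, n_positions=MAX_POSITIONS):
    """Return a 20 x n_positions table of residue fractions per position."""
    table = [[0.0] * n_positions for _ in AMINO_ACIDS]
    for pos in range(n_positions):
        # Assumption: peptides shorter than this position do not contribute.
        residues = [p[pos] for p in peptides if len(p) > pos]
        if not residues:
            continue
        for row, aa in enumerate(AMINO_ACIDS):
            table[row][pos] = residues.count(aa) / len(residues)
    return table


def residue_qm(positive, negative, n_positions=MAX_POSITIONS):
    """Propensity = fraction in positive set minus fraction in negative set."""
    pos_f = position_fractions(positive, n_positions)
    neg_f = position_fractions(negative, n_positions)
    return [[p - n for p, n in zip(p_row, n_row)]
            for p_row, n_row in zip(pos_f, neg_f)]
```

A positive cell means the residue is more frequent at that position in hemolytic peptides, and a negative cell means it is more frequent in non-hemolytic ones.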

Dipeptide-based

Here, for each position, the fraction of each possible dipeptide is calculated for both the positive and negative datasets, and the final propensity value is the fraction of a particular dipeptide in the positive dataset minus its fraction in the negative dataset. With 400 possible dipeptides and 29 overlapping dipeptide positions in the first 30 residues, the resulting matrix has dimensions 400 × 29.
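A dipeptide version of the matrix might look like the sketch below. It indexes rows by the 400 possible dipeptides of the 20-letter alphabet and uses the 29 overlapping dipeptide positions within the first 30 residues. The function names, and the choice to normalise by the number of peptides long enough to carry a dipeptide at that position, are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
ROW = {dp: i for i, dp in enumerate(DIPEPTIDES)}  # dipeptide -> row index


def dipeptide_fractions(peptides, n_residues=30):
    """Return a 400 x (n_residues - 1) table of dipeptide fractions."""
    n_positions = n_residues - 1
    table = [[0.0] * n_positions for _ in DIPEPTIDES]
    for pos in range(n_positions):
        pairs = [p[pos:pos + 2] for p in peptides if len(p) >= pos + 2]
        for dp in pairs:
            if dp in ROW:  # ignore pairs containing non-standard residues
                table[ROW[dp]][pos] += 1.0 / len(pairs)
    return table


def dipeptide_qm(positive, negative, n_residues=30):
    """Propensity = fraction in positive set minus fraction in negative set."""
    pos_f = dipeptide_fractions(positive, n_residues)
    neg_f = dipeptide_fractions(negative, n_residues)
    return [[p - n for p, n in zip(p_row, n_row)]
            for p_row, n_row in zip(pos_f, neg_f)]
```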

Property-based

Here, for each position, the fraction of residues falling into a particular category of physicochemical properties (11 categories in total) is calculated for both the positive and negative datasets. The final propensity value is the fraction of residues in a particular class in the positive dataset minus the corresponding fraction in the negative dataset. The resulting matrix therefore has dimensions 11 × 30. All three types of matrices were calculated for both the HemoPI-1 and HemoPI-2 datasets.
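The property-based matrix follows the same pattern, with residue classes in place of individual residues. In the sketch below the three classes are purely illustrative placeholders: the 11 categories used in the study are not listed in this section, so `PROPERTY_CLASSES` and the function names are hypothetical.

```python
# Illustrative classes only; NOT the 11 categories used in the study.
PROPERTY_CLASSES = {
    "basic": set("KRH"),
    "acidic": set("DE"),
    "aromatic": set("FWY"),
}


def property_fractions(peptides, n_positions=30):
    """Fraction of residues belonging to each class at each position."""
    table = {name: [0.0] * n_positions for name in PROPERTY_CLASSES}
    for pos in range(n_positions):
        # Assumption: peptides shorter than this position do not contribute.
        residues = [p[pos] for p in peptides if len(p) > pos]
        if not residues:
            continue
        for name, members in PROPERTY_CLASSES.items():
            hits = sum(1 for r in residues if r in members)
            table[name][pos] = hits / len(residues)
    return table


def property_qm(positive, negative, n_positions=30):
    """Propensity = class fraction in positive set minus that in negative set."""
    pos_f = property_fractions(positive, n_positions)
    neg_f = property_fractions(negative, n_positions)
    return {name: [p - n for p, n in zip(pos_f[name], neg_f[name])]
            for name in PROPERTY_CLASSES}
```

With the full set of 11 classes in `PROPERTY_CLASSES`, the result would have one row of 30 values per class, matching the 11 × 30 shape described above.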

Hybrid approach

For better and biologically reliable prediction, we integrated a motif-based approach with the machine learning-based method. We used the “Motif—EmeRging and with Classes—Identification” (MERCI) software [35] to extract motifs exclusively present in hemolytic and non-hemolytic peptides. For the HemoPI-1 dataset, we extracted only motifs exclusive to hemolytic peptides, because its negative dataset comprises random peptides from Swiss-Prot; for the HemoPI-2 dataset, motifs exclusive to hemolytic as well as non-hemolytic peptides were extracted. The motifs were extracted only from the main datasets and not from the validation datasets. To retain only relevant motifs, we used those present in at least ten sequences in the dataset. To classify a test sequence, we added +1 to the SVM score if it contained a motif exclusive to hemolytic peptides and −1 if it contained a motif exclusive to non-hemolytic peptides.
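The score adjustment can be illustrated with a short sketch. Several simplifications are assumed here: motifs are treated as plain substrings (MERCI motifs can be more general), the adjustment is applied once per class however many motifs match, and the decision threshold of 0 is a placeholder. The function names and the example motif are hypothetical.

```python
def hybrid_score(svm_score, sequence, hemolytic_motifs, non_hemolytic_motifs=()):
    """Shift an SVM score by +1 / -1 when a class-exclusive motif is present."""
    score = svm_score
    # Assumption: a motif "is present" if it occurs as a plain substring.
    if any(motif in sequence for motif in hemolytic_motifs):
        score += 1.0
    if any(motif in sequence for motif in non_hemolytic_motifs):
        score -= 1.0
    return score


def predict(score, threshold=0.0):
    """Label a peptide from its (hybrid) score; the threshold is an assumption."""
    return "hemolytic" if score > threshold else "non-hemolytic"
```

For example, a peptide with an SVM score of -0.4 that contains a hemolytic-exclusive motif would move to 0.6 and be labelled hemolytic at a threshold of 0. For a HemoPI-1 style setting, `non_hemolytic_motifs` would simply be left empty.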

Performance measures

A five-fold cross-validation technique was used to evaluate the classification of hemolytic peptides on the main datasets. For performance evaluation on the validation datasets, the model was developed on the main dataset and tested on the validation dataset. Standard performance parameters, namely sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC), were used to evaluate the performance of the method. The formulae to calculate these parameters are described elsewhere.
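For reference, the four measures can be computed from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) using their standard definitions. The sketch below is illustrative: the function name is hypothetical, and reporting Sn, Sp and Acc as percentages is an assumption.

```python
import math


def performance_measures(tp, tn, fp, fn):
    """Standard Sn, Sp, Acc (in percent) and MCC from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)                     # sensitivity
    sp = 100.0 * tn / (tn + fp)                     # specificity
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

In a five-fold cross-validation these counts would typically be accumulated over, or averaged across, the five held-out folds.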