  • Research Article
  • Open access

Data-intensive analysis of HIV mutations

Abstract

Background

In this study, clustering was performed using a bitmap representation of HIV reverse transcriptase and protease sequences, to produce an unsupervised classification of HIV sequences. The classification will aid our understanding of the interactions between mutations and drug resistance. 10,229 HIV genomic sequences from the protease and reverse transcriptase regions of the pol gene and antiretroviral resistant related mutations represented in an 82-dimensional binary vector space were analyzed.

Results

A new cluster representation was proposed using an image inspired by microarray data, such that the rows in the image represented the protein sequences from the genotype data and the columns represented the presence or absence of mutations in each protein position. The visualization of the clusters showed that some mutations frequently occur together and are probably related to an epistatic phenomenon.

Conclusion

We described a methodology based on the application of a pattern recognition algorithm using binary data to suggest clusters of mutations that can easily be discriminated by cluster viewing schemes.

Background

The human immunodeficiency virus (HIV) shows extensive genetic variability that helps the selection of drug resistance mutations in response to antiretroviral therapy. Hence, it is important to understand the relationship between HIV genotype and phenotype (i.e., drug resistance) to increase the probability of treatment success.

Look-up tables [1,2] and rule-based systems [3,4] were developed by different groups to infer phenotypic resistance from HIV genomic sequences of infected patients who failed antiretroviral therapy. In Brazil, a look-up table [2] was developed and used by the Brazilian Ministry of Health AIDS program to help the decision-making process for antiretroviral salvage therapy (http://algoritmo.aids.gov.br/).

In Brazil, patients who fail antiretroviral therapy receive genotype tests for antiretroviral resistance through a network of laboratories [5]. This collection of HIV genomic sequences represents the variability of the HIV population in this country. With this extensive amount of data, questions arise as to whether it is possible to classify the sequences based on the occurrences of resistance-related mutations in the different amino acid positions, and whether it is possible to achieve a classification that expresses current knowledge of the relationship between mutations and drug resistance.

One possible way to answer these questions is to apply clustering algorithms on reverse transcriptase and protease sequences, to obtain clusters containing sequences that are similar. This similarity among the sequences may reveal some of the relationships among the mutations related to antiretroviral resistance.

Nonetheless, extraction of a simple and compact representation of the dataset is complex because of the number and size of sequences. The clusters thus generated may provide a representation that contributes to the understanding of the classification and the relationships between mutations.

In the present study, a pipeline (see Figure 1) was introduced to represent clusters inspired by microarray data, in which extensive amounts of data are available. Microarray data were used as inspiration because such applications typically contain large volumes of information on gene patterns from thousands of genes at once. Thus, clusters were represented in an image corresponding to a matrix, such that the rows in the image represented each protein sequence and the columns indicated the presence or absence of resistance-related mutations. This image enabled us to summarize the dataset without losing any information about clustering, permitting the observation of important characteristics of each cluster and enabling cluster comparison, thus providing insights into the data.

Figure 1

Pipeline summarizing the proposed framework. 1) Protease and reverse transcriptase sequences were gathered from patients from all over Brazil, 2) binarization of the sequences, 3) clustering of the mutations, 4) characterization of the clusters and 5) comparison with the Brazilian look-up-table predictions.

Previous studies have attempted to identify common protease and reverse transcriptase mutation patterns [6-15] (as shown in Tables 1, 2 and 3). However, many of them searched only for pairs of mutations and therefore could not find larger mutation patterns, which are known to exist [11,16-21]. Furthermore, often only subtype B virus sequences were used, although mutations occur with different probabilities in the different subtypes [22]. Some previous works also used a small number of protein positions, so not all mutation patterns in the data were found and results are more difficult to compare. Finally, the small datasets used in some of the related works do not represent all of the variability of the virus population and therefore also miss mutation patterns. As a result, there is no clear consensus on which mutation patterns in the protein sequences are important.

Table 1 Related works
Table 2 Related works
Table 3 Related works

Nonetheless, some patterns have been reported in previous works, such as the simultaneous presence of mutations at positions 30 and 88 of the protease [7,9-12,23], selected by nelfinavir [24]. The same applies to thymidine analog mutations (TAMs) in reverse transcriptase, which can be divided into TAM1 and TAM2 profiles [11,16-21]. The TAM1 profile presents mutations at codons 41, 210 and 215, whereas TAM2 presents mutations at codons 67, 70 and 219.

Such studies on mutation patterns are important because the co-existence of mutations may result in different antiretroviral resistance profiles. For example, a mutation can restore the fitness decrease from another mutation that confers drug resistance. However, some of the previous studies only investigated pairs of mutations, and most of them only analyzed subtype B HIV-1 sequences. Moreover, previous studies analyzed specific mutation profiles, making it difficult to compare results between different studies. Thus, mutation patterns have not been fully characterized in the protease and reverse transcriptase sequences. Characterization of these patterns may lead to a better understanding of the interactions among these mutations and to classification of the sequences.

In the present study, a large number of codons (38 from reverse transcriptase and 44 from protease, as shown in Table 4) from subtypes B, C and F were clustered, and the sequences were classified according to the mutation patterns. These clusters were compared with clusters reported in other studies.

Table 4 Protease and reverse transcriptase amino acid positions considered in the present study

Look up tables and rule-based systems

Based on genotype-phenotype correlation studies on laboratory HIV-1 isolates, genotype-phenotype correlations on clinical isolates and genotype-treatment history correlations [25], some efforts have been made to try to understand the relationship between HIV genotype and phenotype. For example, look-up tables [1,2,26] have been compiled using information from the scientific literature, which has been turned into rules in which the occurrences of mutations, or combinations of mutations, are correlated with drug resistance. In addition to look-up tables, some rule-based systems [3,4] have created scoring systems to calculate the likelihood of therapy failure, which are also based on published data. Look-up tables and rule-based systems are efforts to correlate the set of known mutations with the potential for drug resistance. Both represent current knowledge concerning the relationships between virus genotype and drug resistance and its application. Look-up tables and rule-based systems group mutations into clusters of mutations, thereby predicting the possible result of drug treatment.

Clustering

Similar to the classifications retrieved from look-up tables and rule-based systems, pattern recognition methods are designed to extract information from data to classify them. In cases where little prior information is available and the decision-maker must make as few assumptions as possible about the data, the clustering technique is useful [27].

By applying clustering algorithms to reverse transcriptase and protease sequences, clusters containing sequences that are similar to each other are obtained. The clusters may contain sequences with similar drug response patterns. Applying clustering algorithms and comparing the clusters with the classifications from look-up tables may lead to a better understanding of the relationship between genotype and phenotype.

In addition to providing comparisons with look-up tables, clusters also allow hypotheses regarding the occurrences of mutations to be formed. Such analysis can therefore show which mutations have a higher probability of occurring together and which may influence each other.

One of the best-known algorithms for clustering is K-means, which is popular because its time complexity is O(n), where n is the number of patterns [27]; that is, for a fixed number of clusters and iterations, the running time grows linearly with the number of sequences. This makes it a good choice when dealing with a large volume of data, which was the case here.

Methods

Pipeline

Figure 1 summarizes the methodology used in this work to analyze the protease and reverse transcriptase sequences. First, HIV genomic sequences from patients from 27 Brazilian states were extracted from the national database and binarized according to the presence or absence of mutations. The sequences were clustered and an image was created to represent the clusters. The clusters were characterized given the occurrence of mutations and compared with the prediction of drug resistance from the Brazilian look-up table.

The scripts created for data clustering (step 3) and cluster representation (steps 4 and 5) are available at http://www.ime.usp.br/~mcintho/.

Sequence representation

In the present study, 10,229 reverse transcriptase and protease sequences from HIV subtype B, 801 from subtype F and 424 from subtype C, were obtained from the national database. These samples were taken in accordance with the ethics standards of the Ethics Committee of the Federal University of São Paulo and with the Helsinki Declaration of 1975, revised in 1983. All biological samples were obtained in full accordance with signed informed consent forms (process number in research ethics committee 1433/09).

The Brazilian Guidelines for Resistance Testing allowed only one genotype testing for each patient at the time the sequences were generated; therefore, duplication of the sequences from the same patient was not expected.

To simplify the representation and comparison of the reverse transcriptase and protease sequences, bitmap mapping was used. In this technique, at each position, an amino acid identical to the one in the wild-type sequence was replaced with the value 0 and a different amino acid was replaced with the value 1, as previously described (Reuman et al. [8] and Melikian et al. [28]). Thus, the sequences could be interpreted as binary vectors in and 99 dimensional spaces (amino acids from reverse transcriptase and 99 from protease).
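
The bitmap mapping can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function name `binarize` and the short toy strings are assumptions, and real data would consist of aligned amino acid sequences of equal length.

```python
def binarize(sequence, wild_type):
    """Return a list of 0/1 flags: 1 where the residue differs from the wild type."""
    if len(sequence) != len(wild_type):
        raise ValueError("sequences must be aligned to the same length")
    return [0 if s == w else 1 for s, w in zip(sequence, wild_type)]


# Toy example: a hypothetical 8-residue fragment used only for illustration.
wild_type = "PQITLWQR"
sample = "PQVTLWKR"
bits = binarize(sample, wild_type)
print(bits)
```

Each sequence becomes one row of a binary matrix, which is the input to all later steps.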

When working with patterns of high dimensionality, the “curse of high dimensionality” must be avoided. The “curse of high dimensionality” makes all distances look alike in high dimensional spaces [29] and makes it difficult to evaluate similarity. One way to avoid it is to decrease the dimensionality of the data.

To escape the “curse of high dimensionality”, only the 38 positions from reverse transcriptase and the 44 positions from protease (Table 4) known to be related to drug resistance were analyzed [2,25].
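
Reducing the dimensionality then amounts to keeping only the columns for the chosen positions. The sketch below assumes 1-based amino acid numbering; the helper name `select_positions` and the five-position list are hypothetical stand-ins for the full list in Table 4.

```python
def select_positions(bits, positions):
    """Keep only the flags at the given 1-based amino acid positions."""
    return [bits[p - 1] for p in positions]


# Hypothetical subset of protease positions; the study's full list is in Table 4.
resistance_positions = [10, 30, 82, 88, 90]

full_vector = [0] * 99   # one flag per protease position
full_vector[29] = 1      # mutation at position 30
full_vector[87] = 1      # mutation at position 88

reduced = select_positions(full_vector, resistance_positions)
print(reduced)
```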

K-means

In an attempt to classify reverse transcriptase and protease sequences using a pattern recognition algorithm, we applied K-means from the R Project for Statistical Computing [30] to the 10,229 sequences. Sequences were divided according to HIV subtype and genomic region. Thus, K-means was used to search for clusters in the protease and reverse transcriptase sequences from subtypes B, C and F, separately. The algorithm was repeated 10 times for each of the datasets, with random centroids. The value of k, i.e. the number of clusters to be retrieved, ranged from 2 to 16.
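
The clustering step could look roughly like the following pure-Python sketch of Lloyd's K-means with random restarts. It is not the R code used in the study: the function names, the toy data, and the rule of keeping the run with the lowest within-cluster sum of squares are assumptions made for illustration (the text only states that the algorithm was repeated 10 times with random centroids).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, within-cluster sum of squares)."""
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means several times and keep the run with the lowest WSS (assumed rule)."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda r: r[2])


# Toy binary vectors with two obvious groups (not real sequence data).
points = [
    [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 1],
]
labels, centroids, wss = best_of_restarts(points, k=2)
print(labels, round(wss, 3))
```

In the setting described above, the same call would be repeated for every k from 2 to 16 and for each combination of subtype and protein.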

Cluster characterization

One problem that arose from generating the clusters was how to view and interpret them in the domain of HIV mutations, which was caused by the large number of sequences and amino acid positions used in our analysis. Images can be used to solve this problem because they provide an intuitive information visualization tool to support and validate the results, and to formulate and test hypotheses. When the research entails data-intensive analysis, the use of images becomes even more important, because the volume of data makes it difficult to manipulate and visualize the data directly. Thus, images can help in the analysis process and can summarize the data and results.

Therefore, to analyze the clusters, observe whether they followed any mutation patterns and to determine what these patterns might be, images of the clusters were created inspired by microarray data visualization. Binary images (i.e. black and white) represented the binary sequences featured as rows and the amino acid positions as columns: 44 columns for protease and 38 columns for reverse transcriptase. The sequences were grouped according to clusters and separated by blue lines. When a sequence had the value of 1 in an amino acid position, it would be represented by a black pixel, and when it had a value of 0, it would be represented by a white pixel. Six images were created for each value of k, combining the proteins and subtypes.
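
A minimal sketch of how such an image could be assembled is given below. It builds the pixel rows in memory and serializes them as a plain-text PPM file; the function names, the PPM format and the exact RGB values are assumptions, not the authors' implementation.

```python
WHITE, BLACK, BLUE = (255, 255, 255), (0, 0, 0), (0, 0, 255)


def cluster_image(vectors, labels):
    """Return pixel rows: sequences grouped by cluster, blue line between clusters."""
    order = sorted(range(len(vectors)), key=lambda i: labels[i])
    rows, previous = [], None
    for i in order:
        if previous is not None and labels[i] != previous:
            rows.append([BLUE] * len(vectors[i]))  # separator between clusters
        rows.append([BLACK if bit else WHITE for bit in vectors[i]])
        previous = labels[i]
    return rows


def to_ppm(rows):
    """Serialize pixel rows as a plain-text PPM (P3) image."""
    header = f"P3\n{len(rows[0])} {len(rows)}\n255\n"
    body = "\n".join(" ".join(f"{r} {g} {b}" for r, g, b in row) for row in rows)
    return header + body + "\n"


# Toy data: four sequences, three positions, two clusters.
vectors = [[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
labels = [1, 0, 1, 0]
rows = cluster_image(vectors, labels)
ppm_text = to_ppm(rows)
```

Writing `ppm_text` to a file with a `.ppm` extension yields an image that most viewers can open; one pixel per cell keeps even thousands of sequences within a single picture.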

The black and white pixels were useful for distinguishing the clusters, accentuating differences among them and describing them, as well as for summarizing the information within the sequences and clusters. They also helped to view the amino acid positions that represented and characterized the clusters.

To provide more details about the clusters, histograms were plotted for each cluster, for protease and reverse transcriptase, showing the percentage of sequences in the cluster with mutations in each position. Each bar in the histogram represented an amino acid position and the percentage of sequences in the cluster with a mutation at that position.
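
The values behind each histogram are simple per-cluster column percentages. The following sketch computes them from binary vectors and cluster labels; the function name and the toy data are illustrative assumptions.

```python
def mutation_percentages(vectors, labels):
    """Map each cluster label to the percentage of its sequences mutated per position."""
    by_cluster = {}
    for vec, lab in zip(vectors, labels):
        by_cluster.setdefault(lab, []).append(vec)
    return {
        lab: [100.0 * sum(col) / len(members) for col in zip(*members)]
        for lab, members in by_cluster.items()
    }


# Toy data: four sequences, three positions, two clusters.
vectors = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
labels = [0, 0, 1, 1]
percentages = mutation_percentages(vectors, labels)
print(percentages)
```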

To compare the clusters with the look-up table used to interpret the genotypic resistance from the Brazilian algorithm for resistance interpretation, another image was generated. The HIVDAG software [31] was used to create this other image. HIVDAG interprets the rules in the Brazilian look-up table in the context of the sequences and produces a prediction regarding antiretroviral resistance. The software classifies the sequences as resistant (R), intermediate resistant (I) and susceptible (S).

To represent the three possible results, red, yellow and green were used for resistance, intermediate resistance and susceptibility, respectively. Thus, as in the binary figure, the rows featured the protein sequences and the columns were the predictions for drug resistance given by the look-up table for that sequence.

In these colored images, vertical lines presenting a dominant color in each cluster indicated that the sequences in that cluster have the same drug resistance prediction. Clusters that showed red, yellow or green vertical lines in different positions indicated that there was some correspondence between the prediction of the look-up table and the K-means clusters.
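
The color coding and the notion of a dominant color per cluster can be sketched as follows. The RGB values, the function name and the toy predictions are assumptions; in the study the R/I/S calls come from HIVDAG applied to the Brazilian look-up table.

```python
from collections import Counter

# Assumed RGB values for resistant, intermediate and susceptible calls.
COLORS = {"R": (255, 0, 0), "I": (255, 255, 0), "S": (0, 128, 0)}


def dominant_predictions(predictions, labels):
    """For each cluster, give the most common call per drug column and its share."""
    by_cluster = {}
    for row, lab in zip(predictions, labels):
        by_cluster.setdefault(lab, []).append(row)
    summary = {}
    for lab, rows in by_cluster.items():
        summary[lab] = []
        for column in zip(*rows):
            call, count = Counter(column).most_common(1)[0]
            summary[lab].append((call, count / len(rows)))
    return summary


# Toy predictions: one string per sequence, one letter per drug (hypothetical).
predictions = ["RRS", "RRS", "RIS", "SSS", "SSS", "SIS"]
labels = [0, 0, 0, 1, 1, 1]
summary = dominant_predictions(predictions, labels)
pixel_row = [COLORS[call] for call in predictions[0]]
```

A share close to 1.0 for a column corresponds to a solid vertical line of one color within that cluster.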

Results and discussion

For distinct k values, the sequences were distributed in different clusters; black and white images were created for each combination of subtype, k value and protein. Figures 2 and 3 represent the clusters for subtype B, where k = 6, for protease and reverse transcriptase, respectively. The value k = 6 was chosen because it better represents the current knowledge of mutation occurrence and mutation relationships. For k = 6, both TAM groups and the mutation profile comprising substitutions at protease codons 30 and 88 are represented. Nonetheless, as k values progressed, the clusters were first divided into groups of sequences with many mutations and with few or no mutations. For each increase in the k value, the group with many mutations was repeatedly split, although stability and consistency were maintained.

Figure 2

Black and white figure of kmeans clusters for subtype B sequences of the HIV protease. The figure displays the different mutation patterns characterizing each subtype B protease cluster. The columns in the figure represent the amino acid positions selected for clustering and the rows represent the protein sequences. Blue lines delimit the six classes, the black pixels represent mutations and the white pixels represent the absence of mutations. The number identifying each cluster is on the left and the number of the sequences in the cluster is on the right.

Figure 3

Black and white figure of kmeans clusters for subtype B sequences of the HIV reverse transcriptase. The figure displays the different mutation patterns characterizing each subtype B reverse transcriptase cluster. The columns in the figure represent the amino acid positions selected for clustering and the rows represent the protein sequences. Blue lines delimit the six classes, the black pixels represent mutations and the white pixels represent the absence of mutations. The number identifying each cluster is on the left and the number of the sequences in the cluster on the right.

K-medoids has been used in a previous study [14] to cluster a smaller number of subtype B sequences. To evaluate this alternative clustering method, it was applied to the dataset described here. The K-medoids implementation available at [32] was adopted, and Figures 4 and 5 show the clustering results. As can be seen, the results are similar to those shown in Figures 2 and 3, except for clusters B6.4, B6.5 and B6.1 from protease and clusters B6.2 and B6.5 from reverse transcriptase. These clusters contain sequences that are predicted to be susceptible to most of the drugs and do not represent patterns of mutations. This difference probably arises because, although the two algorithms are related, k-medoids represents each cluster by a medoid (an actual data point that is most central to the cluster) instead of the mean [33]. Apart from these differences, both methods lead to similar results, which corroborates our findings.
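
The difference between the two kinds of cluster representative can be illustrated on a toy cluster of binary vectors. The sketch below is an assumption-laden illustration: it uses the Hamming distance as the dissimilarity for the medoid, which may differ from the implementation cited in [32].

```python
def centroid(points):
    """Mean vector, as used by K-means; usually not itself a member of the cluster."""
    return [sum(col) / len(points) for col in zip(*points)]


def medoid(points):
    """Member with the smallest total Hamming distance to all other members."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(points, key=lambda p: sum(hamming(p, q) for q in points))


# Toy cluster of three binary vectors.
cluster = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
mean_vector = centroid(cluster)
central_member = medoid(cluster)
print(mean_vector, central_member)
```

The mean contains fractional values that summarize mutation frequencies, whereas the medoid is always one of the observed sequences.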

Figure 4

Black and white figure of k-medoids clusters for subtype B sequences of the HIV protease. The figure displays the different mutation patterns characterizing each subtype B protease cluster. The columns in the figure represent the amino acid positions selected for clustering and the rows represent the protein sequences. Blue lines delimit the six classes, the black pixels represent mutations and the white pixels represent the absence of mutations.

Figure 5

Black and white figure of k-medoids clusters for subtype B sequences of the HIV reverse transcriptase. The figure displays the different mutation patterns characterizing each subtype B reverse transcriptase cluster. The columns in the figure represent the amino acid positions selected for clustering and the rows represent the protein sequences. Blue lines delimit the six classes, the black pixels represent mutations and the white pixels represent the absence of mutations.

To characterize the clusters, the histograms shown in Figures 6 and 7 for subtype B and k = 6, for protease and reverse transcriptase, respectively, were produced. These histograms display the percentage occurrence of mutations at each amino acid position for each cluster. The mutations that had higher percentages defined the clusters and determined which cluster the sequences belonged to. Those that had high frequencies in one cluster and low frequencies in the others enabled differentiation between the sequences and between the clusters. Additionally, the positions with higher frequencies of mutations in a cluster were those that occurred together more frequently, and their occurrences were considered as related.

Figure 6

Histogram showing the frequency of mutations in the protease kmeans clusters. Histograms containing the frequencies of mutations for each selected amino acid position in protease for each of the six clusters in subtype B at k=6. Each histogram represents one cluster found by K-means for k=6 in the protease sequences. Each bar in the histogram represents a protein position and the percentage of sequences in the cluster that contain a mutation at that position.

Figure 7

Histogram showing the frequency of mutations in reverse transcriptase kmeans clusters. Histograms containing the frequencies of mutations for each selected amino acid position in the reverse transcriptase for each of the six clusters in subtype B at k=6. Each histogram represents one cluster found by K-means for k=6 in the reverse transcriptase sequences. Each bar in the histogram represents a protein position and the percentage of sequences in the cluster that contain a mutation at that position.

To compare the clusters with the predictions of drug resistance given by the rules in the Brazilian look-up table, colored images were created. The images from the protease clusters (see Figure 8 at k = 6) showed division of the sequences into groups that were sensitive to the majority of the drugs and other groups that were resistant to the majority of the drugs. However, the reverse transcriptase clusters showed different combinations of predictions for different clusters, with similar predictions for sequences in the same cluster and different predictions for sequences in different clusters (see Figure 9).

Figure 8

Colored figure of the kmeans clusters for subtype B sequences of the HIV protease. The figure displays the predictions of drug resistance from the Brazilian look-up table for each cluster. The columns in the colored figure represent the drugs selected (ATV/R, DRV/R, FPV/R, IDV/R, LPV/R, SQV/R and TPV/R, in that order) and the rows represent the protein sequences. Black lines delimit the classes. The number identifying each cluster is on the left and the number of the sequences in the cluster is on the right.

Figure 9

Colored figure of the kmeans clusters for subtype B sequences of the HIV reverse transcriptase. The columns in the colored figure represent the nine drugs selected (3TC, ABC, AZT, d4T, ddI, TDF, EFV, ETV and NVP, in that order) and the rows represent the protein sequences. Black lines delimit the classes. The number identifying each cluster is on the left and the number of the sequences in the cluster is on the right.

As seen in Figures 2 and 3, the clusters had different mutation profiles for the two proteins. K-means successfully distinguished the sequences and grouped them according to the different mutations, indicating that it is possible to obtain a classification for HIV protein sequences using clustering algorithms, according to the occurrences of the mutations.

The different occurrence patterns for the mutations are emphasized in Figures 6 and 7, which show the distinct percentages of mutations present at each protein position and at each cluster for subtype B. Some positions are important for the characterization and description of the clusters, such as positions 10, 82 and 90 of the protease, and 67, 70 and 219 of the reverse transcriptase.

Additionally, K-means was able to produce clusters that correlated with different predictions of drug resistance, especially for the reverse transcriptase (see Figure 9). The figures show that although clusters were found for both proteins, reverse transcriptase clusters display more patterns of prediction of drug resistance. As variation in the protease gene is higher than in the reverse transcriptase gene in non-treated patients, the pathways for a strain to become resistant are more limited in reverse transcriptase than in protease. Therefore, we believe that the constraints on variation in the reverse transcriptase gene facilitate the detection of the clusters.

The results for subtypes C and F are summarized in Tables 5 and 6. Tables 5 and 6 also attempt to summarize the clusters and depict the essential information that is necessary to understand and compare them. In these tables, the amino acid positions of the proteins are presented for positions where more than 50% of the sequences in the cluster had mutations.
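
Selecting the positions that characterize a cluster is a simple threshold on the per-position percentages. The sketch below uses a strict "more than 50%" rule, following the sentence above (the table captions say "at least 50%"); the function name and the numbers are hypothetical.

```python
def characteristic_positions(percentages, position_names, threshold=50.0):
    """Return the positions whose mutation percentage exceeds the threshold."""
    return [name for name, pct in zip(position_names, percentages) if pct > threshold]


# Hypothetical percentages for one protease cluster at four positions.
position_names = [10, 30, 82, 88]
percentages = [12.0, 85.0, 50.0, 91.5]
selected = characteristic_positions(percentages, position_names)
print(selected)
```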

Table 5 Reverse transcriptase amino acid positions with mutations in at least 50% of the sequences by kmeans cluster
Table 6 Protease amino acid positions with mutations in at least 50% of the sequences by kmeans cluster

Tables 5 and 6 show that for the different subtypes, the mutations that characterized some clusters were similar. The clusters from sequences of subtypes B, C and F were similar in terms of the positions in each cluster that had higher frequencies of mutations, excluding positions that were mutated more frequently in a given subtype in this data set: for example, positions 15, 20, 36, 41, 69, 89 and 93 for subtype C in the protease; positions 15, 35, 36, 41 and 89 for subtype F in the protease; and position 211 for subtypes C and F in the reverse transcriptase. Moreover, the datasets for subtypes C and F were much smaller than the dataset for subtype B and thus might not represent all the variability in the subtypes. Subtype C differed more from subtypes B and F than these did from each other; however, there was still correspondence among the codons defining the clusters.

Correspondence among the clusters could be observed; for example, in protease clusters B6.2, C6.5 and F6.3, which had high percentages of sequences with mutations in positions 10, 54, 82 and 90 (as described in [10,16]) and clusters B6.3, C6.4 and F6.1 in positions 30 and 88 (as described in [7,9-12,23]). Reverse transcriptase clusters B6.3, B6.4, C6.5, C6.6 and F6.5 also showed correspondence and had high percentages of sequences with mutations in positions 67, 70 and 219 (as described in [6,9,25]) and clusters B6.6, C6.3 and F6.3 in positions 41, 67 and 210 (as described in [16]). Clusters B6.1, B6.4, B6.5, C6.2, C6.3, F6.4 and F6.6 from the protease and B6.2, B6.5, C6.1, C6.4, F6.2, F6.4 and F6.6 from the reverse transcriptase contained sequences with few mutations, and are probably susceptible to drugs.

Thus, the clusters suggested that mutations in codons 10, 54, 82 and 90, or in codons 30 and 88, in the protease are related and frequently occur together. In addition, mutations in codons 67, 70 and 219, or in codons 41, 67 and 210 in the reverse transcriptase frequently occur together. These patterns were also reported in previous studies [6,7,9-12,16,23,25] and will be important when investigating the genotype and phenotype (drug resistance) relationships and in designing new drugs.

Conclusion

In this work, a new approach to analyzing HIV mutation data was presented. Current classification schemes are based on rule-based systems and look-up tables that comprise data from scientific studies. The proposed framework is based on a bitmap representation that extracts information from protease and reverse transcriptase sequences and provides information on the interactions among mutations.

A new visualization scheme inspired by microarray data analysis was proposed to better understand the clusters in the HIV domains. The images produced were useful for viewing and comparing the clusters with binary vectors and large volumes of data. In our study, the black and white figures indicated the occurrence and absence of mutations in sequences in each cluster, respectively, thus highlighting the differences between the clusters.

To represent the genetic variability of the virus in a different way from previous works, a large number of sequences and protein positions were used, along with three different HIV-1 subtypes. In the analysis, sequences were clustered, and the clusters were characterized according to the mutation patterns that they represented. The clusters were compared with those clusters revealed by previously published studies, and with the current knowledge of mutation patterns.

Along with the large number of sequences and protein positions, the application of a binary representation for the sequences helped to define a simple measure of similarity. The choice of K-means as the algorithm for mutation pattern searching rendered the method suitable for larger data sets because of its time complexity. The use of the binary image also allowed the analysis of large data sets, as the information in the data is visualized more easily, as is the characterization of the clusters and the mutation patterns.

K-means obtained clusters of similar sequences representing different mutation profiles. The clusters showed that some mutations frequently occur together; these mutations are important for defining the clusters and are present in a large number of the sequences. The corresponding positions need to be taken into consideration when inferring drug resistance, because they affect a large number of patients.

Some interesting insights came from this clustering result. Notably, mutations in protease codons only produced clusters among non-B strains. Furthermore, as described previously, mutations at codons 89 and 90 in the protease do not cluster together [34], suggesting that methionine at positions 89 and 90 results in a protein structure that is not stable. Mutations at codons 30 and 90 may be selected by the protease inhibitor nelfinavir, but again, these two pairs of mutations do not appear together. It makes biological sense that once a replacement such as D30N is present, the N88D mutation is needed, because these two amino acids interact with each other in the protease protein [35]. It has been suggested that the pathway for resistance to nelfinavir will preferentially select the D30N complex among subtype B and exclusively the L90M complex among non-B subtypes [36]. However, we observed the D30N complex among clusters for subtypes B and F (Table 6). It is also interesting that major protease inhibitor mutations, such as those in codons 46, 82 and 90, frequently form clusters (Table 6).

Resistance pathways are the routes through which viruses acquire resistance mutations, and they are closely related to cross-resistance. TAM 1 and TAM 2 are well-defined distinct pathways for resistance, but we speculate that these are merely initiating pathways because we observed clusters for the reverse transcriptase with between three and six TAMs, thus augmenting levels of resistance and cross-resistance (Table 5). Interestingly, all clusters with resistance mutations show the 3TC-related mutation at codon 184 in the reverse transcriptase. When antiretroviral treatment with non-nucleoside reverse transcriptase inhibitors fails, a mutation at codon 103 will emerge in more than 50% of cases and 50% of these viruses will also harbor the mutation at codon 184 [37]. However, all clusters harboring 103 mutations were also accompanied by 184 mutations, suggesting that real-life virological failure is somehow different.

One interesting outcome from this cluster representation is their alleged relationships with previous exposure to specific antiretrovirals. In this sense, timing or the number of drug exposures, as well as the use of specific drugs, would suggest a specific selection of a cluster of mutations and imply possible resistance/cross resistance. The negative predictive value of a genotype result is low, meaning that the absence of a specific mutation or group of mutations does not mean that this mutation is not present in a minority population and is not present because of the selective pressure of current antiretrovirals being used. Therefore, the history of antiretroviral exposure and the projected profile of mutations can result in a more reliable future salvage therapy regimen.

Furthermore, protease inhibitors are designed according to the structure of the proteins; therefore, the clusters may help in designing future drugs for resistant strains.

In addition to antiretroviral resistance, understanding the mutation patterns is also useful in collaborative efforts to study immune escape pathways and in vaccine research. However, HIV mutation patterns can confound the determination of the immune escape mechanisms [38] that are relevant to vaccine research [39].

Our future work will include further validation of the clusters in the HIV domains and updating the current knowledge concerning mutations. We will also evaluate a recent approach to pattern recognition known as biclustering [40,41] for the protease and reverse transcriptase sequences. Biclustering algorithms seem to fit our purposes because they search the data matrix for submatrices that follow a given pattern, and they have been applied to large data sets, such as microarray data.
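To make the idea of searching for submatrices concrete, the sketch below looks for the largest all-ones submatrix with a fixed number of columns in a binary matrix (rows as sequences, columns as mutation positions). It is a brute-force illustration under stated assumptions, not one of the biclustering algorithms cited above; the function name `best_all_ones_bicluster` and the toy matrix are hypothetical.

```python
from itertools import combinations


def best_all_ones_bicluster(matrix, n_cols):
    """Brute-force search for the largest all-ones submatrix with n_cols columns.

    matrix -- list of 0/1 lists (rows = sequences, columns = mutation positions)
    Returns (row_indices, column_indices); both are empty if nothing is found.
    """
    width = len(matrix[0]) if matrix else 0
    best_rows, best_cols = [], ()
    for cols in combinations(range(width), n_cols):
        # Rows in which every selected column is set to 1.
        rows = [i for i, row in enumerate(matrix) if all(row[c] for c in cols)]
        if len(rows) > len(best_rows):
            best_rows, best_cols = rows, cols
    return best_rows, list(best_cols)


# Hypothetical toy matrix, not data from the study.
TOY = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]

rows, cols = best_all_ones_bicluster(TOY, 3)
print(rows, cols)
```

The number of column subsets grows combinatorially with the number of columns, so an exhaustive search of this kind is only practical for small examples; on an 82-dimensional binary space, a heuristic search would be needed.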

References

  1. Schinazi RF, Larder B, Mellors JW. Mutations in retroviral genes associated with drug resistance: 2000-2001 update. Int’l Antiviral News. 2000; 8:65–92.

  2. Brazilian algorithm. http://forrest.ime.usp.br:3001/resistencia. www.aids.gov.br.

  3. Lathrop RH. Knowledge-based avoidance of drug-resistant HIV mutants. In: Press A, editor. Proc. 15th Nat’l Conf. Artificial Intelligence 10th Conf. Innovative Applications of Artificial Intelligence, Menlo Park Calif. Madison, WI: AAAI Press: 1998. p. 1071–8.

  4. Shafer R, Jung DR, Betts BJ. Human immunodeficiency virus type 1 reverse transcriptase and protease mutation search engine for queries. Nat Med. 2000; 6(11):1290–2.

  5. DCF, S, MC, S, R, B, Fernandez JCC, SE, LA, I, Diaz R. The Brazilian network for HIV-1 genotyping (RENAGENO) external quality control assurance program (EQA). J Int AIDS Soc. 2011; 14(1):45.

  6. Sing T, Svicher T, Beerenwinkel N, Ceccherini-Silberstein F, Daumer M, Kaiser R, et al. Characterization of novel HIV drug resistance mutations using clustering, multidimensional scaling and SVM-based feature ranking. In Knowledge Discovery in Databases: PKDD. 2005; 3721:285–96.

  7. Liu Y, Eyal E, Bahar I. Analysis of correlated mutations in HIV-1 protease using spectral clustering. Bioinformatics. 2008; 24(10):1243–50.

  8. Reuman EC, Rhee S, Holmes SP, Shafer RW. Constrained patterns of covariation and clustering of HIV-1 non-nucleoside reverse transcriptase inhibitor resistance mutations. J Antimicrob Chemother. 2010; 65(7):1477–85.

  9. Rhee SY, Liu T, Ravela J, J, GM, Shafer RW. Distribution of human immunodeficiency virus type 1 protease and reverse transcriptase mutation patterns in 4,183 persons undergoing genotypic resistance testing. Antimicrob Agents Chemother. 2004; 48:3122–6.

  10. Wu TD, Schiffer CA, Gonzales MJ, Taylor J, Kantor R, Chou R, et al. Mutation patterns and structural correlates in human immunodeficiency virus type 1 protease following different protease inhibitor treatments. J Virol. 2003; 77(8):4836–47.

  11. Gonzales MJ, Wu TD, Taylor J, Belitskaya I, Kantor R, Israelski D, et al. Extended spectrum of HIV-1 reverse transcriptase mutations in patients receiving multiple nucleoside analog inhibitors. AIDS. 2003; 17:791–9.

  12. Hoffman NG, Schiffer CA, Swanstrom R. Covariation of amino acid positions in HIV-1 protease. Virology. 2003; 17:536–48.

  13. Alteri C, Svicher V, Gori C, D’Arrigo R, Ciccozzi M, Ceccherini-Silberstein F, et al. Characterization of the patterns of drug-resistance mutations in newly diagnosed HIV-1 infected patients naïve to the antiretroviral drugs. BMC Infectious Diseases. 2009; 9(1):111.

  14. Doherty KM, Nakka P, King BM, Rhee S, Holmes SP, Shafer RW, et al. A multifaceted analysis of HIV-1 protease multidrug resistance phenotypes. BMC Bioinf. 2011; 12:477.

  15. Heider D, Senge R, Cheng W, Hüllermeier E. Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction. Bioinformatics. 2013; 29(16):1946–52.

  16. Yahi N, Tamalet C, Tourres C, Tivoli N, Ariasi F, Volot F, et al. Mutation patterns of the reverse transcriptase and protease genes in human immunodeficiency virus type 1-infected patients undergoing combination therapy: survey of 787 sequences. J Clin Microbiol. 1999; 37(12):4099–106.

  17. Yahi N, Tamalet C, Tourres C, Tivoli N, Fantini J. Mutation L210W of HIV-1 reverse transcriptase in patients receiving combination therapy, incidence, association with other mutations, and effects on the structure of mutated reverse transcriptase. J Biomed Sci. 2000; 7:507–13.

  18. Hanna GJ, Johnson VA, Kuritzkes DR, Richman DD, Brown AJ, Savara AV, et al. Pattern of resistance mutations selected by treatment of human immunodeficiency virus type 1 infection with zidovudine, didanosine and nevirapine. J Infectious Diseases. 2000; 181:904–11.

  19. Marcelin AG, Delaugerre C, Wirden M, Viegas P, Simon A, Katlama C, et al. Thymidine analogue reverse transcriptase inhibitors resistance mutations profiles and association to other nucleoside reverse transcriptase inhibitors resistance mutations observed in the context of virological failure. J Med Virol. 2004; 72:162–5.

  20. Flandre P, Descamps D, Joly V, Meiffredy V, Tamalet C, Izopet J, et al. A survival method to estimate the time to occurrence of mutations: an application to thymidine analogue mutations in HIV-1-infected patients. J Infectious Diseases. 2004; 189:862–70.

  21. Wolf K, Walter H, Beerenwinkel N, Keulen W, Kaiser R, Hoffmann D, et al. Tenofovir resistance and resensitization. Antimicrob Agents Chemother. 2003; 47:3478–84.

  22. Tang MW, Shafer RW. HIV-1 antiretroviral resistance: scientific principles and clinical applications. Drugs. 2012; 72(9):1–25.

  23. Deforche K, Camacho R, Grossman Z, Silander T, Soares MA, Moreau Y, et al. Bayesian network analysis of resistance pathways against HIV-1 protease inhibitors. Infection Genet Evol. 2007; 7(3):382–90.

  24. Rhee S, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res. 2003; 31(1):298–303.

  25. Shafer RRK, Gonzales MJ. The genetic basis of HIV-1 resistance to reverse transcriptase and protease inhibitors. AIDS Rev. 2000; 2(4):211–28.

  26. Johnson A, Calvez V, Günthard H, Paredes R, Pillay D, Shafer R, et al. 2011 update of the drug resistance mutations in HIV-1. Top Antivir Med. 2011; 19(4):156–64.

  27. Jain AK, Murty NM, Flynn PJ. Data clustering: A review. ACM Comput Surveys. 1999; 31(3):264–323.

  28. Melikian GL, Rhee S-Y, Varghese V, Porter D, White K, Taylor J, et al. Non-nucleoside reverse transcriptase inhibitor (NNRTI) cross-resistance: implications for preclinical evaluation of novel NNRTIs and clinical genotypic resistance testing. J Antimicrob Chemother. 2013; 69(1):12–20.

  29. Kriegel HP, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data. 2009; 3(1):1–58.

  30. R Development Core Team. R: A language and environment for statistical computing. 2008. http://www.R-project.org. ISBN 3-900051-07-0.

  31. Araújo LV, Calvez V, Ferreira JE. HIV drug resistance analysis tool based on process algebra. In: Proceedings of the 2008 Symposium on Applied Computing. Ceara Brazil: Fortaleza: 2008. p. 1358–63.

  32. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. Cluster: Cluster analysis basics and extensions. 2014. R package version 1.15.3 — For new features, see the ‘Changelog’ file (in the package source).

  33. Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognit Lett. 2010; 31(8):651–66.

  34. Calazans A, Brindeiro R, Brindeiro P, Verli H, Arruda M, Gonzalez L, et al. Low accumulation of L90M in protease from subtype F HIV-1 with resistance to protease inhibitors is caused by the L89M polymorphism. J Infectious Diseases. 2005; 191(11):1961–70.

  35. Mahalingam B, et al. Structural implications of drug resistant mutants of HIV-1 protease: High resolution crystal structures of the mutant protease substrate analogue complexes. Proteins: Structure Function Bioinf. 2001; 43(4):455–64.

  36. Clotet B, et al. Prevalence of HIV protease mutations on failure of nelfinavir-containing HAART: a retrospective analysis of four clinical studies and two observational cohorts. HIV Clin Trials. 2002; 3:316–23.

  37. Molina J, Andrade-Villanueva J, Echevarria J, Chetchotisakd P, Corral J, David N, et al. Once-daily atazanavir/ritonavir versus twice-daily lopinavir/ritonavir, each in combination with tenofovir and emtricitabine, for management of antiretroviral-naive HIV-1-infected patients: 48 week efficacy and safety results of the CASTLE study. Lancet. 2008; 372(9639):646–55.

  38. Brumme Z, John M, Carlson J, Brumme C, Chan D, Brockman M, et al. Phylogenetic dependency networks: Inferring patterns of CTL escape and codon covariation in HIV-1 Gag. PLoS Comput Biol. 2008; 4(11):1000225.

  39. Brumme Z, John M, Carlson J, Brumme C, Chan D, Brockman M, et al. HLA-associated immune escape pathways in HIV-1 subtype B Gag, Pol and Nef proteins. PLoS ONE. 2009; 4(8):6687.

  40. Iven VM, Bock H-H, Boeck PD. Two-mode clustering methods: a structured overview. Stat Methods Med Res. 2004; 13(5):363–94.

  41. Brehm JH, Koontz DL, Wallis CL, Shutt KA, Sanne I, Wood R, et al. Frequent emergence of N348I in HIV-1 subtype C reverse transcriptase with failure of initial therapy reduces susceptibility to reverse-transcriptase inhibitors. Clin Infectious Diseases. 2012; 55:737–45.

Acknowledgments

The authors are grateful for FAPESP grant #11/50761-2, and to CNPq, CAPES and PRP-USP for financial support.

Author information

Corresponding author

Correspondence to Mina Cintho Ozahata.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Mina Cintho designed and implemented the computational tools for HIV data analysis. Roberto Cesar proposed the clustering methods, as well as the bitmap representations for data analysis. Joao E. Ferreira provided the database design and information, and helped with the data analysis. Ester Sabino and Ricardo Diaz supported the clinical and molecular HIV analyses. All authors wrote and revised this manuscript. All authors read and approved the final manuscript.

Rights and permissions

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Ozahata, M.C., Sabino, E.C., Diaz, R.S. et al. Data-intensive analysis of HIV mutations. BMC Bioinformatics 16, 35 (2015). https://doi.org/10.1186/s12859-015-0452-0

  • DOI: https://doi.org/10.1186/s12859-015-0452-0

Keywords