Introduction

Helicobacter pylori contributes worldwide to various gastro-duodenal diseases ranging from chronic gastritis to the development of peptic ulcer disease, gastric cancer (GC) and mucosa-associated lymphoid tissue (MALT) lymphoma1,2. Significant variation exists in the prevalence and incidence of peptic ulcer disease and gastric cancer due to infection with H. pylori among the different ethnic populations in Malaysia (Malays, Chinese and Indians)3,4. The H. pylori prevalence amongst Indians is 35.6%, but the incidence of peptic ulcer disease and gastric cancer is relatively low. Despite a lower H. pylori prevalence of 28.6% among the Chinese, the incidence of peptic ulcer disease and gastric cancer is relatively high3,5.

Phylogenetic analysis of H. pylori genomes was carried out to determine the ancestral relationships among international isolates. Analysis of seven housekeeping genes of H. pylori by multi-locus sequence typing (MLST) suggests that the bacteria originated in Africa and later split into seven distinct population groups (hpEurope, hpNEAfrica, hpAfrica1, hpAfrica2, hpAsia2, hpSahul and hpEastAsia) and subpopulations that are strongly associated with geographical localization. As for the subpopulations, AE1 and AE2 fall within hpEurope, hspWAfrica and hspSAfrica within hpAfrica1, hspIndia within hpAsia2, and hspEAsia, hspAmerind and hspMaori within hpEastAsia6,7,8. In Malaysia, H. pylori hpAsia2/hspIndia mainly colonises Indian and Malay subjects while hpEastAsia/hspEAsia predominantly colonises Chinese subjects. These two groups (hpEastAsia/hspEAsia and hpAsia2/hspIndia) accounted for 41.5% and 39.0% of all H. pylori isolates, respectively6.

H. pylori genetic factors may play a role in influencing disease outcome, as Chinese strains from Malaysia, which are similar to strains from other areas of high gastric cancer incidence (Japan, Korea and China), predominantly belong to the hspEAsia subpopulation. Meanwhile, local Malay/Indian strains, together with their counterparts from India (high H. pylori prevalence but low incidence of gastric cancer), belong to the hspIndia subpopulation6. Besides H. pylori genetic factors, host and environmental factors may also influence disease outcome.

In Western countries, the H. pylori protein encoded by cytotoxin-associated gene A, CagA, has been shown to be strongly associated with peptic ulcer disease, atrophic gastritis and gastric cancer1,9. However, cagA, carried by almost all Asian H. pylori strains, does not predict disease outcome in this region of the world1,10,11. Previous studies demonstrated that strains whose CagA contains the EPIYA-D motif tend to be more virulent than those carrying the EPIYA-C motif, owing to the higher level of IL-8 secretion induced by EPIYA-D strains12,13. Studies have also shown that CagA containing the EPIYA-D motif, found among East Asian isolates, has a stronger SHP-2 binding affinity than Western CagA. Thus, East Asian isolates are found to be more prevalent in countries with a high prevalence of gastric cancer12,14,15,16,17,18,19. However, Schmidt et al. (2009) and other studies have demonstrated that no association exists between EPIYA motifs and gastro-duodenal disease progression11,20,21. Thus, there is a need to discover new virulence factors by studying the genomic makeup of Asian strains.

H. pylori genetic diversity found in Malaysia, as demonstrated by MLST typing, provides ideal conditions for studying the interaction of co-existing H. pylori populations at the genomic level in a multiethnic society. In this study, we analyzed H. pylori strains from Asians (Malaysian Chinese, Malay, Indian), Japanese and mainland Chinese subjects presenting with different disease status. The aim of the study was to undertake whole genome comparative analysis of these Asian strains to define a subset of H. pylori disease-associated genes that may contribute to gastro-duodenal diseases.

Results

In this study, we carried out comparative genome analysis of 27 H. pylori strains (4 gastric cancer, 10 peptic ulcer and 13 non-ulcer dyspepsia strains). Strains were selected based on a phylogenetic analysis constructed using the full genome sequences (Figure 1). Strains isolated from local Chinese subjects clustered closely with hspEAsia strains from Japan (F30, F32 and F57), China (XZ274 and HLJ271) and Korea (HP51). Most strains from the local Malay and Indian subjects (except UM037) clustered together to form the hspIndia branch. Out of the 27 genomes selected for comparative analysis, 21 were isolated from Malaysia, three from Japan, two from China and one from Korea (Table 1). The genomes of non-Malaysian isolates were obtained from GenBank. Comparative genomic analysis revealed six genes that showed significant association with peptic ulcer disease and/or gastric cancer (Table 2). Three genes were associated with gastric cancer and the remaining three were associated with both peptic ulcer disease and gastric cancer. All six genes were significantly associated with disease (P-value <0.05) when the percentages of identity were compared using Student's t-test (Table 3). Gene ontology and predictions of protein families, domains and functional sites are presented in Table 4. Only genes with predicted structure and functions were selected for further analysis. Pearson's correlation coefficient test was performed to evaluate the correlation between these genes and disease status, as well as the correlation between different genes (Table 5).

Table 1 Details of strains used in this study
Table 2 Candidate genes identified as associated with GC, PUD and NUD, with percentage of identity. A percentage of identity >80%, as generated by RAST, was considered similar
Table 3 Significance of association of candidate genes with disease was examined by Student's t-test
Table 4 Structural and domain prediction of candidate genes by Blast2Go
Table 5 Pearson's correlation coefficient test on the relationship between candidate genes and disease and between different genes
Figure 1

Phylogenetic tree constructed from the whole genomes of 33 H. pylori strains.

Asian strains used in this study mostly clustered together with hspIndia and hspEAsia strains according to host ethnicity (except UM037).

Gastric Cancer

A 456 bp gene encoding an H. pylori membrane protein, GC26_77, was absent in all (0/10) peptic ulcer disease (PUD) strains analyzed. This gene was present in all gastric cancer (GC) (4/4) and non-ulcer dyspepsia (NUD) (13/13) strains (Table 2). While no statistically significant difference (P-value ≥ 0.05) was observed in the presence of this gene between strains isolated from GC and NUD patients, the gene was absent in strains from PUD patients (P-value < 0.001) (Table 3). Pearson's correlation also revealed a strong negative correlation with PUD (Table 5).

Further, the presence of hypothetical ATPase protein GC26_69 was shown to correlate with strains isolated from GC patients. The ATPase gene was detected in 100% (4/4) of GC strains as compared with 30% (3/10) of PUD strains and 0% (0/13) of isolates from NUD patients. Interestingly, the 3 PUD isolates in which the hypothetical ATPase protein GC26_69 gene was detected originated from patients from China, Japan and Korea. None of the PUD strains isolated from Malaysia carried this gene (Table 2). Highly significant differences were observed in the prevalence of this gene in isolates from GC patients as compared with isolates from PUD and NUD patients (P-value <0.001) (Table 3). The association of hypothetical ATPase protein GC26_69 with GC was confirmed by a strong positive correlation in Pearson's analysis (Table 5).

In addition, a hypothetical protein, GC26_73, was detected in all (4/4) GC strains but in only 20% (2/10) and 7.7% (1/13) of PUD and NUD strains, respectively (Table 2). The prevalence of this gene in strains isolated from GC patients differed highly significantly from that in strains from both PUD and NUD patients (P-value <0.001) (Table 3). The association of hypothetical protein GC26_73 with GC was confirmed by a strong positive Pearson's correlation (Table 5).
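
The carriage counts reported in this section can be turned into a simple presence/absence correlation. The Python sketch below is illustrative only: it assumes each strain is coded as 1 (gene present) or 0 (gene absent) and correlates that coding with a 0/1 disease indicator. The coding scheme and the helper names (`pearson_r`, `presence_vector`, `correlation_with`) are assumptions introduced here, and the coefficients in Table 5 may have been derived from different inputs (for example, percentage identity), so the output should not be read as a reproduction of that table.

```python
from math import sqrt

# Carriage counts as reported in the text: (strains carrying the gene, total strains).
GENE_COUNTS = {
    "GC26_77": {"GC": (4, 4), "PUD": (0, 10), "NUD": (13, 13)},
    "GC26_69": {"GC": (4, 4), "PUD": (3, 10), "NUD": (0, 13)},
    "GC26_73": {"GC": (4, 4), "PUD": (2, 10), "NUD": (1, 13)},
}


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def presence_vector(counts):
    """Expand {group: (present, total)} into per-strain 0/1 calls and group labels."""
    calls, groups = [], []
    for group, (present, total) in counts.items():
        calls += [1] * present + [0] * (total - present)
        groups += [group] * total
    return calls, groups


def correlation_with(gene, disease):
    """Correlate presence/absence of `gene` with membership of the `disease` group."""
    calls, groups = presence_vector(GENE_COUNTS[gene])
    indicator = [1 if g == disease else 0 for g in groups]
    return pearson_r(calls, indicator)


if __name__ == "__main__":
    for gene, disease in [("GC26_77", "PUD"), ("GC26_69", "GC"), ("GC26_73", "GC")]:
        print(f"{gene} vs {disease}: r = {correlation_with(gene, disease):+.2f}")
```

Under this binary coding, GC26_77 is perfectly negatively correlated with PUD by construction, because the gene is absent from every PUD strain and present in every other strain. With two binary variables, Pearson's coefficient is equivalent to the phi coefficient of the corresponding 2 x 2 table.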

Peptic Ulcer Disease and Gastric Cancer

Genes encoding the H. pylori outer membrane protein GC26_66, phospho-2-dehydro-3-deoxyheptonate aldolase and hypothetical protein GC26_33 were all highly associated with both GC and PUD strains. The gene encoding outer membrane protein GC26_66 was present in all GC strains (4/4) and 90% (9/10) of the PUD strains. The outer membrane protein gene was not detected in any of the six Chinese NUD strains (Table 2). A highly statistically significant difference was observed between the prevalence of the outer membrane protein gene in GC strains and Chinese NUD strains (P-value <0.001) (Table 3). While outer membrane protein GC26_66 was also present in NUD strains from local Malays and Indians, alignment of the translated protein sequences revealed that these local Malay and Indian strains from NUD patients differed from GC/PUD strains in the 2nd to 5th amino acid positions (Figure 2). For the purpose of discussion, we refer to the variant found in the GC/PUD strains as type 1 and the variant found in the local Malay and Indian NUD strains as type 2, in line with the grouping shown in Figure 2. Outer membrane protein GC26_66 (type 1) was found to have a strong negative Pearson's correlation with NUD (Table 5).

Figure 2

Sequence alignment of translated outer membrane protein GC26_66 gene.

According to the translated peptide, sequences were grouped as type 1 and type 2. Type 1 strains were GC26, F32, F57, XZ274, UM065, UM066, UM077, F30, HP51 and HLJ271. Type 2 strains were UM067, UM084, UM114, FD662, FD703, FD719, FD423, FD430, FD535 and UM037. There was close correlation with host ethnicity.

The phospho-2-dehydro-3-deoxyheptonate aldolase gene was present in all of the GC strains (4/4) and 90% (9/10) of the PUD strains. In contrast, this gene was found in only 15.4% (2/13) of the NUD strains (Table 2). A statistically significant increase in the prevalence of this gene was observed in isolates from GC patients as compared with isolates from NUD patients (P-value <0.001) (Table 3). Phospho-2-dehydro-3-deoxyheptonate aldolase was also found to have a strong negative Pearson's correlation with NUD (Table 5).

The gene encoding hypothetical protein GC26_33 was present in all GC strains (4/4), 60% (6/10) of the PUD strains and only 7.7% (1/13) of the NUD strains (Table 2). Comparison of the prevalence of this gene in H. pylori strains isolated from GC and PUD patients showed a significantly increased prevalence (P-value <0.05) in strains from patients with GC. Comparison of strains isolated from GC and NUD patients likewise showed a significantly increased prevalence in isolates from GC patients (P-value <0.001) (Table 3). Hypothetical protein GC26_33 was also found to have a strong negative Pearson's correlation with NUD (Table 5).

Discussion

In this study, we identified six genes that were associated with H. pylori-related disease status. In agreement with previous studies, no single H. pylori genetic marker on its own was shown to be associated with any specific disease group22,23. However, it is possible that these genes, working individually or in combination with other genetic elements, are potential risk factors influencing the severity of disease or disease progression. Based on Pearson's analysis of the correlation between disease status and genes, as well as the correlation between different genes (Table 5), a correlation map was prepared to summarize the relationship between disease and genes (Figure 3). This map illustrates the complexity of H. pylori factors that may affect disease outcome. From the map, the number of genes that potentially influence disease status appears to increase with severity of disease. However, it should be noted that the correlation map only illustrates the co-existence of genes and does not imply that these genes actually interact or work together to have an impact on disease development.

Figure 3

Correlation map of candidate genes with disease states and between genes.

The correlation was determined by Pearson's correlation (Table 5). Dotted lines represent negative correlation and solid lines represent positive correlation. Bold lines represent strong correlation with significance <0.01 and fine lines represent correlation with significance <0.05. Genes outside the boxes are genes that were present in the respective disease strains (Student's t-test) but showed no significant correlation with the respective disease status (Pearson's).

Based on our comparative genomic analysis, the presence of H. pylori phospho-2-dehydro-3-deoxyheptonate aldolase in the absence of membrane protein GC26_77 was demonstrated to be a risk factor for the development of peptic ulcer disease. In contrast, outer membrane protein GC26_66 (type 1), hypothetical ATPase protein GC26_69, hypothetical protein GC26_73 and hypothetical protein GC26_33 were shown to be risk factors for gastric cancer development. Membrane protein GC26_77 may have a protective effect against peptic ulcer disease. However, these data were based on bioinformatic analysis of strains from Malaysia and a few strains from other parts of Asia. There is a need to validate these results experimentally in the laboratory. Furthermore, it will be important to screen for these genes and estimate their carriage rate in a larger number of strains with different disease presentations from other parts of Asia as well as other parts of the world.

H. pylori is an ancient and permanent resident of the human stomach and has likely been part of the gastric microbiome since the origin of the human species. Given this, it is not surprising that more than 80% of those infected with H. pylori remain asymptomatic. It has been estimated that H. pylori-positive individuals have a 10 to 20% lifetime risk of developing peptic ulcer disease and a 1 to 2% risk of developing gastric cancer24. While there is strong evidence to support the role of H. pylori in gastro-duodenal disease, it has also been suggested that the decline in H. pylori infection among human populations has contributed to the increase in other diseases (e.g., esophageal adenocarcinoma [EAC], allergic asthma, rhinitis and atrophy)25. Furthermore, it has been reported that in adult males H. pylori colonization is associated with reduced circulating leptin levels, a finding that may explain the observation that significant weight gain may occur after H. pylori eradication26. In addition, Blaser, M.J. and Atherton, J.C. have suggested that the fall in H. pylori prevalence in developed countries may also contribute to a high risk of metabolic syndrome, type II diabetes and metabolic obesity27. Current literature suggests that infection with H. pylori is rapidly disappearing in developed countries28.

The risk factors identified here are potential biomarkers for identifying strains that carry a higher risk of severe gastro-duodenal complications. An approach in which selective eradication therapy is given to individuals infected with strains carrying these risk genes, instead of to all those infected with H. pylori, would also help to preserve the usual gut microbiota. An understanding of H. pylori genetic factors and their association with severe gastro-duodenal diseases is necessary to decide on the optimal management of H. pylori infections.

A similar study examined the genomic characteristics of 84 H. pylori isolates from China with differing clinical status using microarrays29. In that study, regions associated with genes involved in bacterial restriction-modification (R-M) systems and the type IV secretion system were linked to disease status. However, those genes were not identified in the present study; instead, a different set of genes was identified. This inconsistency could be due to the different approaches adopted and the strains selected for these studies. The microarray was designed based on 6 sequenced strains of European and American origin. Thus, genomic variations present only in Asian strains but not in Western strains cannot be identified using the microarray approach. Although the next-generation sequencing approach is not limited by probe design, it can be limited by the quality of the sequencing data. In addition, the higher cost of the next-generation sequencing approach complicates mass screening of large numbers of strains.

In summary, using a comparative genomic approach, we have identified a number of H. pylori genetic factors that may enable the identification of strains associated with a risk of peptic ulcer disease and gastric cancer. However, screening of a larger number of H. pylori strains from different disease groups is also required. Furthermore, it would be of interest to investigate the functions of these largely hypothetical proteins and how they interact to contribute to H. pylori-induced pathogenesis.

Methods

Sample background

Twenty-one H. pylori strains from Malaysia (GC26, UM023, UM065, UM066, UM077, UM067, UM084, UM114, FD506, FD568, FD577, UM038, UM085, UM111, FD662, FD719, FD703, FD423, FD430, FD535 and UM037) were isolated from symptomatic patients undergoing endoscopy at University of Malaya Medical Centre (UMMC, Kuala Lumpur, Malaysia) between 2007 and 2012 (Table 1). Based on endoscopic and histological examinations, patients were diagnosed as having gastric cancer (GC), peptic ulcer disease (PUD) or non-ulcer dyspepsia/functional dyspepsia (NUD/FD). All biopsies were obtained with the informed consent of patients and the approval of the Human Ethics Committees of UMMC and UNSW. The genomes of the twenty-one Malaysian H. pylori strains were sequenced and assembled de novo as previously described30. For comparison, we downloaded the genome sequences of another 6 Asian H. pylori strains from the National Center for Biotechnology Information (NCBI) GenBank. Further information on these strains is provided in Table 1. Only F30, F32, F57 and HP51 are complete genomes; the rest are available as drafts.

Sequence Alignment and Phylogenetic Analysis

In addition to the 27 genomes analyzed in this study, six additional strains, regardless of East Asian origin, were downloaded from the NCBI database, giving a total of 33 strains. These genomes were aligned with the MAUVE (version 2.3.1) progressive alignment software, and the output was viewed as a Super Network using the SplitsTree program (version 4.12.8).

Downstream analysis

Annotation of all twenty-seven H. pylori strains was performed using Rapid Annotation using Subsystem Technology (RAST) version 4.031. Comparative genomic analysis was performed using the sequence comparison function available in the SEED viewer version 2.0. A list of core genes present among all strains belonging to the same group was generated by SEED for each disease group (GC, PUD and NUD). FASTA files containing the sequences of these core genes were generated for the respective disease groups. These sequences were then annotated with RAST and compared against the other H. pylori strains in Table 1 to obtain the percentage of similarity. Student's unpaired two-tailed t-test was performed, and genes with P-values of <0.05 and <0.01 were considered statistically significant and highly significant, respectively. Two-tailed Pearson's correlation analysis was used to examine the correlation between candidate genes and disease, as well as between different genes.
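
The logic of these steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RAST/SEED pipeline itself: the gene identifiers, strain names and identity values in the usage example are hypothetical, the function names (`core_genes`, `is_present`, `student_t`) are introduced here for illustration, and the >80% cut-off is taken from the Table 2 legend. The sketch computes the pooled-variance Student's t statistic and its degrees of freedom only; converting these to a two-tailed P-value requires a t-distribution function from a statistics library, which is omitted to keep the example dependency-free.

```python
from math import sqrt

IDENTITY_THRESHOLD = 80.0  # percent; identity >80% is treated as "similar" (Table 2 legend)


def core_genes(gene_sets):
    """Return the genes shared by every strain in a group (set intersection)."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets) if sets else set()


def is_present(identity_percent, threshold=IDENTITY_THRESHOLD):
    """Call a gene present in a strain when its identity exceeds the threshold."""
    return identity_percent > threshold


def student_t(a, b):
    """Unpaired Student's t statistic (pooled variance) and degrees of freedom.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


if __name__ == "__main__":
    # Hypothetical gene content for three strains of one disease group.
    group = {
        "strain_A": {"gene_1", "gene_2", "gene_3"},
        "strain_B": {"gene_1", "gene_3"},
        "strain_C": {"gene_1", "gene_3", "gene_4"},
    }
    print("core genes:", sorted(core_genes(group.values())))

    # Hypothetical percentage identities of one candidate gene in two groups.
    identity_gc = [99.5, 98.7, 100.0, 99.1]
    identity_nud = [0.0, 12.0, 5.5, 0.0, 8.0]
    print("carriage in GC:", sum(is_present(v) for v in identity_gc), "of", len(identity_gc))
    t, df = student_t(identity_gc, identity_nud)
    print(f"t = {t:.2f}, df = {df}")
```

Comparing percentage identities rather than binary presence calls keeps partial matches in the analysis, at the cost of making the test sensitive to how absent genes are scored (here, hypothetically, as 0% identity).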

Accession Numbers

The accession numbers of the H. pylori genome sequences reported in this paper are: GC26 (AKHV00000000), UM023 (AUSK00000000), UM065 (AUSM00000000), UM066 (AUSJ00000000 and CP005493), UM077 (AUSQ00000000), UM067 (AUSN00000000), UM084 (AUSO00000000), UM114 (AUSS00000000), FD506 (AKHO00000000), FD568 (AKHQ00000000), FD577 (AKHR00000000), UM038 (AUSL00000000), UM111 (AUSR00000000), FD662 (AKHT00000000), FD719 (AKHU00000000), FD703 (AKHS00000000), FD423 (AKHM00000000), FD430 (AKHN00000000), FD535 (AKHP00000000), UM037 (AUSI00000000 and CP005492), UM085 (AUSP00000000), F30 (AP011941), F32 (AP011943), F57 (AP011945), HP51 (CP000012), XZ274 (CP003419) and HLJ271 (ALKB00000000).