siRNA off-target effects in genome-wide screens identify signaling pathway members

Buehler, Eugen; Khan, Aly A.; Marine, Shane; Rajaram, Misha; Bahl, Amit; Burchard, Julja; Ferrer, Marc

doi:10.1038/srep00428

Download PDF

Article
Open access
Published: 29 May 2012

siRNA off-target effects in genome-wide screens identify signaling pathway members

Eugen Buehler¹,
Aly A. Khan²,
Shane Marine³,
Misha Rajaram⁴,
Amit Bahl⁵,
Julja Burchard⁵ &
…
Marc Ferrer¹

Scientific Reports volume 2, Article number: 428 (2012) Cite this article

5646 Accesses
52 Citations
1 Altmetric
Metrics details

Subjects

Abstract

We introduce a method for analyzing small interfering RNA (siRNA) genetic screens based entirely on off-target effects. Using a screen for members of the Wnt pathway, we demonstrate that this method identifies known pathway components, some of which are not present in the screening library. This technique can be applied to siRNA screen results retroactively to confirm positives and identify genes missed using conventional methods for on-target gene selection.

Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing

Article 30 March 2020

Joseph M. Replogle, Thomas M. Norman, … Britt Adamson

High-content CRISPR screening

Article 10 February 2022

Christoph Bock, Paul Datlinger, … Xiaowei Zhuang

Tutorial: design and execution of CRISPR in vivo screens

Article 15 July 2022

Christian J. Braun, Andrés Carbonell Adames, … Roland Rad

Introduction

Large-scale siRNA screening is an increasingly popular method of interrogating signaling pathways both to identify relevant genes and to discover novel drug targets. siRNA screens typically involve transfecting small numbers of cells with a library designed to target the entire genome or some subset (such as a set of genes coding for theoretically druggable domains)¹. The systematic knockdown of each gene enables high-throughput loss-of-function analysis by assessing pathway dependent phenotypes as measured through changes in the appearance of specific markers or induction of reporter constructs (Fig. 1a).

Unfortunately, siRNA screens have demonstrated a high false positive rate². Researchers typically perform labor-intensive follow-up work on hundreds of hits to confirm a handful of relevant genes. Many false positives are likely due to off-target effects^3,4, wherein partial complementarity between an siRNA and multiple transcripts, typically in the 3′UTR, results in their down-regulation, adding unintended silencing to the screen (Fig. 1b). Previous work on attenuating off-target effects has largely focused on identifying lower-risk sequences, introduction of chemically modified siRNAs, or use of multiple siRNA sequences in additional screens⁵.

Results

In order to understand and exploit the off-target effects present in siRNA screening data, we implemented a predictive model of down-regulation due to siRNA off-targeting. Existing predictors are microRNA-related and often use conservation or other criteria not applicable to siRNA off-targeting⁶. We trained a simple linear model specific for siRNAs using published gene expression profiles in which off-targeting mediated by the seed (positions 2–8 of the guide strand) has been detected⁷. Our model for off-target seed-based down-regulation is:

The model includes four types of seed matches, or reverse complementarity between the guide strand seed sequence and the 3′UTR of the transcript: PM, perfect match to guide bases 2–7 followed by adenine opposite base 1; M1, no adenine opposite base 1; M8, mismatch opposite base 1; and M18, with both terminal mismatches. We calculate as predictive variables the number of times a particular match-type occurs between the seed sequence of the siRNA n and the 3′UTR of the transcript k. In addition, we use the length of the 3′UTR as a predictive variable, since in our analyses longer 3′UTRs were correlated with up-regulation of transcripts in the absence of siRNA seed matches. Up-regulation of genes with long 3′UTRs and a lower density of siRNA matches has been previously observed in the literature⁸. Model parameters (β) are derived from linear regression against the mean log ratio of mRNA expression data (see Fig. 2, Supplementary Data Set 1 and Methods).

We hypothesize that siRNAs off-targeting to pathway members can in part explain the phenotypes observed in genome-wide screens (Fig. 1c). To investigate this, we introduce Haystack, a new computational method for identifying the most statistically significant genes that explain screen results, based entirely on off-target effects. First, we predict all O_n,k, the off-target down-regulation of every transcript k by every siRNA n. Second, for each transcript we calculate the significance of correlation between the predicted down-regulation by every siRNA to the siRNAs' screen results z, using a t-test. Third, the most statistically significant transcript is selected and a linear parameter a_k is estimated, approximating z as a function a_k * O_k + c. Finally, the residual between z and the predicted values of z in this linear model is calculated and the next transcript is selected via the significance of the correlation of each remaining O_k to the residual. In this stepwise manner, the most statistically significant transcripts are selected and added iteratively as features to a linear model, until no transcript has a Bonferonni-corrected correlation p-value less than 0.01. The final model can be viewed as predicting the phenotypic score associated with an siRNA z_n as a linear combination of the predicted off-target effects of siRNA n on a set of transcripts K (with some constant intercept term c):

We assess the transcripts selected based on a_k, the magnitude of the off-target effects on transcript k in explaining the screening results. The directionality of a_k indicates the effect (either positive or negative) that down-regulation of each transcript has on the assay readout.

We applied Haystack to 19,815 siRNAs used in screening 6,605 theoretically druggable genes for activity in the Wnt/β-catenin signaling pathway ( Supplementary Data Set 2 ). The Wnt/β-catenin pathway is constitutively active in many human cancers. To screen for novel factors in the Wnt/β-catenin pathway, HT1080 sarcoma cells were engineered to contain a firefly luciferase reporter coupled to a β-catenin-driven promoter, activated in the screen by conditioned media containing Wnt-3a. A control EF1α-driven Renilla luciferase reporter was used for normalization. Three siRNAs per gene were transfected individually into the reporter cell line in three separate screens. We calculated z-scores for the siRNAs from the log ratio of reporter intensities.

Table 1 lists, ordered by p-value, the top 10 genes included in the model built via Haystack from the siRNA screens in combination. Predicted activities per gene correlated well between screens when analyzed separately (Fig. 3). Supplementary Table 1 contains all 61 hits identified. In the case of the Wnt pathway, a large number of “canonical” pathway members have been previously identified. To measure pathway enrichment in screening results, we used 158 Wnt related genes from the KEGG pathway database⁹. Of the top 10 most statistically significant transcripts, 6 (LEF1, AXIN2, CCND1, RAC1, CTNNB1 and LRP6) are in the KEGG list of Wnt pathway genes, a remarkable level of enrichment (P < 3.7e-11; hypergeometric test for enrichment).

Table 1 Top 10 most statistically significant genes identified by Haystack Analysis. Genes that are part of the canonical Wnt pathway are in bold (LEF1, AXIN2, CCND1, RAC1, CTNNB1 and LRP6). Z-scores and their respective ranks for three different siRNAs designed against these genes are displayed. Scores and ranks in the top 10% are in bold and would be considered hits. Some z-scores are not present (--) because the library used in the screen does not contain any siRNAs designed against these genes

Full size table

We reviewed literature on the genes present in Table 1 but not listed in KEGG to check for other associations to the Wnt pathway. SUMO2, the third most statistically significant hit, has been identified as sumoylating and repressing LEF1¹⁰. Consistent with that role, its parameter estimate in the regression indicates that down-regulation of SUMO2 increases reporter activity. Another hit ( Supplementary Data Set 3 ), SENP7, has been shown to reverse the sumoylation of SUMO2/3¹¹ and accordingly the regression analysis shows that its down-regulation decreases Wnt reporter activity. Finally, the transcription factor POU4F2 (BRN3B) is involved in both cell development¹² and cancer¹³, as is the Wnt pathway, making a role for POU4F2 in Wnt signaling seem plausible.

It is worth noting that three transcripts identified in the top 10 hits (RAC1, CLEC18B and POU4F2) did not have siRNAs designed against them in the druggable library screened here. Although RAC1 is not included in the library, it is found by Haystack analysis and is a canonical member of the Wnt pathway. Since off-target effects span the transcriptome, Haystack can detect biologically relevant genes for which no siRNA exists in our screening library.

Discussion

Haystack analysis can be applied to any previous siRNA screen of sufficient scale, preferably with normally distributed scores, to confirm hits via an orthogonal metric or to identify genes originally missed due to library composition. Unfortunately, it is not yet standard for raw data from siRNA screens to be published, as microarray expression profiling data are. Consequently, Haystack is available as a download (http://rnai.nih.gov/haystack/Software.zip) or web-based service (http://rnai.nih.gov/haystack/) for researchers who have performed screens to analyze their data and confirm hits or find new hits potentially missed by “on-target” analysis.

Finally, although a genome-specific siRNA library would be expected to have relatively poor on-target performance against cells from another organism, a large-scale library would likely have the seed sequence diversity required to perform Haystack analysis on any organism having a RNA interference pathway. Thus Haystack also presents a novel solution for siRNA screens in cells from organisms for which no library is yet available.

Methods

Software used

Statistical analysis and graphs were generated using R¹⁴. ROC curves were generated using the ROCR package¹⁵. The kernel density plot of 3′UTR length distributions was generated using the sm package¹⁶.

3′UTR sequences

Human 3′UTR sequences (a total of 34268 sequences) from the Refseq¹⁷ collection were downloaded using the UCSC table browser¹⁸. These sequences were matched against the IDs for the gene expression data sets for purposes of training the off-target model. To create the matrix of predicted off-target effects, 170 sequences that were shorter than 20 nucleotides and 2554 sequences that were non-coding (NR prefix in accession) were removed. In addition, 10852 sequences were removed because they were identical to a 3′UTR already included. The remaining 20692 3′UTR sequences were used to create the off-target matrix.

Gene expression data

Expression data previously shown to be enriched with transcripts that were down-regulated and contained seed matches to transfected siRNAs in their 3′ UTRs⁷ were downloaded from the GEO website¹⁹ and used in training and testing the siRNA off-target model.

Training and testing the off-target model

Features for predicting siRNA off-target down regulation were selected based on the previously observed importance of the seed sequence^7,20 (bases 2–7 of the guide strand) and the observed preference for adenine opposite from base 1 of the guide strand²¹. Although all possible match types to the seed including one or two base-mismatch were considered, only four match types (PM, M1, M8, M18) were statistically significant predictors of down-regulation. The reverse complement of the guide strand seed sequence followed by an adenine forms a perfect match (PM) and mismatches to base 1 (M1), base 8 (M8), or both (M18) form the four different seed match types selected. In addition, we used the length of the 3′UTR as a predictive variable, since in our analyses longer 3′UTRs were correlated with up-regulation of transcripts in the absence of siRNA seed matches. Linear models predicting the down-regulation of a transcript based on the number of each match type observed in the 3′UTR and the total length of the 3′UTR were trained separately on gene expression data from the transfection of two different siRNAs. These models were then cross-validated on the data not used to train them and finally the data sets ( Supplementary Data Set 1 ) were combined to train a final model (Fig. 2).

siRNA libraries

The custom siRNA libraries used in screening were synthesized by Sigma–Proligo. siRNAs were designed to have sequence asymmetry to increase efficacy and had less than 17 bp of complementarity to other genes²². The library contains 6605 sets of 3 siRNAs each designed to theoretically druggable genes.

siRNA screening data

Data used to evaluate these methods came from a genome-scale siRNA screen^23,24 in the HT1080 sarcoma cell line for genes involved in the Wnt/β-catenin pathway. Cells were engineered to contain a luciferase reporter coupled to a β-catenin-driven promoter, which was subsequently activated with conditioned media containing Wnt-3a. In primary screening, both a whole genome and a druggable library were screened in pools with at least three replicates. In addition, all single siRNAs contained within the druggable pools were assayed separately. All assay results were normalized to a plate-based non-targeted control to yield a z-score.

Correlation and statistical significance calculations

Correlation between the screen results z (or later the residual) and the predicted off-target effects of the library on a given transcript were calculated using the formula:

where O_n,k is our prediction of the off-target down-regulation of each mRNA transcript k by each siRNA n. The t-statistic of the null hypothesis that the correlation is actually zero is then calculated as:

with the associated p-value of the null hypothesis being derived from the Student's t-distribution with n-2 degrees of freedom, where n is the number of siRNA screening results.

Since the off-target model makes the same predictions for every siRNA with the same heptamer seed sequence (bases 2–8 of the guide strand) and there are only 16384 possible heptamers, we can pre-compute the predictions for each 3′UTR. Using this simplification, we do not need a separate set of predictions for each siRNA library. To use this pre-computed matrix to calculate Haystack t-statistics for a set of single siRNAs used in a screen, we first calculate the mean assay response for all siRNAs that contain the same seed sequence. We then calculate the weighted correlation coefficient and corresponding t-statistic between the predicted down-regulation caused by each seed sequence and the mean observed response of that seed sequence in the assay (or the residual in later steps), where the weights are the number of occurrences of each seed sequence in the library. A weighted linear regression is performed at each iteration between the selected O_k and the mean assay response for each seed. Likewise, it is possible to use Haystack analysis on siRNA screens performed using pools by calculating the mean response of the assay for all pools in which one of the component siRNAs has a given seed sequence and then weighting the correlation coefficient and linear regression by the number of pools in the screen containing that seed sequence. PERL and R code for pre-computing off-target predictions from a FASTA formatted file of 3′UTRs and calculating these t-statistics from mean seed sequence responses are supplied from the Haystack website (http://rnai.nih.gov/haystack/Software.zip).

Validation of statistical techniques

The validity of using Pearson's correlation coefficient, the t-statistic and corresponding p-value to estimate the statistical significance of non-normally distributed vectors has been a subject of some debate²⁵ and seems to depend on the size of the vectors and the particular characteristics of the distributions considered. Although the z-scores generated from this set of screening data are fairly normal in their distribution, the predictions of off-target effects to which we are comparing them are non-normal. Therefore, we tested how well the Student's T distribution was approximating the distribution expected at random.

Z-scores from the druggable singles collection were randomly permuted against their original heptamers. Fig. 4a shows a q-q plot of the t-statistics resulting from the random permutation (in red) and the distribution observed for the original non-permuted data (in blue) compared to the Student's T-distribution with 10029 degrees of freedom (which is the length of the vectors for which the null hypothesis of zero correlation is being tested, minus 2). As expected, the quantiles from our randomly permuted data track well against the Student's t-distribution, while the non-permuted data does not.

We next performed 1000 random permutations of the same data set and noted transcripts for which the Bonferroni corrected p-value (calculated from the t-statistic) was less than 0.01. Fig. 4b shows the distribution of false positives observed in 1000 trials. Although the majority of permutations did not result in any false positives, 20.3% of the permutations resulted in one or more false positives, which is significantly greater than the 1% of false positives that we would expect if Student's t-distribution perfectly modeled the null hypothesis we are testing. The mean number of false positives per random permutation was 0.263. In comparison to the 1217 putative positives observed for the original, non-permuted data, we judge the risk of false positives to be small and error in this approximation to be acceptable, given that the alternative would be to use non-parametric statistics or random permutation to estimate p-values, either of which would significantly reduce our sensitivity and increase the already significant computation time required.

To further investigate the source of false positives, we plotted the distribution of 3′UTR length for the false positives identified in the 1000 random permutations versus all transcripts analyzed and those transcripts identified as positives in the non-permuted data (Fig. 4c). There is a significant shift towards smaller 3′UTRs in the set of false positives compared to the general population. This may indicate that the approximation of the Student's T distribution is slightly less valid for shorter transcripts. Similarly, transcripts identified as positives in the non-permuted set are shifted towards longer 3′UTRs than the set of all 3′UTRs. We believe this may be the case because true positives with short 3′UTRs have a smaller “off-target signature”, being down-regulated by fewer siRNAs off-target and thus presenting less signal for detection by the method. It is also possible that short 3′UTRs represent incorrect or incomplete sequences. Thus we conclude that transcripts with short 3′UTRs may be more likely to be both false positives and false negatives in Haystack analysis.

References

Mohr, S., Bakal, C. & Perrimon, N. Genomic Screening with RNAi: Results and Challenges. Annual Review of Biochemistry 79, 37–64 (2010).
Article CAS Google Scholar
Echeverri, C. J. et al. Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Methods 3, 777–779 (2006).
Article CAS Google Scholar
Birmingham, A. et al. 3′ UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Methods 3, 199–204 (2006).
Article CAS Google Scholar
Ma, Y., Creanga, A., Lum, L. & Beachy, P. A. Prevalence of off-target effects in Drosophila RNA interference screens. Nature 443, 359–363 (2006).
Article ADS CAS Google Scholar
Jackson, A. L. & Linsley, P. S. Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. Nature reviews 9, 57–67 (2010).
CAS PubMed Google Scholar
Hammell, M. Computational methods to identify miRNA targets. Seminars in cell & developmental biology 21, 738–744 (2010).
Article CAS Google Scholar
Jackson, A. L. et al. Widespread siRNA “off-target” transcript silencing mediated by seed region sequence complementarity. Rna 12, 1179–1187 (2006).
Article CAS Google Scholar
Khan, A. A. et al. Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs. Nat Biotechnol 27, 549–555 (2009).
Article CAS Google Scholar
Kanehisa, M., Goto, S., Furumichi, M., Tanabe, M. & Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research 38, D355–D360 (2010).
Article CAS Google Scholar
Sachdev, S. et al. PIASy, a nuclear matrix-associated SUMO E3 ligase, represses LEF1 activity by sequestration into nuclear bodies. Genes Dev 15, 3088–3103 (2001).
Article CAS Google Scholar
Shen, L. N., Geoffroy, M. C., Jaffray, E. G. & Hay, R. T. Characterization of SENP7, a SUMO-2/3-specific isopeptidase. Biochem J 421, 223–230 (2009).
Article CAS Google Scholar
Xiang, M. et al. Role of the Brn-3 family of POU-domain genes in the development of the auditory/vestibular, somatosensory and visual systems. Cold Spring Harb Symp Quant Biol 62, 325–336 (1997).
Article CAS Google Scholar
Budhram-Mahadeo, V. S. & Latchman, D. S. Targeting Brn-3b in breast cancer therapy. Expert Opin Ther Targets 10, 15–25 (2006).
Article CAS Google Scholar
R Development Core Team, R. F. F. S. C. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing 1, 2673 (2008).
Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. .Bioinformatics 21, 3940–3941 (2005).
Article CAS Google Scholar
Bowman, A. W. & Azzalini, A. Computational aspects of nonparametric smoothing with illustrations from the sm library. Computational Statistics & Data Analysis 42, 545–560 (2003).
Article MathSciNet Google Scholar
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61–65 (2007).
Article CAS Google Scholar
Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32, D493–496 (2004).
Article CAS Google Scholar
Barrett, T. et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research 37, D885–D890 (2009).
Article CAS Google Scholar
Lin, X. et al. siRNA-mediated off-target gene silencing triggered by a 7 nt complementation. Nucleic Acids Research 33, 4527–4535, 10.1093/nar/gki762 (2005).
Article CAS PubMed PubMed Central Google Scholar
Lewis, B. P., Burge, C. B. & Bartel, D. P. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20 (2005).
Article CAS Google Scholar
Majercak, J. et al. LRRTM3 promotes processing of amyloid-precursor protein by BACE1 and is a positional candidate gene for late-onset Alzheimer's disease. Proc Natl Acad Sci U S A 103, 17967–17972 (2006).
Article ADS CAS Google Scholar
Major, M. B. et al. New regulators of Wnt/beta-catenin signaling revealed by integrative molecular screening. Sci Signal 1, ra12 (2008).
PubMed Google Scholar
Tang, W. et al. A genome-wide RNAi screen for Wnt/beta-catenin pathway components identifies unexpected roles for TCF transcription factors in cancer. Proc Natl Acad Sci U S A 105, 9697–9702 (2008).
Article ADS CAS Google Scholar
Kowalski, C. J. On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient. Journal of the Royal Statistical Society. Series C (Applied Statistics) 21, 1–12 (1972).
Google Scholar

Download references

Acknowledgements

Thanks to Christina Leslie of Sloan-Kettering Institute, Steven Bartz, Peter Linsley and Alan Sachs of Merck & Co., Inc. for their guidance, encouragement and support.

Author information

Authors and Affiliations

National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, MD 20892
Eugen Buehler & Marc Ferrer
Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, NY 10065
Aly A. Khan
Automated Biotechnology, Merck Research Laboratories, PA 19454, North Wales
Shane Marine
Cardiovascular Research Institute, University of California, San Francisco, CA 94143, San Francisco
Misha Rajaram
Merck & Co.Inc., West Point, PA 19486
Amit Bahl & Julja Burchard

Authors

Eugen Buehler
View author publications
You can also search for this author in PubMed Google Scholar
Aly A. Khan
View author publications
You can also search for this author in PubMed Google Scholar
Shane Marine
View author publications
You can also search for this author in PubMed Google Scholar
Misha Rajaram
View author publications
You can also search for this author in PubMed Google Scholar
Amit Bahl
View author publications
You can also search for this author in PubMed Google Scholar
Julja Burchard
View author publications
You can also search for this author in PubMed Google Scholar
Marc Ferrer
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

EB conceived, developed and tested the algorithm, software and related statistics, wrote the original manuscript and produced all the supplementary materials. AAK helped significantly revise the manuscript and made Figure 1, as well as testing alternative methods of feature selection. SM generated all of the screening data used. EB and JB developed the mathematical model of siRNA off-targeting. SM and MF provided feedback on biological aspects of the manuscript. MF supervised and guided the siRNA screen. MR explored alternative methods of deriving parameters for the off-target model.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Dataset 1

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareALike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

Reprints and permissions

About this article

Cite this article

Buehler, E., Khan, A., Marine, S. et al. siRNA off-target effects in genome-wide screens identify signaling pathway members. Sci Rep 2, 428 (2012). https://doi.org/10.1038/srep00428

Download citation

Received: 28 March 2012
Accepted: 14 May 2012
Published: 29 May 2012
DOI: https://doi.org/10.1038/srep00428

This article is cited by

OSM potentiates preintravasation events, increases CTC counts, and promotes breast cancer metastasis to the lung
- Ken Tawara
- Celeste Bolin
- Cheryl L. Jorcyk
Breast Cancer Research (2018)
Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration
- James M. McFarland
- Zandra V. Ho
- Aviad Tsherniak
Nature Communications (2018)
Evaluation and control of miRNA-like off-target repression for RNA interference
- Heeyoung Seok
- Haejeong Lee
- Sung Wook Chi
Cellular and Molecular Life Sciences (2018)
Seed-effect modeling improves the consistency of genome-wide loss-of-function screens and identifies synthetic lethal vulnerabilities in cancer cells
- Alok Jaiswal
- Gopal Peddinti
- Tero Aittokallio
Genome Medicine (2017)
Knocking down the obstacles to functional genomics data sharing
- Kaylene J. Simpson
- Jennifer A. Smith
Scientific Data (2017)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.