Statistics > Methodology
[Submitted on 11 Mar 2010 (this version), latest version 1 Mar 2012 (v3)]
Title: A nonparametric approach for relevance determination
Abstract: The problem of evaluating a large number of factors in terms of their relevance to an outcome of interest arises in many research areas such as genetics, image processing, astrophysics, and neuroscience. In this paper, we argue that treating such problems as large-scale hypothesis testing does not reflect the usual motivation behind these studies, which is to select a subset of promising factors for further investigation, and leads investigators to rely on arbitrary selection mechanisms (e.g., setting the false discovery rate at 0.05) or unrealistic loss functions. Moreover, while we might be able to justify simplifying assumptions (e.g., parametric distributional forms for test statistics under the null and alternative hypotheses) for classic hypothesis testing situations (i.e., one hypothesis at a time), generalizing such assumptions to large-scale studies is restrictive and unnecessary. In accordance with the objective of such studies, we propose to treat them as relevance determination problems. This way, we are not constrained by the hypothesis testing framework. Moreover, instead of simply dividing factors into relevant and irrelevant groups, we propose a flexible Bayesian model that allows the relevant group to be divided into subgroups, each with a different degree of relevance. We do not fix the number of these subgroups and treat it as an unknown parameter, which could possibly be infinite. We therefore model the effect parameters for all factors as a mixture of a simple distribution for the irrelevant group and a Dirichlet process mixture distribution for the relevant group. Using simulated data, we show that our model performs substantially better than alternative methods such as those based on the false discovery rate. We also apply our method to two real large-scale studies. The objective of the first study is to interrogate the mutation status of p53 in cancer cell lines. The second study aims at identifying differentially expressed genes between two types of leukemia.
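As a rough illustration of the prior structure described in the abstract, the sketch below draws effect parameters from a two-part mixture: a narrow zero-centered normal for the irrelevant group, and a truncated stick-breaking approximation to a Dirichlet process mixture of normals for the relevant group. This is not the paper's specification. The distributional choices, the truncation level, and every name and default value (`sample_effects`, `p_relevant`, `alpha`, `spike_sd`, `base_sd`, `within_sd`, `max_sticks`) are assumptions made for illustration only.

```python
import numpy as np


def sample_effects(n_factors, p_relevant=0.1, alpha=1.0, spike_sd=0.05,
                   base_sd=2.0, within_sd=0.1, max_sticks=50, seed=0):
    """Draw effect parameters from an illustrative spike + DP-mixture prior.

    All distributional choices here are assumptions, not the paper's model.
    """
    rng = np.random.default_rng(seed)

    # Truncated stick-breaking weights approximating a Dirichlet process.
    betas = rng.beta(1.0, alpha, size=max_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    weights /= weights.sum()  # renormalize because of the truncation

    # One center per potential subgroup, drawn from an assumed normal base measure.
    centers = rng.normal(0.0, base_sd, size=max_sticks)

    # Each factor is relevant with probability p_relevant.
    relevant = rng.random(n_factors) < p_relevant

    # Relevant factors pick a subgroup; irrelevant ones are marked with -1.
    subgroup = np.full(n_factors, -1)
    subgroup[relevant] = rng.choice(max_sticks, size=relevant.sum(), p=weights)

    # Irrelevant effects: narrow normal around zero (the assumed "simple" part).
    effects = rng.normal(0.0, spike_sd, size=n_factors)

    # Relevant effects: scattered around their subgroup's center.
    idx = np.flatnonzero(relevant)
    effects[idx] = rng.normal(centers[subgroup[idx]], within_sd)
    return effects, relevant, subgroup
```

Calling `sample_effects(1000)` returns the simulated effects, a boolean relevance mask, and subgroup labels. Because the stick-breaking weights decay quickly for small `alpha`, only a handful of subgroups are typically occupied even though the truncation allows up to `max_sticks`, which mirrors the idea of leaving the number of subgroups unfixed.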
Submission history
From: Babak Shahbaba
[v1] Thu, 11 Mar 2010 19:11:09 UTC (387 KB)
[v2] Tue, 20 Apr 2010 23:19:38 UTC (410 KB)
[v3] Thu, 1 Mar 2012 03:18:03 UTC (345 KB)