A nonparametric approach for relevance determination

Shahbaba, Babak

Abstract: The problem of evaluating a large number of factors in terms of their relevance to an outcome of interest arises in many research areas such as genetics, image processing, astrophysics, and neuroscience. In this paper, we argue that treating such problems as large-scale hypothesis testing does not reflect the usual motivation behind these studies, which is to select a subset of promising factors for further investigation, and leads investigators to rely on arbitrary selection mechanisms (e.g., setting the false discovery rate at 0.05) or unrealistic loss functions. Moreover, while we might be able to justify simplifying assumptions (e.g., parametric distributional forms for test statistics under the null and alternative hypotheses) for classic hypothesis testing situations (i.e., one hypothesis at a time), generalizing such assumptions to large-scale studies is restrictive and unnecessary. In accordance with the objective of such studies, we propose to treat them as relevance determination problems. This way, we are not constrained by the hypothesis testing framework. Moreover, instead of simply dividing factors into relevant and irrelevant groups, we propose a flexible Bayesian model that allows the relevant group to be divided into subgroups, each with a different degree of relevance. We do not fix the number of these subgroups and treat it as an unknown parameter, which could possibly be infinite. We therefore model the effect parameters for all factors as a mixture of a simple distribution for the irrelevant group and a Dirichlet process mixture distribution for the relevant group. Using simulated data, we show that our model performs substantially better than alternative methods such as those based on the false discovery rate. We also apply our method to two real large-scale studies. The objective of the first study is to interrogate the mutation status of p53 in cancer cell lines. The second study aims at identifying differentially expressed genes between two types of leukemia.
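
The prior structure described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's implementation. It assumes a narrow zero-centered normal as the "simple distribution" for the irrelevant group, a normal base distribution and normal components for the Dirichlet process mixture, and a Chinese restaurant process to assign relevant factors to subgroups. The function name `sample_effects` and all parameters (`p_relevant`, `alpha`, `null_sd`, `base_sd`, `within_sd`) are hypothetical choices made for illustration only.

```python
import numpy as np

def sample_effects(n_factors, p_relevant=0.1, alpha=1.0,
                   null_sd=0.1, base_sd=3.0, within_sd=0.3, seed=0):
    """Draw effect parameters from a two-part mixture prior (illustrative only).

    Irrelevant factors: N(0, null_sd^2)  (assumed form of the simple distribution).
    Relevant factors: a Dirichlet process mixture of normals, sampled through
    the Chinese restaurant process with concentration `alpha`.
    """
    rng = np.random.default_rng(seed)
    relevant = rng.random(n_factors) < p_relevant

    # Start with every factor drawn from the irrelevant-group distribution.
    effects = rng.normal(0.0, null_sd, size=n_factors)
    labels = np.full(n_factors, -1)  # -1 marks an irrelevant factor

    cluster_means = []  # one mean per subgroup of relevant factors
    cluster_sizes = []  # number of factors currently in each subgroup

    for i in np.flatnonzero(relevant):
        # Join an existing subgroup with probability proportional to its size,
        # or open a new one with probability proportional to alpha.
        weights = np.array(cluster_sizes + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(cluster_means):
            cluster_means.append(rng.normal(0.0, base_sd))  # draw from base
            cluster_sizes.append(0)
        cluster_sizes[k] += 1
        labels[i] = k
        # Overwrite with a draw from the subgroup's mixture component.
        effects[i] = rng.normal(cluster_means[k], within_sd)

    return effects, labels, np.array(cluster_means)

effects, labels, cluster_means = sample_effects(1000)
print("relevant factors:", int((labels >= 0).sum()))
print("subgroups among relevant factors:", len(cluster_means))
```

The number of subgroups is not fixed in advance: it grows with the number of relevant factors at a rate governed by `alpha`, which mirrors the abstract's point that the number of subgroups is treated as unknown.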

Subjects:	Methodology (stat.ME); Statistics Theory (math.ST); Quantitative Methods (q-bio.QM)
Cite as:	arXiv:1003.2390 [stat.ME]
	(or arXiv:1003.2390v1 [stat.ME] for this version)
	https://doi.org/10.48550/arXiv.1003.2390
