WO2004065545A2

WO2004065545A2 - Diagnosis and prognosis of breast cancer patients

Info

Publication number: WO2004065545A2
Application number: PCT/US2004/001100
Authority: WO
Inventors: Laura Johanna Van't Veer; Yudong He
Original assignee: Rosetta Inpharmatics Llc.; The Netherlands Cancer Institute
Priority date: 2003-01-15
Filing date: 2004-01-15
Publication date: 2004-08-05
Also published as: JP4619350B2; US20040058340A1; EP1590433A2; CA2513642A1; JP2006519591A; WO2004065545A3; EP1590433A4; US7171311B2

Abstract

The present invention relates to genetic markers whose expression is correlated with breast cancer. Specifically, the invention provides sets of markers whose expression patterns can be used to differentiate clinical conditions associated with breast cancer, such as the presence or absence of the estrogen receptor ESR1, and BRCA1 and sporadic tumors, and to provide information on the likelihood of tumor distant metastases within five years of initial diagnosis. The invention relates to methods of using these markers to distinguish these conditions. The invention also provides methods of classifying and treating patients based on prognosis. The invention also relates to kits containing ready-to-use microarrays and computer software for data analysis using the diagnostic, prognostic and statistical methods disclosed herein.

Description

DIAGNOSIS AND PROGNOSIS OF BREAST CANCER PATIENTS

This application claims priority to United States Application No. 10/342,887, filed January 15, 2003, which is incorporated by reference herein in its entirety.

This application includes a Sequence Listing submitted on compact disc, recorded on two compact discs, including one duplicate, containing Filename 9301188228.txt, of size 6,634,308 bytes, created January 14, 2004. The sequence listing on the compact discs is incorporated by reference herein in its entirety.

1. FIELD OF THE INVENTION

The present invention relates to the identification of marker genes useful in the diagnosis and prognosis of breast cancer. More particularly, the invention relates to the identification of a set of marker genes associated with breast cancer, a set of marker genes differentially expressed in estrogen receptor (+) versus estrogen receptor (-) tumors, a set of marker genes differentially expressed in BRCA1 versus sporadic tumors, and a set of marker genes differentially expressed in sporadic tumors from patients with good clinical prognosis (i.e., metastasis- or disease-free in at least 5 years of follow-up time since diagnosis) versus patients with poor clinical prognosis (i.e., metastasis or disease occurred within 5 years since diagnosis). For each of the marker sets above, the invention further relates to methods of distinguishing the breast cancer-related conditions. The invention further provides methods for determining the course of treatment of a patient with breast cancer.

2. BACKGROUND OF THE INVENTION

The increased number of cancer cases reported in the United States, and, indeed, around the world, is a major concern. Currently there are only a handful of treatments available for specific types of cancer, and these provide no guarantee of success. In order to be most effective, these treatments require not only an early detection of the malignancy, but a reliable assessment of the severity of the malignancy.

The incidence of breast cancer, a leading cause of death in women, has been gradually increasing in the United States over the last thirty years. Its cumulative risk is relatively high; 1 in 8 women are expected to develop some type of breast cancer by age 85 in the United States. In fact, breast cancer is the most common cancer in women and the second most common cause of cancer death in the United States. In 1997, it was estimated that 181,000 new cases were reported in the U.S., and that 44,000 people would die of breast cancer (Parker et al., CA Cancer J. Clin. 47:5-27 (1997); Chu et al., J. Nat. Cancer Inst. 88:1571-1579 (1996)). While the mechanism of tumorigenesis for most breast carcinomas is largely unknown, there are genetic factors that can predispose some women to developing breast cancer (Miki et al., Science 266:66-71 (1994)). The discovery and characterization of BRCA1 and BRCA2 has recently expanded our knowledge of genetic factors which can contribute to familial breast cancer. Germ-line mutations within these two loci are associated with a 50 to 85% lifetime risk of breast and/or ovarian cancer (Casey, Curr. Opin. Oncol. 9:88-93 (1997); Marcus et al., Cancer 77:697-709 (1996)). Only about 5% to 10% of breast cancers are associated with breast cancer susceptibility genes, BRCA1 and BRCA2. The cumulative lifetime risk of breast cancer for women who carry the mutant BRCA1 is predicted to be approximately 92%, while the cumulative lifetime risk for the non-carrier majority is estimated to be approximately 10%. BRCA1 is a tumor suppressor gene that is involved in DNA repair and cell cycle control, which are both important for the maintenance of genomic stability. More than 90% of all mutations reported so far result in a premature truncation of the protein product with abnormal or abolished function. The histology of breast cancer in BRCA1 mutation carriers differs from that in sporadic cases, but mutation analysis is the only way to find the carrier.

Like BRCA1, BRCA2 is involved in the development of breast cancer, and like BRCA1 plays a role in DNA repair. However, unlike BRCA1, it is not involved in ovarian cancer.

Other genes have been linked to breast cancer, for example c-erb-2 (HER2) and p53 (Beenken et al., Ann. Surg. 233(5):630-638 (2001)). Overexpression of c-erb-2 (HER2) and p53 has been correlated with poor prognosis (Rudolph et al., Hum. Pathol. 32(3):311-319 (2001)), as have aberrant expression products of mdm2 (Lukas et al., Cancer Res. 61(7):3212-3219 (2001)) and cyclin1 and p27 (Porter & Roberts, International Publication WO98/33450, published August 6, 1998). However, no other clinically useful markers consistently associated with breast cancer have been identified. Sporadic tumors, those not currently associated with a known germline mutation, constitute the majority of breast cancers. It is likely that other, non-genetic factors also have a significant effect on the etiology of the disease. Regardless of the cancer's origin, breast cancer morbidity and mortality increase significantly if it is not detected early in its progression. Thus, considerable effort has focused on the early detection of cellular transformation and tumor formation in breast tissue.

A marker-based approach to tumor identification and characterization promises improved diagnostic and prognostic reliability. Typically, the diagnosis of breast cancer requires histopathological proof of the presence of the tumor. In addition to diagnosis, histopathological examinations also provide information about prognosis and selection of treatment regimens. Prognosis may also be established based upon clinical parameters such as tumor size, tumor grade, the age of the patient, and lymph node metastasis. Diagnosis and/or prognosis may be determined to varying degrees of effectiveness by direct examination of the outside of the breast, or through mammography or other X-ray imaging methods (Jatoi, Am. J. Surg. 177:518-524 (1999)). The latter approach is not without considerable cost, however. Every time a mammogram is taken, the patient incurs a small risk of having a breast tumor induced by the ionizing properties of the radiation used during the test. In addition, the process is expensive and the subjective interpretations of a technician can lead to imprecision. For example, one study showed major clinical disagreements for about one-third of a set of mammograms that were interpreted individually by a surveyed group of radiologists. Moreover, many women find that undergoing a mammogram is a painful experience. Accordingly, the National Cancer Institute has not recommended mammograms for women under fifty years of age, since this group is not as likely to develop breast cancers as are older women. It is compelling to note, however, that while only about 22% of breast cancers occur in women under fifty, data suggest that breast cancer is more aggressive in premenopausal women. In clinical practice, accurate diagnosis of various subtypes of breast cancer is important because treatment options, prognosis, and the likelihood of therapeutic response all vary broadly depending on the diagnosis.
Accurate prognosis, or determination of distant metastasis-free survival, could allow the oncologist to tailor the administration of adjuvant chemotherapy, with women having poorer prognoses being given the most aggressive treatment. Furthermore, accurate prediction of poor prognosis would greatly impact clinical trials for new breast cancer therapies, because potential study patients could then be stratified according to prognosis. Trials could then be limited to patients having poor prognosis, in turn making it easier to discern if an experimental therapy is efficacious. To date, no set of satisfactory predictors for prognosis based on the clinical information alone has been identified. The detection of BRCA1 or BRCA2 mutations represents a step towards the design of therapies to better control and prevent the appearance of these tumors. However, there is no equivalent means for the diagnosis of patients with sporadic tumors, the most common type of breast cancer tumor, nor is there a means of differentiating subtypes of breast cancer.

Adjuvant systemic therapy has been shown to substantially improve the disease-free and overall survival in both premenopausal and postmenopausal women up to age 70 with lymph node negative and lymph node positive breast cancer. See Early Breast Cancer Trialists' Collaborative Group, Lancet 352(9132):930-942 (1998); Early Breast Cancer Trialists' Collaborative Group, Lancet 351(9114):1451-1467 (1998). The absolute benefit from adjuvant treatment is larger for patients with poor prognostic features and this has resulted in the policy to select only these so-called 'high-risk' patients for adjuvant chemotherapy. Goldhirsch et al., Meeting highlights: International Consensus Panel on the Treatment of Primary Breast Cancer, Seventh International Conference on Adjuvant Therapy of Primary Breast Cancer, J. Clin. Oncol. 19(18):3817-3827 (2001); Eifel et al., National Institutes of Health Consensus Development Conference Statement: Adjuvant Therapy for Breast Cancer, November 1-3, 2000, J. Natl. Cancer Inst. 93(13):979-989 (2001). Accepted prognostic and predictive factors in breast cancer include age, tumor size, axillary lymph node status, histological tumor type, pathological grade and hormone receptor status. A large number of other factors have been investigated for their potential to predict disease outcome, but these have in general only limited predictive power. Isaacs et al., Semin. Oncol. 28(1):53-67 (2001).

Using gene expression profiling with cDNA microarrays, Perou et al. showed that there are several subgroups of breast cancer patients based on unsupervised cluster analysis: those of "basal type" and those of "luminal type." Perou et al., Nature 406(6797):747-752 (2000). These subgroups differ with respect to outcome of disease in patients with locally advanced breast cancer. Sorlie et al., Proc. Natl. Acad. Sci. U.S.A. 98(19):10869-10874 (2001). In addition, microarray analysis has been used to identify diagnostic categories, e.g., BRCA1 and 2 (Hedenfalk et al., N. Engl. J. Med. 344(8):539-548 (2001); van 't Veer et al., Nature 415(6871):530-536 (2002)); estrogen receptor (Perou, supra; van 't Veer, supra; Gruvberger et al., Cancer Res. 61(16):5979-5984 (2001)) and lymph node status (West et al., Proc. Natl. Acad. Sci. U.S.A. 98(20):11462-11467 (2001); Ahr et al., Lancet 359(9301):131-132 (2002)).

3. SUMMARY OF THE INVENTION

The invention provides gene marker sets that distinguish various types and subtypes of breast cancer, and methods of use therefor. In one embodiment, the invention provides a method for classifying a cell sample as ER(+) or ER(-) comprising detecting a difference in the expression of a first plurality of genes relative to a control, said first plurality of genes consisting of at least 5 of the genes corresponding to the markers listed in Table 1. In specific embodiments, said plurality of genes consists of at least 50, 100, 200, 500, 1000, up to 2,460 of the gene markers listed in Table 1. In another specific embodiment, said plurality of genes consists of each of the genes corresponding to the 2,460 markers listed in Table 1. In another specific embodiment, said plurality consists of the 550 markers listed in Table 2. In another specific embodiment, said control comprises nucleic acids derived from a pool of tumors from individual sporadic patients.
In another specific embodiment, said detecting comprises the steps of: (a) generating an ER(+) template by hybridization of nucleic acids derived from a plurality of ER(+) patients within a plurality of sporadic patients against nucleic acids derived from a pool of tumors from individual sporadic patients; (b) generating an ER(-) template by hybridization of nucleic acids derived from a plurality of ER(-) patients within said plurality of sporadic patients against nucleic acids derived from said pool of tumors from individual sporadic patients within said plurality; (c) hybridizing nucleic acids derived from an individual sample against said pool; and (d) determining the similarity of marker gene expression in the individual sample to the ER(+) template and the ER(-) template, wherein if said expression is more similar to the ER(+) template, the sample is classified as ER(+), and if said expression is more similar to the ER(-) template, the sample is classified as ER(-). The invention further provides the above methods, applied to the classification of samples as BRCA1 or sporadic, and classifying patients as having good prognosis or poor prognosis. For the BRCA1/sporadic gene markers, the invention provides that the method may be used wherein the plurality of genes is at least 5, 20, 50, 100, 200 or 300 of the BRCA1/sporadic markers listed in Table 3. In a specific embodiment, the optimum 100 markers listed in Table 4 are used. For the prognostic markers, the invention provides that at least 5, 20, 50, 100, or 200 gene markers listed in Table 5 may be used. In a specific embodiment, the optimum 70 markers listed in Table 6 are used.

The invention further provides that markers may be combined. Thus, in one embodiment, at least 5 markers from Table 1 are used in conjunction with at least 5 markers from Table 3. In another embodiment, at least 5 markers from Table 5 are used in conjunction with at least 5 markers from Table 3. In another embodiment, at least 5 markers from Table 1 are used in conjunction with at least 5 markers from Table 5. In another embodiment, at least 5 markers from each of Tables 1, 3, and 5 are used simultaneously.

The invention further provides a method for classifying a sample as ER(+) or ER(-) by calculating the similarity between the expression of at least 5 of the markers listed in Table 1 in the sample to the expression of the same markers in an ER(-) nucleic acid pool and an ER(+) nucleic acid pool, comprising the steps of: (a) labeling nucleic acids derived from a sample, with a first fluorophore to obtain a first pool of fluorophore-labeled nucleic acids; (b) labeling with a second fluorophore a first pool of nucleic acids derived from two or more ER(+) samples, and a second pool of nucleic acids derived from two or more ER(-) samples; (c) contacting said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid with said first microarray under conditions such that hybridization can occur, and contacting said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid with said second microarray under conditions such that hybridization can occur, detecting at each of a plurality of discrete loci on the first microarray a first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a second fluorescent emission signal from said first pool of second fluorophore-labeled genetic matter that is bound to said first microarray under said conditions, and detecting at each of the marker loci on said second microarray said first fluorescent emission signal from said first fluorophore-labeled nucleic acid and a third fluorescent emission signal from said second pool of second fluorophore-labeled nucleic acid; (d) determining the similarity of the sample to the ER(-) and ER(+) pools by comparing said first fluorescence emission signals and said second fluorescence emission signals, and said first emission signals and said third fluorescence emission signals; and (e) classifying the sample as ER(+) where the first fluorescence emission signals are more similar to said second fluorescence emission signals than to said third fluorescent emission signals, and classifying the sample as ER(-) where the first fluorescence emission signals are more similar to said third fluorescence emission signals than to said second fluorescent emission signals, wherein said similarity is defined by a statistical method. The invention further provides that the other disclosed marker sets may be used in the above method to distinguish BRCA1 from sporadic tumors, and patients with poor prognosis from patients with good prognosis. In a specific embodiment, said similarity is calculated by determining a first sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said first pool of second fluorophore-labeled nucleic acid, and a second sum of the differences of expression levels for each marker between said first fluorophore-labeled nucleic acid and said second pool of second fluorophore-labeled nucleic acid, wherein if said first sum is greater than said second sum, the sample is classified as ER(-), and if said second sum is greater than said first sum, the sample is classified as ER(+). In another specific embodiment, said similarity is calculated by computing a first classifier parameter P₁ between an ER(+) template and the expression of said markers in said sample, and a second classifier parameter P₂ between an ER(-) template and the expression of said markers in said sample, wherein said P₁ and P₂ are calculated according to the formula:

$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\lVert \vec{z}_i \rVert \, \lVert \vec{y} \rVert} \qquad \text{Equation (1)}$$

wherein $\vec{z}_1$ and $\vec{z}_2$ are ER(-) and ER(+) templates, respectively, and are calculated by averaging said second fluorescence emission signal for each of said markers in said first pool of second fluorophore-labeled nucleic acid and said third fluorescence emission signal for each of said markers in said second pool of second fluorophore-labeled nucleic acid, respectively, and wherein $\vec{y}$ is said first fluorescence emission signal of each of said markers in the sample to be classified as ER(+) or ER(-), wherein the expression of the markers in the sample is similar to ER(+) if P₁ < P₂, and similar to ER(-) if P₁ > P₂.
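
The template comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it reads Equation (1) as a normalized inner product between a template and the sample, it treats each profile as a plain list of per-marker values, and the function names (`cosine_correlation`, `mean_template`, `classify_er`) and the tie-handling rule are hypothetical, not taken from the specification.

```python
import math


def cosine_correlation(z, y):
    """Normalized inner product dot(z, y) / (|z| * |y|), assumed form of Equation (1)."""
    dot = sum(a * b for a, b in zip(z, y))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_z == 0.0 or norm_y == 0.0:
        raise ValueError("template and sample must be non-zero vectors")
    return dot / (norm_z * norm_y)


def mean_template(profiles):
    """Average each marker's value over all profiles in one pool."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def classify_er(sample, er_neg_profiles, er_pos_profiles):
    """Return (label, P1, P2) using z1 = ER(-) template and z2 = ER(+) template."""
    z1 = mean_template(er_neg_profiles)
    z2 = mean_template(er_pos_profiles)
    p1 = cosine_correlation(z1, sample)
    p2 = cosine_correlation(z2, sample)
    # Text: similar to ER(+) if P1 < P2, similar to ER(-) if P1 > P2.
    # Assumption: a tie (P1 == P2) is reported as ER(-); the text does not say.
    label = "ER(+)" if p1 < p2 else "ER(-)"
    return label, p1, p2


# Toy values only; real inputs would be per-marker signals from the arrays.
toy_neg = [[1.0, -1.0, 0.5], [0.8, -1.2, 0.7]]
toy_pos = [[-1.0, 1.0, -0.5], [-0.9, 0.8, -0.4]]
print(classify_er([-0.7, 0.9, -0.3], toy_neg, toy_pos))
```

The same pattern would apply to the BRCA1/sporadic and prognosis marker sets by swapping in the corresponding templates.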

The invention further provides a method for identifying marker genes the expression of which is associated with a particular phenotype. In one embodiment, the invention provides a method for determining a set of marker genes whose expression is associated with a particular phenotype, comprising the steps of: (a) selecting the phenotype having two or more phenotype categories; (b) identifying a plurality of genes wherein the expression of said genes is correlated or anticorrelated with one of the phenotype categories, and wherein the correlation coefficient for each gene is calculated according to the equation:

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\lVert \vec{c} \rVert \, \lVert \vec{r} \rVert} \qquad \text{Equation (2)}$$

wherein $\vec{c}$ is a number representing said phenotype category and $\vec{r}$ is the logarithmic expression ratio across all the samples for each individual gene, wherein if the correlation coefficient has an absolute value of a threshold value or greater, said expression of said gene is associated with the phenotype category, and wherein said plurality of genes is a set of marker genes whose expression is associated with a particular phenotype. The threshold depends upon the number of samples used; the threshold can be calculated as $3 \times 1/\sqrt{n-3}$, where $1/\sqrt{n-3}$ is the distribution width and n = the number of samples. In a specific embodiment where n = 98, said threshold value is 0.3. In a specific embodiment, said set of marker genes is validated by: (a) using a statistical method to randomize the association between said marker genes and said phenotype category, thereby creating a control correlation coefficient for each marker gene; (b) repeating step (a) one hundred or more times to develop a frequency distribution of said control correlation coefficients for each marker gene; (c) determining the number of marker genes having a control correlation coefficient of a threshold value or above, thereby creating a control marker gene set; and (d) comparing the number of control marker genes so identified to the number of marker genes, wherein if the p value of the difference between the number of marker genes and the number of control genes is less than 0.01, said set of marker genes is validated. In another specific embodiment, said set of marker genes is optimized by the method comprising: (a) rank-ordering the genes by amplitude of correlation or by significance of the correlation coefficients, and (b) selecting an arbitrary number of marker genes from the top of the rank-ordered list. The threshold value depends upon the number of samples tested.
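
A minimal Python sketch of the marker selection and randomization check described above follows. Several points are assumptions rather than statements of the specification: Equation (2) is read as a normalized inner product applied to the vectors as given (whether they are mean-centered first is not stated here), phenotype categories are coded as +1 and -1, the randomization simply shuffles category labels, and the empirical p value is one simple way to compare the observed marker count with the control counts. All function names are hypothetical.

```python
import math
import random


def correlation(c, r):
    """dot(c, r) / (|c| * |r|), the assumed form of Equation (2)."""
    dot = sum(ci * ri for ci, ri in zip(c, r))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(ri * ri for ri in r))
    return dot / norm if norm > 0 else 0.0


def threshold_for(n_samples):
    """Threshold 3 x 1/sqrt(n - 3) stated in the text."""
    if n_samples <= 3:
        raise ValueError("need more than 3 samples")
    return 3.0 / math.sqrt(n_samples - 3)


def select_markers(expr, categories, cutoff=None):
    """expr maps gene -> list of log ratios; returns gene -> rho for |rho| >= cutoff."""
    if cutoff is None:
        cutoff = threshold_for(len(categories))
    selected = {}
    for gene, ratios in expr.items():
        rho = correlation(categories, ratios)
        if abs(rho) >= cutoff:
            selected[gene] = rho
    return selected


def permutation_counts(expr, categories, cutoff=None, n_perm=100, seed=0):
    """Count genes passing the cutoff after shuffling category labels, n_perm times."""
    rng = random.Random(seed)
    shuffled = list(categories)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        counts.append(len(select_markers(expr, shuffled, cutoff)))
    return counts


def empirical_p(observed_count, control_counts):
    """Assumed comparison: share of control runs with at least as many markers."""
    hits = sum(1 for c in control_counts if c >= observed_count)
    return (hits + 1) / (len(control_counts) + 1)


def top_markers(selected, k):
    """Rank-order by amplitude of correlation and keep the top k genes."""
    return sorted(selected, key=lambda g: abs(selected[g]), reverse=True)[:k]
```

With n = 98, `threshold_for(98)` evaluates to roughly 0.31, in line with the 0.3 value given in the text.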

The invention further provides a method for assigning a person to one of a plurality of categories in a clinical trial, comprising determining for each said person the level of expression of at least five of the prognosis markers listed in Table 6, determining therefrom whether the person has an expression pattern that correlates with a good prognosis or a poor prognosis, and assigning said person to one category in a clinical trial if said person is determined to have a good prognosis, and a different category if that person is determined to have a poor prognosis. The invention further provides a method for assigning a person to one of a plurality of categories in a clinical trial, where each of said categories is associated with a different phenotype, comprising determining for each said person the level of expression of at least five markers from a set of markers, wherein said set of markers includes markers associated with each of said clinical categories, determining therefrom whether the person has an expression pattern that correlates with one of the clinical categories, and assigning said person to one of said categories if said person is determined to have a phenotype associated with that category.

The invention further provides a method of classifying a first cell or organism as having one of at least two different phenotypes, said at least two different phenotypes comprising a first phenotype and a second phenotype, said method comprising: (a) comparing the level of expression of each of a plurality of genes in a first sample from the first cell or organism to the level of expression of each of said genes, respectively, in a pooled sample from a plurality of cells or organisms, said plurality of cells or organisms comprising different cells or organisms exhibiting said at least two different phenotypes, respectively, to produce a first compared value; (b) comparing said first compared value to a second compared value, wherein said second compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said first phenotype to the level of expression of each of said genes, respectively, in said pooled sample; (c) comparing said first compared value to a third compared value, wherein said third compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said second phenotype to the level of expression of each of said genes, respectively, in said pooled sample; (d) optionally carrying out one or more times a step of comparing said first compared value to one or more additional compared values, respectively, each additional compared value being the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having a phenotype different from said first and second phenotypes but included among said at least two different phenotypes, to the level of expression of each of said genes, respectively, in said pooled sample; and (e) determining to which of said second, third and, if present, one or more additional compared values, said first compared value is most similar, wherein said first cell or organism is determined to have the phenotype of the cell or organism used to produce said compared value most similar to said first compared value.

In a specific embodiment of the above method, said compared values are each ratios of the levels of expression of each of said genes. In another specific embodiment, each of said levels of expression of each of said genes in said pooled sample is normalized prior to any of said comparing steps. In another specific embodiment, normalizing said levels of expression is carried out by dividing each of said levels of expression by the median or mean level of expression of each of said genes or dividing by the mean or median level of expression of one or more housekeeping genes in said pooled sample. In a more specific embodiment, said normalized levels of expression are subjected to a log transform and said comparing steps comprise subtracting said log transform from the log of said levels of expression of each of said genes in said sample from said cell or organism. In another specific embodiment, said at least two different phenotypes are different stages of a disease or disorder. In another specific embodiment, said at least two different phenotypes are different prognoses of a disease or disorder. In yet another specific embodiment, said levels of expression of each of said genes, respectively, in said pooled sample or said levels of expression of each of said genes in a sample from said cell or organism characterized as having said first phenotype, said second phenotype, or said phenotype different from said first and second phenotypes, respectively, are stored on a computer. The invention further provides microarrays comprising the disclosed marker sets. In one embodiment, the invention provides a microarray comprising at least 5 markers derived from any one of Tables 1-6, wherein at least 50% of the probes on the microarray are present in any one of Tables 1-6. In more specific embodiments, at least 60%, 70%, 80%, 90%, 95% or 98% of the probes on said microarray are present in any one of Tables 1-6.
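
The normalization and log-ratio comparison described in this paragraph can be illustrated with a short Python sketch. It is a reading under assumptions, not the specified procedure: expression levels are held in dictionaries keyed by gene name, the median is used as the scaling statistic, base-10 logarithms are used, and the individual sample is scaled the same way as the pooled sample (the text only states normalization for the pooled sample). The names `normalize` and `log_ratios` are hypothetical.

```python
import math
from statistics import median


def normalize(levels, reference_genes=None):
    """Divide every level by the median over all genes, or over chosen housekeeping genes."""
    basis = [levels[g] for g in (reference_genes or levels)]
    scale = median(basis)
    if scale <= 0:
        raise ValueError("scaling value must be positive")
    return {gene: value / scale for gene, value in levels.items()}


def log_ratios(sample_levels, pool_levels, reference_genes=None):
    """Per-gene compared value: log10(sample) - log10(pool) after normalization."""
    sample_norm = normalize(sample_levels, reference_genes)  # assumption: sample scaled too
    pool_norm = normalize(pool_levels, reference_genes)
    return {
        gene: math.log10(sample_norm[gene]) - math.log10(pool_norm[gene])
        for gene in pool_levels
    }


# Toy intensities only; gene names are placeholders.
pool = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0}
sample = {"geneA": 50.0, "geneB": 200.0, "geneC": 1600.0}
print(log_ratios(sample, pool))
```

Passing `reference_genes=["geneB"]` would scale by a single housekeeping gene instead of the all-gene median, which corresponds to the alternative named in the text.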

In another embodiment, the invention provides a microarray for distinguishing ER(+) and ER(-) cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 1 or Table 2, wherein at least 50% of the probes on the microarray are present in any one of Table 1 or Table 2. In yet another embodiment, the invention provides a microarray for distinguishing BRCA1-type and sporadic tumor-type cell samples comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 3 or Table 4, wherein at least 50% of the probes on the microarray are present in any one of Table 3 or Table 4.
In still another embodiment, the invention provides a microarray for distinguishing cell samples from patients having a good prognosis and cell samples from patients having a poor prognosis comprising a positionally-addressable array of polynucleotide probes bound to a support, said polynucleotide probes comprising a plurality of polynucleotide probes of different nucleotide sequences, each of said different nucleotide sequences comprising a sequence complementary and hybridizable to a plurality of genes, said plurality consisting of at least 5 of the genes corresponding to the markers listed in Table 5 or Table 6, wherein at least 50% of the probes on the microarray are present in any one of Table 5 or Table 6. The invention further provides for microarrays comprising at least 5, 20, 50, 100, 200, 500, 1,000, 1,250, 1,500, 1,750, or 2,000 of the ER-status marker genes listed in Table 1, at least 5, 20, 50, 100, 200, or 300 of the BRCA1/sporadic marker genes listed in Table 3, or at least 5, 20, 50, 100 or 200 of the prognostic marker genes listed in Table 5, in any combination, wherein at least 50%, 60%, 70%, 80%, 90%, 95% or 98% of the probes on said microarrays are present in Table 1, Table 3 and/or Table 5.

The invention further provides a kit for determining the ER-status of a sample, comprising at least two microarrays each comprising at least 5 of the markers listed in Table 1, and a computer system for determining the similarity of the level of nucleic acid derived from the markers listed in Table 1 in a sample to that in an ER(-) pool and an ER(+) pool, the computer system comprising a processor, and a memory encoding one or more programs coupled to the processor, wherein the one or more programs cause the processor to perform a method comprising computing the aggregate differences in expression of each marker between the sample and ER(-) pool and the aggregate differences in expression of each marker between the sample and ER(+) pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the ER(-) and ER(+) pools, said correlation calculated according to Equation (4). The invention provides for kits able to distinguish BRCA1 and sporadic tumors, and samples from patients with good prognosis from samples from patients with poor prognosis, by inclusion of the appropriate marker gene sets.
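
The "aggregate differences" computation that such a program would perform can be sketched as follows. This is an illustrative reading only: the difference for each marker is taken as an absolute difference (the text says only "differences"), inputs are equal-length lists of per-marker expression values, equal sums are reported as undetermined, and the function names are hypothetical.

```python
def aggregate_difference(sample, pool):
    """Sum of per-marker absolute differences between a sample and a pool profile."""
    if len(sample) != len(pool):
        raise ValueError("sample and pool must cover the same markers")
    return sum(abs(s - p) for s, p in zip(sample, pool))


def classify_by_aggregate_difference(sample, er_pos_pool, er_neg_pool):
    """Assign the sample to whichever pool it differs from less."""
    first_sum = aggregate_difference(sample, er_pos_pool)   # distance to ER(+) pool
    second_sum = aggregate_difference(sample, er_neg_pool)  # distance to ER(-) pool
    if first_sum > second_sum:
        return "ER(-)"
    if second_sum > first_sum:
        return "ER(+)"
    return "undetermined"  # assumption: the text does not address equal sums


# Toy profiles only.
print(classify_by_aggregate_difference([1.0, -0.5], [0.9, -0.4], [-1.0, 0.6]))
```

Replacing the two pools with BRCA1 and sporadic pools, or with good and poor prognosis pools, gives the corresponding classifiers for the other marker sets.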
The invention further provides a kit for determining whether a sample is derived from a patient having a good prognosis or a poor prognosis, comprising at least one microarray comprising probes to at least 5 of the genes corresponding to the markers listed in Table 5, and a computer readable medium having recorded thereon one or more programs for determining the similarity of the level of nucleic acid derived from the markers listed in Table 5 in a sample to that in a pool of samples derived from individuals having a good prognosis and a pool of samples derived from individuals having a poor prognosis, wherein the one or more programs cause a computer to perform a method comprising computing the aggregate differences in expression of each marker between the sample and the good prognosis pool and the aggregate differences in expression of each marker between the sample and the poor prognosis pool, or a method comprising determining the correlation of expression of the markers in the sample to the expression in the good prognosis and poor prognosis pools, said correlation calculated according to Equation (3). The invention further provides a method for classifying a breast cancer patient according to prognosis, comprising: (a) comparing the respective levels of expression of at least five genes for which markers are listed in Table 5 in a cell sample taken from said breast cancer patient to respective control levels of expression of said at least five genes; and (b) classifying said breast cancer patient according to prognosis of his or her breast cancer based on the similarity between said levels of expression in said cell sample and said control levels. In a specific embodiment of this method, step (b) comprises determining whether said similarity exceeds one or more predetermined threshold values of similarity.
In another more specific embodiment of this method, said control levels are the mean levels of expression of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have no distant metastases within five years of initial diagnosis. In another specific embodiment of this method, said control levels comprise the expression levels of said genes in breast cancer patients who have had no distant metastases within five years of initial diagnosis. In another specific embodiment of this method, said control levels comprise, for each of said at least five genes, mean log intensity values stored on a computer. In another specific embodiment of this method, said control levels comprise, for each of said at least five genes, the mean log intensity values that are listed in Table 7. In another specific embodiment of this method, said comparing step (a) comprises comparing the respective levels of expression of at least ten of said genes for which markers are listed in Table 5 in said cell sample to said respective control levels of said at least ten of said genes, wherein said control levels of expression of said at least ten genes are the average expression levels of each of said at least ten genes in a pool of tumor samples obtained from breast cancer patients who have had no distant metastases within five years of initial diagnosis. In another specific embodiment of this method, said comparing step (a) comprises comparing the respective levels of expression of at least 25 of said genes for which markers are listed in Table 5 in said cell sample to said respective control levels of expression of said at least 25 genes, wherein said control levels of expression of said at least 25 genes are the average expression levels of each of said at least 25 genes in a pool of tumor samples obtained from breast cancer patients who have had no distant metastases within five years of initial diagnosis.
In another specific embodiment of this method, said comparing step (a) comprises comparing the respective levels of expression of each of said genes for which markers are listed in Table 6 in said cell sample to said respective control levels of expression of each of said genes for which markers are listed in Table 6, wherein said control levels of expression of each of said genes for which markers are listed in Table 6 are the average expression levels of each of said genes in a pool of tumor samples obtained from breast cancer patients who have had no distant metastases within five years of initial diagnosis.

The invention further provides for a method for classifying a breast cancer patient according to prognosis, comprising: (a) determining the similarity between the level of expression of each of at least five genes for which markers are listed in Table 5 in a cell sample taken from said breast cancer patient, to control levels of expression for each respective said at least five genes to obtain a patient similarity value; (b) providing selected first and second threshold values of similarity of said level of expression of each of said at least five genes to said control levels of expression to obtain first and second similarity threshold values, respectively, wherein said second similarity threshold indicates greater similarity to said control than does said first similarity threshold; and (c) classifying said breast cancer patient as having a first prognosis if said patient similarity value exceeds said first and said second similarity threshold values, a second prognosis if said level of expression of said genes exceeds said first similarity threshold value but does not exceed said second similarity threshold value, and a third prognosis if said level of expression of said genes does not exceed said first similarity threshold value or said second similarity threshold value. A specific embodiment of this method comprises determining, prior to step (a), said level of expression of said at least five genes. In another specific embodiment of this method, said determining in step (a) is carried out by a method comprising determining the degree of similarity between the level of expression of each of said at least five genes in a sample taken from said breast cancer patient to the level of expression of each of said at least five genes in a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis.
In another specific embodiment of this method, said determining in step (a) is carried out by a method comprising determining the difference between the absolute expression level of each of said at least five genes and the average expression level of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis. In another specific embodiment of this method, said first threshold value and said second threshold value are coefficients of correlation to the mean expression level of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis. In a more specific embodiment of this method, said first threshold similarity value and said second threshold similarity values are selected by a method comprising: (a) rank ordering in descending order said tumor samples that compose said pool of tumor samples by the degree of similarity between the level of expression of each said at least five genes in each of said tumor samples to the mean level of expression of said at least five genes of the remaining tumor samples that compose said pool to obtain a rank-ordered list, said degree of similarity being expressed as a similarity value; (b) determining an acceptable number of false negatives in said classifying step, wherein a false negative is a breast cancer patient for whom the expression levels of said at least five genes in said cell sample predicts that said breast cancer patient will have no distant metastases within the first five years after initial diagnosis, but who has had a distant metastasis within the first five years after initial diagnosis; (c) determining a similarity value above which in said rank ordered list fewer than said acceptable number of tumor samples are false negatives; (d) selecting said similarity value determined in step (c) as said first threshold similarity value; and (e) selecting a second similarity value, greater than said first similarity value, as said second threshold similarity value. In an even more specific embodiment of this method, said second threshold similarity value is selected in step (e) by a method comprising determining which of said tumor samples, taken from said breast cancer patients having a distant metastasis within the first five years after initial diagnosis, in said rank ordered list has the greatest similarity value, and selecting said greatest similarity value as said second threshold similarity value. In another even more specific embodiment of this method, said first and second threshold similarity values are correlation coefficients, and said first threshold similarity value is 0.4 and said second threshold similarity value is greater than 0.4. In another even more specific embodiment of this method, said first and second threshold similarity values are correlation coefficients, and said second threshold similarity value is 0.636.
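
The threshold-selection steps (a) through (e) above can be illustrated with a short sketch. This is only one possible reading of those steps, not the implementation used by the inventors. It assumes each tumor sample has already been reduced to a pair of a similarity value and a flag for distant metastasis within five years, that similarity values are distinct, and that the acceptable number of false negatives is at least one. The function name `select_thresholds` and the sample data are hypothetical.

```python
def select_thresholds(samples, acceptable_false_negatives):
    """Pick the first and second similarity thresholds from a sample pool.

    samples: list of (similarity_value, had_distant_metastasis) pairs
             (hypothetical input format, assumed for this sketch).
    """
    if acceptable_false_negatives < 1:
        raise ValueError("acceptable_false_negatives must be at least 1")

    # Step (a): rank order the samples by similarity, in descending order.
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)

    # Steps (b)-(d): walk down the list and keep the lowest similarity value
    # above which fewer than the acceptable number of samples are false
    # negatives (metastasis samples that would be predicted metastasis-free).
    first = None
    metastases_above = 0
    for similarity, had_metastasis in ranked:
        if metastases_above >= acceptable_false_negatives:
            break
        first = similarity
        if had_metastasis:
            metastases_above += 1

    # Step (e), more specific embodiment: the second threshold is the greatest
    # similarity value found among the samples with a distant metastasis.
    metastasis_values = [s for s, m in ranked if m]
    second = max(metastasis_values) if metastasis_values else None

    # The second threshold is required to be greater than the first.
    if first is not None and second is not None and second <= first:
        raise ValueError("second threshold is not greater than the first")
    return first, second


# Hypothetical pool: (similarity value, distant metastasis within five years)
pool = [(0.9, False), (0.8, True), (0.7, False), (0.6, True),
        (0.5, False), (0.4, True), (0.3, True)]
first_threshold, second_threshold = select_thresholds(pool, 3)
```

With this hypothetical pool and three acceptable false negatives, two metastasis samples lie above the first threshold, and the second threshold is the similarity value of the highest-ranked metastasis sample.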

The invention further provides a method of classifying a breast cancer patient according to prognosis comprising the steps of: (a) contacting first nucleic acids derived from a tumor sample taken from said breast cancer patient, and second nucleic acids derived from two or more tumor samples from breast cancer patients who have had no distant metastases within five years of initial diagnosis, with an array under conditions such that hybridization can occur, said array comprising a positionally- addressable ordered array of polynucleotide probes bound to a solid support, said polynucleotide probes being complementary and hybridizable to at least five of the genes respectively for which markers are listed in Table 5, or the RNA encoded by said genes, and wherein at least 50% of the probes on said array are hybridizable to genes respectively for which markers are listed in Table 5, or to the RNA encoded by said genes; (b) detecting at each of a plurality of discrete loci on said array a first fluorescent emission signal from said first nucleic acids and a second fluorescent emission signal from said second nucleic acids that are bound to said array under said conditions; (c) calculating the similarity between said first fluorescent emission signals and said second fluorescent emission signals across said at least five genes respectively for which markers are listed in Table 5; and (d) classifying said breast cancer patient according to prognosis of his or her breast cancer based on the similarity between said first fluorescent emission signals and said second fluorescent emission signals across said at least five genes respectively for which markers are listed in Table 5.

The invention further provides for methods of assigning a therapeutic regimen to breast cancer patients. In one embodiment, the invention provides a method of assigning a therapeutic regimen to a breast cancer patient, comprising: (a) classifying said patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on the basis of the levels of expression of at least five genes for which markers are listed in Table 5; and (b) assigning said patient a therapeutic regimen, said therapeutic regimen (i) comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or (ii) comprising chemotherapy if said patient has any other combination of lymph node status and expression profile.

The invention also provides a method of assigning a therapeutic regimen to a breast cancer patient, comprising: (a) determining the lymph node status for said patient; (b) determining the level of expression of at least five genes for which markers are listed in Table 5 in a cell sample from said patient, thereby generating an expression profile; (c) classifying said patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on the basis of said expression profile; and (d) assigning said patient a therapeutic regimen, said therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and classification. In a specific embodiment of this method, said therapeutic regimen assigned to lymph node negative patients classified as having an "intermediate prognosis" additionally comprises adjuvant hormonal therapy. In another specific embodiment of this method, said classifying step (c) is carried out by a method comprising: (a) rank ordering in descending order a plurality of breast cancer tumor samples that compose a pool of breast cancer tumor samples by the degree of similarity between the level of expression of said at least five genes in each of said tumor samples and the level of expression of said at least five genes across all remaining tumor samples that compose said pool, said degree of similarity being expressed as a similarity value; (b) determining an acceptable number of false negatives in said classifying step, wherein a false negative is a breast cancer patient for whom the expression levels of said at least five genes in said cell sample predicts that said breast cancer patient will have no distant metastases within the first five years after initial diagnosis, but who has had a distant metastasis within the first five years after initial diagnosis; (c) determining a similarity value above which in said rank ordered list said acceptable number of tumor samples or fewer are false negatives; (d) selecting said similarity value determined in step (c) as a first threshold similarity value; (e) selecting a second similarity value, greater than said first similarity value, as a second threshold similarity value; and (f) determining the similarity between the level of expression of each of said at least five genes in a breast cancer tumor sample from the breast cancer patient and the level of expression of each of said respective at least five genes in said pool, to obtain a patient similarity value, wherein if said patient similarity value equals or exceeds said second threshold similarity value, said patient is classified as having a "very good prognosis"; if said patient similarity value equals or exceeds said first threshold similarity value, but is less than said second threshold similarity value, said patient is classified as having an "intermediate prognosis"; and if said patient similarity value is less than said first threshold similarity value, said patient is classified as having a "poor prognosis." Another specific embodiment of this method comprises determining the estrogen receptor (ER) status of said patient, wherein if said patient is ER positive and lymph node negative, said therapeutic regimen assigned to said patient additionally comprises adjuvant hormonal therapy. In another specific embodiment of this method, said patient is 52 years of age or younger. In another specific embodiment of this method, said patient has stage I or stage II breast cancer. In yet another specific embodiment of this method, said patient is premenopausal.
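
The three-way classification and regimen assignment described above amount to a small decision rule, sketched below for illustration of the logic only. The function names are hypothetical. The threshold values 0.4 and 0.636 are used merely as example inputs taken from the correlation-coefficient embodiments described earlier. The sketch combines the "intermediate prognosis" and ER-status embodiments in one function, and it reads "good prognosis" in the regimen step as the "very good prognosis" class; both are assumptions of this sketch.

```python
VERY_GOOD = "very good prognosis"
INTERMEDIATE = "intermediate prognosis"
POOR = "poor prognosis"


def classify_prognosis(patient_similarity, first_threshold, second_threshold):
    """Three-way classification by comparison with two similarity thresholds."""
    if patient_similarity >= second_threshold:
        return VERY_GOOD
    if patient_similarity >= first_threshold:
        return INTERMEDIATE
    return POOR


def assign_regimen(prognosis, lymph_node_negative, er_positive=False):
    """Return the regimen components implied by the embodiments above."""
    # No adjuvant chemotherapy only for lymph node negative patients with a
    # very good or intermediate prognosis; chemotherapy in every other case.
    if lymph_node_negative and prognosis in (VERY_GOOD, INTERMEDIATE):
        regimen = ["no adjuvant chemotherapy"]
    else:
        regimen = ["chemotherapy"]

    # Adjuvant hormonal therapy is added for lymph node negative patients who
    # are classified as intermediate prognosis or who are ER positive.
    if lymph_node_negative and (prognosis == INTERMEDIATE or er_positive):
        regimen.append("adjuvant hormonal therapy")
    return regimen


# Example inputs: thresholds 0.4 and 0.636, hypothetical patient value 0.5
example_class = classify_prognosis(0.5, 0.4, 0.636)
example_regimen = assign_regimen(example_class, lymph_node_negative=True)
```

In this example the patient similarity value lies between the two thresholds, so the patient falls in the intermediate class and, being lymph node negative, receives no adjuvant chemotherapy but does receive adjuvant hormonal therapy.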

The above methods may be computer-implemented. Thus, in another embodiment, the invention provides a computer program product for classifying a breast cancer patient according to prognosis, the computer program product for use in conjunction with a computer having a memory and a processor, the computer program product comprising a computer readable storage medium having a computer program encoded thereon, wherein said computer program product can be loaded into the one or more memory units of a computer and causes the one or more processor units of the computer to execute the steps of: (a) receiving a first data structure comprising the respective levels of expression of each of at least five genes for which markers are listed in Table 5 in a cell sample taken from said patient; (b) determining the similarity of the level of expression of each of said at least five genes to respective control levels of expression of said at least five genes to obtain a patient similarity value; (c) comparing said patient similarity value to selected first and second threshold values of similarity of said respective levels of expression of each of said at least five genes to said respective control levels of expression of said at least five genes, wherein said second threshold value of similarity indicates greater similarity to said respective control levels of expression of said at least five genes than does said first threshold value of similarity; and (d) classifying said patient as having a first prognosis if said patient similarity value exceeds said first and said second threshold similarity values; a second prognosis if said patient similarity value exceeds said first threshold similarity value but does not exceed said second threshold similarity value; and a third prognosis if said patient similarity value does not exceed said first threshold similarity value or said second threshold similarity value. 
In a specific embodiment of the computer program product, said first threshold value of similarity and said second threshold value of similarity are values stored in said computer. In another specific embodiment of the computer program product, said respective control levels of expression of said at least five genes are stored in said computer. In another specific embodiment of the computer program product, said first prognosis is a "very good prognosis"; said second prognosis is an "intermediate prognosis"; and said third prognosis is a "poor prognosis"; wherein said computer program may be loaded into the memory and further cause said one or more processor units of said computer to execute the step of assigning said breast cancer patient a therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and expression profile. In yet another specific embodiment, said computer program may be loaded into the memory and further causes said one or more processor units of the computer to execute the steps of receiving a data structure comprising clinical data specific to said breast cancer patient. In a more specific embodiment, said clinical data includes the lymph node and estrogen receptor (ER) status of said breast cancer patient. In another specific embodiment, said respective control levels of expression of said at least five genes comprise a set of single-channel mean hybridization intensity values for each of said at least five genes, stored on said computer readable storage medium. In a more specific embodiment of this computer program product, said single-channel mean hybridization intensity values are log transformed.
In another specific embodiment of the computer program product, said computer program product causes said processing unit to perform said comparing step (c) by calculating the difference between the level of expression of each of said at least five genes in said cell sample taken from said breast cancer patient and said respective control levels of expression of said at least five genes. In another specific embodiment of the computer program product, said computer program product causes said processing unit to perform said comparing step (c) by calculating the mean log level of expression of each of said at least five genes in said control to obtain a control mean log expression level for each gene, calculating the log expression level for each of said at least five genes in a breast cancer sample from said patient to obtain a patient log expression level, and calculating the difference between the patient log expression level and the control mean log expression for each of said at least five genes. In another specific embodiment of the computer program product, said computer program product causes said processing unit to perform said comparing step (c) by calculating similarity between the level of expression of each of said at least five genes in said cell sample taken from said patient and said respective control levels of expression of said at least five genes, wherein said similarity is expressed as a similarity value. In a more specific embodiment of this computer program product, said similarity value is a correlation coefficient.
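
The log-difference and correlation computations described in these embodiments can be sketched as follows. The sketch rests on several assumptions that are not stated in this passage: base-10 logarithms, strictly positive intensity values, and an ordinary Pearson coefficient as the correlation coefficient (the specification's own correlation formula is given elsewhere as an equation that is not reproduced here). All function names and data values are hypothetical.

```python
import math


def mean_log_control(control_samples):
    """Control mean log expression level for each gene.

    control_samples: list of per-sample lists of positive intensities,
    one value per gene, in the same gene order for every sample.
    """
    n_genes = len(control_samples[0])
    return [
        sum(math.log10(sample[g]) for sample in control_samples) / len(control_samples)
        for g in range(n_genes)
    ]


def log_differences(patient_intensities, control_mean_log):
    """Patient log expression minus control mean log expression, per gene."""
    return [math.log10(x) - m for x, m in zip(patient_intensities, control_mean_log)]


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed form of the similarity value)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


# Hypothetical intensities for five genes in two control samples and one patient
control = [[10, 100, 1000, 10, 100], [1000, 100, 1000, 10, 10000]]
patient = [100, 100, 1000, 10, 1000]

control_mean = mean_log_control(control)
differences = log_differences(patient, control_mean)
similarity = pearson([math.log10(x) for x in patient], control_mean)
```

Here the hypothetical patient profile coincides with the control mean log profile, so every per-gene difference is zero and the correlation coefficient equals one up to floating-point rounding.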

4. BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a Venn-type diagram showing the overlap between the marker sets disclosed herein, including the 2,460 ER markers, the 430 BRCA1/sporadic markers, and the 231 prognosis reporters.

FIG. 2 shows the experimental procedures for measuring differential changes in mRNA transcript abundance in breast cancer tumors used in this study. In each experiment, Cy5-labeled cRNA from one tumor X is hybridized on a 25k human microarray together with a Cy3-labeled cRNA pool made of cRNA samples from tumors 1, 2, . . . N. The digital expression data were obtained by scanning and image processing. The error modeling allowed us to assign a p-value to each transcript ratio measurement.

FIG. 3 Two-dimensional clustering reveals two distinctive types of tumors. The clustering was based on the gene expression data of 98 breast cancer tumors over 4986 significant genes. Dark gray (red) represents up-regulation, light gray (green) represents down-regulation, black indicates no change in expression, and gray indicates that data is not available. 4986 genes were selected that showed a more than two fold change in expression ratios in more than five experiments. Selected clinical data for test results of BRCA1 mutations, estrogen receptor (ER), and progesterone receptor (PR), tumor grade, lymphocytic infiltrate, and angioinvasion are shown at right. Black denotes negative and white denotes positive. The dominant pattern in the lower part consists of 36 patients, out of which 34 are ER-negative (total 39), and 16 are BRCA1-mutation carriers (total 18).

FIG. 4 A portion of unsupervised clustered results as shown in FIG. 3. ESR1 (the estrogen receptor gene) is co-regulated with a set of genes that are strongly co-regulated to form a dominant pattern.

FIG. 5A Histogram of correlation coefficients of significant genes between their expression ratios and estrogen-receptor (ER) status (i.e., ER level). The histogram for experimental data is shown as a gray line. The results of one Monte-Carlo trial are shown in solid black. There are 2,460 genes whose expression data correlate with ER status at a level higher than 0.3 or anti-correlate with ER status at a level lower than -0.3.

FIG. 5B The distribution of the number of genes that satisfied the same selection criteria (amplitude of correlation above 0.3) from 10,000 Monte-Carlo runs. It is estimated that this set of 2,460 genes reports ER status at a confidence level of p > 99.99%.

FIG. 6 Classification Type 1 and Type 2 error rates as a function of the number (out of 2,460) of marker genes used in the classifier. The combined error rate is lowest when approximately 550 marker genes are used.

FIG. 7 Classification of 98 tumor samples as ER(+) or ER(-) based on expression levels of the 550 optimal marker genes. ER(+) samples (above white line) exhibit a clearly different expression pattern than ER(-) samples (below white line).

FIG. 8 Correlation between expression levels in samples from each patient and the average profile of the ER(-) group vs. correlation with the ER(+) group. Squares represent samples from clinically ER(-) patients; dots represent samples from clinically ER(+) patients.

FIG. 9A Histogram of correlation coefficients of gene expression ratio of each significant gene with the BRCA1 mutation status is shown as a solid line. The dashed line indicates a frequency distribution obtained from one Monte-Carlo run. 430 genes exhibited an amplitude of correlation or anti-correlation greater than 0.35.

FIG. 9B Frequency distribution of the number of genes that exhibit an amplitude of correlation or anti-correlation greater than 0.35 for the 10,000 Monte-Carlo run control. Mean = 115; p(n > 430) = 0.48% and p(n > 430/2) = 9.0%.
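
A Monte-Carlo control of the kind summarized in FIGS. 5B and 9B can be sketched as below. This passage does not state how the randomization was performed, so the sketch assumes that the sample labels (for example, mutation status coded as 0 or 1) are randomly permuted in each run, that correlation is measured with a Pearson coefficient, and that the reported probability is the fraction of runs yielding at least as many genes over the cutoff as the real labels do. All names and data values are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def count_markers(expression, labels, cutoff):
    """Number of genes whose correlation amplitude with the labels exceeds cutoff.

    expression: list of per-gene lists, one expression value per sample.
    """
    return sum(1 for gene in expression if abs(pearson(gene, labels)) > cutoff)


def monte_carlo_p_value(expression, labels, cutoff, runs=10000, seed=0):
    """Fraction of label permutations giving at least the observed gene count."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = count_markers(expression, labels, cutoff)
    shuffled = list(labels)
    at_least_observed = 0
    for _ in range(runs):
        rng.shuffle(shuffled)
        if count_markers(expression, shuffled, cutoff) >= observed:
            at_least_observed += 1
    return observed, at_least_observed / runs


# Hypothetical data: three genes measured in six samples, two label groups
genes = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], [1, 5, 2, 6, 3, 4]]
status = [0, 0, 0, 1, 1, 1]
observed_count, p_value = monte_carlo_p_value(genes, status, cutoff=0.8, runs=200)
```

A small resulting fraction indicates that a marker set of the observed size is unlikely to arise from randomly assigned labels, which is the sense in which the figure legends report probabilities for the observed gene counts.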

FIG. 10 Classification type 1 and type 2 error rates as a function of the number of discriminating genes used in the classifier (template). The combined error rate is lowest when approximately 100 discriminating marker genes are used.

FIG. 11A The classification of 38 tumors in the ER(-) group into two subgroups, BRCA1 and sporadic, by using the optimal set of 100 discriminating marker genes. Patients above the white line are characterized by BRCA1-related patterns.

FIG. 11B Correlation between expression levels in samples from each ER(-) patient and the average profile of the BRCA1 group vs. correlation with the sporadic group. Squares represent samples from patients with sporadic-type tumors; dots represent samples from patients carrying the BRCA1 mutation.

FIG. 12A Histogram of correlation coefficients of gene expression ratio of each significant gene with the prognostic category (distant metastases group and no distant metastases group) is shown as a solid line. The distribution obtained from one Monte-Carlo run is shown as a dashed line. The amplitude of correlation or anti-correlation of 231 marker genes is greater than 0.3.

FIG. 12B Frequency distribution of the number of genes whose amplitude of correlation or anti-correlation was greater than 0.3 for 10,000 Monte-Carlo runs.

FIG. 13 The distant metastases group classification error rate for type 1 and type 2 as a function of the number of discriminating genes used in the classifier. The combined error rate is lowest when approximately 70 discriminating marker genes are used.

FIG. 14 Classification of 78 sporadic tumors into two prognostic groups, distant metastases (poor prognosis) and no distant metastases (good prognosis) using the optimal set of 70 discriminating marker genes. Patients above the white line are characterized by good prognosis. Patients below the white line are characterized by poor prognosis.

FIG. 15 Correlation between expression levels in samples from each patient and the average profile of the good prognosis group vs. correlation with the poor prognosis group. Squares represent samples from patients having a poor prognosis; dots represent samples from patients having a good prognosis. Black squares represent the 'reoccurred' patients and the gray dots represent the 'non-reoccurred'. A total of 13 out of 78 were mis-classified.

FIG. 16 The reoccurrence probability as a function of time since diagnosis. Group A and group B were predicted by using a leave-one-out method based on the optimal set of 70 discriminating marker genes. The 43 patients in group A consist of 37 patients from the no distant metastases group and 6 patients from the distant metastases group. The 35 patients in group B consist of 28 patients from the distant metastases group and 7 patients from the no distant metastases group.

FIG. 17 The distant metastases probability as a function of time since diagnosis for ER(+) (yes) or ER(-) (no) individuals.

FIG. 18 The distant metastases probability as a function of time since diagnosis for progesterone receptor (PR)(+) (yes) or PR(-) (no) individuals.

FIG. 19A, B The distant metastases probability as a function of time since diagnosis. Groups were defined by the tumor grades.

FIG. 20A Classification of 19 independent sporadic tumors into two prognostic groups, distant metastases and no distant metastases, using the 70 optimal marker genes. Patients above the white line have a good prognosis. Patients below the white line have a poor prognosis.

FIG. 20B Correlation between expression ratios of each patient and the average expression ratio of the good prognosis group as defined by the training set versus the correlation between expression ratios of each patient and the average expression ratio of the poor prognosis training set. Of nine patients in the good prognosis group, three are from the "distant metastases group"; of ten patients in the poor prognosis group, one patient is from the "no distant metastases group". This error rate of 4 out of 19 is consistent with 13 out of 78 for the initial 78 patients.

FIG. 20C The reoccurrence probability as a function of time since diagnosis for two groups predicted based on expression of the optimal 70 marker genes.

FIG. 21A Sensitivity vs. 1-specificity for good prognosis classification.

FIG. 21B Sensitivity vs. 1-specificity for poor prognosis classification.

FIG. 21C Total error rate as a function of threshold on the modeled likelihood. Six clinical parameters (ER status, PR status, tumor grade, tumor size, patient age, and presence or absence of angioinvasion) were used to perform the clinical modeling.

FIG. 22 Comparison of the log(ratio) of individual samples using the "material sample pool" vs. mean subtracted log(intensity) using the "mathematical sample pool" for 70 reporter genes in the 78 sporadic tumor samples. The "material sample pool" was constructed from the 78 sporadic tumor samples.

FIG. 23A Results of the "leave one out" cross validation based on single channel data. Samples are grouped according to each sample's coefficient of correlation to the average "good prognosis" profile and "poor prognosis" profile for the 70 genes examined. The white line separates samples from patients classified as having poor prognoses (below) and good prognoses (above).

FIG. 23B Scatter plot of coefficients of correlation to the average expression in "good prognosis" samples and "poor prognosis" samples. The false positive rate (i.e., the rate of incorrectly classifying a sample from a patient having a good prognosis as being one from a patient having a poor prognosis) was 10 out of 44, and the false negative rate was 6 out of 34.

FIG. 24A Single-channel hybridization data for samples ranked according to the coefficients of correlation with the good prognosis classifier. Samples classified as "good prognosis" lie above the white line, and those classified as "poor prognosis" lie below.

FIG. 24B Scatterplot of sample correlation coefficients, with three incorrectly classified samples lying to the right of the threshold correlation coefficient value. The threshold correlation value was set at 0.2727 to limit the false negatives to approximately 10% of the samples.

FIG. 25A Gene expression pattern of the 70 optimal prognosis marker genes (see Example 4) for a consecutive series of 295 breast carcinomas. Each row represents a prognostic profile of the 70 marker genes for one tumor and each column represents the relative expression abundance of one gene. Dark gray indicates high mRNA expression in the tumor relative to the reference mRNA (pooled mRNA from all tumor samples); light gray indicates low expression relative to the reference mRNA. The horizontal dotted line is the previously determined separation between good and poor prognosis signature subgroups. Tumors are rank-ordered according to their correlation with the average profile in tumors of good prognosis patients (CI); the most highly correlated tumors lie at the top of the plot.

FIG. 25B Time in years to distant metastases as a first event (black dots) or the time of follow-up for all other patients (gray dots).

FIG. 25C Selected clinical characteristics: lymph node status (black = pN+, white = pN0); metastases as first event (black = yes, white = no); death (black = yes, white = no).

FIGS. 26A-26F Kaplan-Meier plots for the cohort of 295 breast cancer patients. FIG. 26A shows the metastasis-free probability of all 295 patients according to "good prognosis" (n=115, upper line) and "poor prognosis" (n=180, lower line) signature. FIG. 26B shows the overall survival of all 295 patients according to "good prognosis" and "poor prognosis" signature. FIG. 26C shows the metastasis-free probability of lymph node negative patients within the 295 tumor cohort. FIG. 26D shows the overall survival of lymph node negative patients. FIG. 26E shows the metastasis-free probability for lymph node positive patients. FIG. 26F shows the overall survival of lymph node positive patients. For each of the plots, the number of patients who are metastasis-free (FIGS. 26A, C, E) or have survived (FIGS. 26B, D, F), and for whom information is available, at each time point (years) are indicated for "good signature" patients (upper line; upper row of numbers) or "poor signature" patients (lower line; lower row of numbers). For each plot, P indicates the P-value of the log-rank test.

FIGS. 27A-27G Kaplan-Meier plots of the metastasis-free probabilities for 151 lymph node negative breast cancer patients within the 295 tumor cohort. FIG. 27A shows the metastasis-free probabilities of the "good prognosis" and "poor prognosis" groups as identified by molecular profiling using the 70 optimal marker genes (i.e., "good prognosis" and "poor prognosis" signatures; see Example 4). FIG. 27B shows the metastasis-free probabilities of "low-risk" and "high-risk" groups as identified by "St. Gallen" criteria. FIG. 27C shows the metastasis-free probabilities of "low-risk" and "high-risk" signature groups as identified by "NIH consensus" criteria. FIG. 27D shows the "St. Gallen" "high-risk" group (n=129) divided into "good prognosis" and "poor prognosis" signature groups by profiling. FIG. 27E shows the "NIH" "high-risk" group (n=140) divided into "good prognosis" and "poor prognosis" signature groups by profiling. FIG. 27F shows the "St. Gallen" "low-risk" group (n=22) divided into "good prognosis" and "poor prognosis" signature groups by profiling. FIG. 27G shows the "NIH" "low-risk" group (n=11) divided into "good prognosis" and "poor prognosis" signature groups by profiling. Patients at risk at each time point (years; see description of FIG. 26) are indicated in each plot for "good signature" patients (upper line; upper row of numbers) or "poor signature" patients (lower line; lower row of numbers). P indicates the P-value of the log-rank test.

FIGS. 28A-28F Kaplan-Meier plots for 295 breast cancer patients classified into "very good prognosis," "intermediate prognosis," and "poor prognosis" groups. FIG. 28A shows the metastasis-free probability of all 295 patients according to "very good," "intermediate," and "poor prognosis" signature. FIG. 28B shows the overall survival of all 295 patients according to "very good," "intermediate," and "poor prognosis" signature. FIG. 28C shows the metastasis-free probability for lymph node negative patients similarly classified. FIG. 28D shows the overall survival for lymph node negative patients so classified. FIG. 28E shows the metastasis-free probability for lymph node positive patients so classified. FIG. 28F shows the overall survival of lymph node positive patients so classified. Patients at risk at each time point (years; see description of FIG. 26) are indicated in each plot for "very good" signature patients (top line; top row of numbers), "intermediate" signature patients (middle line; middle row of numbers) or "poor prognosis" signature patients (bottom line; bottom row of numbers). P indicates the P-value of the log-rank test.

5. DETAILED DESCRIPTION OF THE INVENTION

5.1 INTRODUCTION

The invention relates to sets of genetic markers whose expression patterns correlate with important characteristics of breast cancer tumors, i.e., estrogen receptor (ER) status, BRCA1 status, and the likelihood of relapse (i.e., distant metastasis or poor prognosis). More specifically, the invention provides for sets of genetic markers that can distinguish the following three clinical conditions. First, the invention relates to sets of markers whose expression correlates with the ER status of a patient, and which can be used to distinguish ER(+) from ER(-) patients. ER status is a useful prognostic indicator, and an indicator of the likelihood that a patient will respond to certain therapies, such as tamoxifen. Also, among women who are ER positive the response rate (over 50%) to hormonal therapy is much higher than the response rate (less than 10%) in patients whose ER status is negative. In patients with ER positive tumors the possibility of achieving a hormonal response is directly proportional to the level of ER (P. Calabresi and P.S. Schein, MEDICAL ONCOLOGY (2ND ED.), McGraw-Hill, Inc., New York (1993)). Second, the invention further relates to sets of markers whose expression correlates with the presence of BRCA1 mutations, and which can be used to distinguish BRCA1-type tumors from sporadic tumors. Third, the invention relates to genetic markers whose expression correlates with clinical prognosis, and which can be used to distinguish patients having good prognoses (i.e., no distant metastases of a tumor within five years) from poor prognoses (i.e., distant metastases of a tumor within five years). Methods are provided for use of these markers to distinguish between these patient groups, and to determine general courses of treatment. Microarrays comprising these markers are also provided, as well as methods of constructing such microarrays. Each marker corresponds to a gene in the human genome, i.e., such marker is identifiable as all or a portion of a gene. Finally, because each of the above markers correlates with certain breast cancer-related conditions, the markers, or the proteins they encode, are likely to be targets for drugs against breast cancer.

5.2 DEFINITIONS

As used herein, "BRCA1 tumor" means a tumor having cells containing a mutation of the BRCA1 locus.

The "absolute amplitude" of correlation expressions means the distance, either positive or negative, from a zero value; i.e., both correlation coefficients -0.35 and 0.35 have an absolute amplitude of 0.35.

"Status" means a state of gene expression of a set of genetic markers whose expression is strongly correlated with a particular phenotype. For example, "ER status" means a state of gene expression of a set of genetic markers whose expression is strongly correlated with that of ESR1 (estrogen receptor gene), wherein the pattern of these genes' expression differs detectably between tumors expressing the receptor and tumors not expressing the receptor.

"Good prognosis" means that a patient is expected to have no distant metastases of a breast tumor within five years of initial diagnosis of breast cancer.

"Poor prognosis" means that a patient is expected to have distant metastases of a breast tumor within five years of initial diagnosis of breast cancer.

"Marker" means an entire gene, or an EST derived from that gene, the expression or level of which changes between certain conditions. Where the expression of the gene correlates with a certain condition, the gene is a marker for that condition.

"Marker-derived polynucleotides" means the RNA transcribed from a marker gene, any cDNA or cRNA produced therefrom, and any nucleic acid derived therefrom, such as synthetic nucleic acid having a sequence derived from the gene corresponding to the marker gene.

A "similarity value" is a number that represents the degree of similarity between two things being compared. For example, a similarity value may be a number that indicates the overall similarity between a patient's expression profile using specific phenotype-related markers and a control specific to that phenotype (for instance, the similarity to a "good prognosis" template, where the phenotype is a good prognosis). The similarity value may be expressed as a similarity metric, such as a correlation coefficient, or may simply be expressed as the expression level difference, or the aggregate of the expression level differences, between a patient sample and a template.

5.3 MARKERS USEFUL IN DIAGNOSIS AND PROGNOSIS OF BREAST CANCER

5.3.1 MARKER SETS

The invention provides a set of 4,986 genetic markers whose expression is correlated with the existence of breast cancer by clustering analysis. A subset of these markers identified as useful for diagnosis or prognosis is listed as SEQ ID NOS: 1-2,699. The invention also provides a method of using these markers to distinguish tumor types in diagnosis or prognosis. In one embodiment, the invention provides a set of 2,460 genetic markers that can classify breast cancer patients by estrogen receptor (ER) status; i.e., distinguish between ER(+) and ER(-) patients or tumors derived from these patients. ER status is an important indicator of the likelihood of a patient's response to some chemotherapies (i.e., tamoxifen). These markers are listed in Table 1. The invention also provides subsets of at least 5, 10, 25, 50, 100, 200, 300, 400, 500, 750, 1,000, 1,250, 1,500, 1,750 or 2,000 genetic markers, drawn from the set of 2,460 markers, which also distinguish ER(+) and ER(-) patients or tumors. Preferably, the number of markers is 550. The invention further provides a set of 550 of the 2,460 markers that are optimal for distinguishing ER status (Table 2). The invention also provides a method of using these markers to distinguish between ER(+) and ER(-) patients or tumors derived therefrom.

In another embodiment, the invention provides a set of 430 genetic markers that can classify ER(-) breast cancer patients by BRCA1 status; i.e., distinguish between tumors containing a BRCA1 mutation and sporadic tumors. These markers are listed in Table 3. The invention further provides subsets of at least 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250, 300 or 350 markers, drawn from the set of 430 markers, which also distinguish between tumors containing a BRCA1 mutation and sporadic tumors. Preferably, the number of markers is 100. A preferred set of 100 markers is provided in Table 4. The invention also provides a method of using these markers to distinguish between BRCA1 and sporadic patients or tumors derived therefrom.

In another embodiment, the invention provides a set of 231 genetic markers that can distinguish between patients with a good breast cancer prognosis (no breast cancer tumor distant metastases within five years) and patients with a poor breast cancer prognosis (tumor distant metastases within five years). These markers are listed in Table 5. The invention also provides subsets of at least 5, 10, 20, 30, 40, 50, 75, 100, 150 or 200 markers, drawn from the set of 231, which also distinguish between patients with good and poor prognosis. A preferred set of 70 markers is provided in Table 6. In a specific embodiment, the set of markers consists of the twelve kinase-related markers and the seven cell division- or mitosis-related markers listed. The invention also provides a method of using the above markers to distinguish between patients with good or poor prognosis. In another embodiment, the invention provides a method of using the prognosis-associated markers to distinguish between patients having a very good prognosis, an intermediate prognosis, and a poor prognosis, and thereby determining the appropriate combination of adjuvant or hormonal therapy.

Table 1. 2,460 gene markers that distinguish ER(+) and ER(-) cell samples.

GenBank Accession Number   SEQ ID NO          GenBank Accession Number   SEQ ID NO
NM_001953                  SEQ ID NO 699      Contig10363_RC             SEQ ID NO 2042
NM_001954                  SEQ ID NO 700      Contig10437_RC             SEQ ID NO 2043
NM_001955                  SEQ ID NO 701      Contig11086_RC             SEQ ID NO 2045
NM_001956                  SEQ ID NO 702      Contig11275_RC             SEQ ID NO 2046
NM_001958                  SEQ ID NO 703      Contig11648_RC             SEQ ID NO 2047
NM_001961                  SEQ ID NO 705      Contig12216_RC             SEQ ID NO 2048
NM_001970                  SEQ ID NO 706      Contig12369_RC             SEQ ID NO 2049
NM_001979                  SEQ ID NO 707      Contig12814_RC             SEQ ID NO 2050
NM_001982                  SEQ ID NO 708      Contig12951_RC             SEQ ID NO 2051
NM_002017                  SEQ ID NO 710      Contig13480_RC             SEQ ID NO 2052
NM_002033                  SEQ ID NO 713      Contig14284_RC             SEQ ID NO 2053
NM_002046                  SEQ ID NO 714      Contig14390_RC             SEQ ID NO 2054
NM_002047                  SEQ ID NO 715      Contig14780_RC             SEQ ID NO 2055
NM_002051                  SEQ ID NO 716      Contig14954_RC             SEQ ID NO 2056
NM_002053                  SEQ ID NO 717      Contig14981_RC             SEQ ID NO 2057
NM_002061                  SEQ ID NO 718      Contig15692_RC             SEQ ID NO 2058
NM_002065                  SEQ ID NO 719      Contig16192_RC             SEQ ID NO 2059
NM_002068                  SEQ ID NO 720      Contig16759_RC             SEQ ID NO 2061
NM_002077                  SEQ ID NO 722      Contig16786_RC             SEQ ID NO 2062
NM_002091                  SEQ ID NO 723      Contig16905_RC             SEQ ID NO 2063
NM_002101                  SEQ ID NO 724      Contig17103_RC             SEQ ID NO 2064
NM_002106                  SEQ ID NO 725      Contig17105_RC             SEQ ID NO 2065
NM_002110                  SEQ ID NO 726      Contig17248_RC             SEQ ID NO 2066
NM_002111                  SEQ ID NO 727      Contig17345_RC             SEQ ID NO 2067
NM_002115                  SEQ ID NO 728      Contig18502_RC             SEQ ID NO 2069
NM_002118                  SEQ ID NO 729      Contig20156_RC             SEQ ID NO 2071
NM_002123                  SEQ ID NO 730      Contig20302_RC             SEQ ID NO 2073
NM_002131                  SEQ ID NO 731      Contig20600_RC             SEQ ID NO 2074
NM_002136                  SEQ ID NO 732      Contig20617_RC             SEQ ID NO 2075
NM_002145                  SEQ ID NO 733      Contig20629_RC             SEQ ID NO 2076
NM_002164                  SEQ ID NO 734      Contig20651_RC             SEQ ID NO 2077
NM_002168                  SEQ ID NO 735      Contig21130_RC             SEQ ID NO 2078
NM_002184                  SEQ ID NO 736      Contig21185_RC             SEQ ID NO 2079
NM_002185                  SEQ ID NO 737      Contig21421_RC             SEQ ID NO 2080
NM_002189                  SEQ ID NO 738      Contig21787_RC             SEQ ID NO 2081
NM_002200                  SEQ ID NO 739      Contig21812_RC             SEQ ID NO 2082
NM_002201                  SEQ ID NO 740      Contig22418_RC             SEQ ID NO 2083

NM_005572                  SEQ ID NO 1176     Contig53242_RC             SEQ ID NO 2526
NM_005582                  SEQ ID NO 1177     Contig53248_RC             SEQ ID NO 2527
NM_005608                  SEQ ID NO 1178     Contig53260_RC             SEQ ID NO 2528
NM_005614                  SEQ ID NO 1179     Contig53296_RC             SEQ ID NO 2531
NM_005617                  SEQ ID NO 1180     Contig53307_RC             SEQ ID NO 2532
NM_005620                  SEQ ID NO 1181     Contig53314_RC             SEQ ID NO 2533
NM_005625                  SEQ ID NO 1182     Contig53401_RC             SEQ ID NO 2534
NM_005651                  SEQ ID NO 1183     Contig53550_RC             SEQ ID NO 2535
NM_005658                  SEQ ID NO 1184     Contig53551_RC             SEQ ID NO 2536
NM_005659                  SEQ ID NO 1185     Contig53598_RC             SEQ ID NO 2537
NM_005667                  SEQ ID NO 1186     Contig53646_RC             SEQ ID NO 2538
NM_005686                  SEQ ID NO 1187     Contig53658_RC             SEQ ID NO 2539
NM_005690                  SEQ ID NO 1188     Contig53698_RC             SEQ ID NO 2540
NM_005720                  SEQ ID NO 1190     Contig53719_RC             SEQ ID NO 2541
NM_005727                  SEQ ID NO 1191     Contig53742_RC             SEQ ID NO 2542
NM_005733                  SEQ ID NO 1192     Contig53757_RC             SEQ ID NO 2543
NM_005737                  SEQ ID NO 1193     Contig53870_RC             SEQ ID NO 2544
NM_005742                  SEQ ID NO 1194     Contig53952_RC             SEQ ID NO 2546
NM_005746                  SEQ ID NO 1195     Contig53962_RC             SEQ ID NO 2547
NM_005749                  SEQ ID NO 1196     Contig53968_RC             SEQ ID NO 2548
NM_005760                  SEQ ID NO 1197     Contig54113_RC             SEQ ID NO 2549
NM_005764                  SEQ ID NO 1198     Contig54142_RC             SEQ ID NO 2550
NM_005794                  SEQ ID NO 1199     Contig54232_RC             SEQ ID NO 2551
NM_005796                  SEQ ID NO 1200     Contig54242_RC             SEQ ID NO 2552
NM_005804                  SEQ ID NO 1201     Contig54260_RC             SEQ ID NO 2553
NM_005813                  SEQ ID NO 1202     Contig54263_RC             SEQ ID NO 2554
NM_005824                  SEQ ID NO 1203     Contig54295_RC             SEQ ID NO 2555
NM_005825                  SEQ ID NO 1204     Contig54318_RC             SEQ ID NO 2556
NM_005849                  SEQ ID NO 1205     Contig54325_RC             SEQ ID NO 2557
NM_005853                  SEQ ID NO 1206     Contig54389_RC             SEQ ID NO 2558
NM_005855                  SEQ ID NO 1207     Contig54394_RC             SEQ ID NO 2559
NM_005864                  SEQ ID NO 1208     Contig54414_RC             SEQ ID NO 2560
NM_005874                  SEQ ID NO 1209     Contig54425                SEQ ID NO 2561
NM_005876                  SEQ ID NO 1210     Contig54477_RC             SEQ ID NO 2562
NM_005880                  SEQ ID NO 1211     Contig54503_RC             SEQ ID NO 2563
NM_005891                  SEQ ID NO 1212     Contig54534_RC             SEQ ID NO 2564
NM_005892                  SEQ ID NO 1213     Contig54560_RC             SEQ ID NO 2566

Table 2. 550 preferred ER status markers drawn from Table 1.

Table 3. 430 gene markers that distinguish BRCA1-related tumor samples from sporadic tumor samples.

Table 4. 100 preferred markers from Table 3 distinguishing BRCA1-related tumors from sporadic tumors.

Table 5. 231 gene markers that distinguish patients with good prognosis from patients with poor prognosis.

Table 6. 70 Preferred prognosis markers drawn from Table 5.


Table 7. Good and poor prognosis templates: mean subtracted log(intensity) values for each of the seventy markers listed in Table 6 for 44 breast cancer patients having a good prognosis (C1) or 34 breast cancer patients having a poor prognosis (C2) (see Examples).


The sets of markers listed in Tables 1-6 partially overlap; in other words, some markers are present in multiple sets, while other markers are unique to a set (FIG. 1). Thus, in one embodiment, the invention provides a set of 256 genetic markers that can distinguish between ER(+) and ER(-), and also between BRCA1 tumors and sporadic tumors (i.e., classify a tumor as ER(+) or ER(-) and BRCA1-related or sporadic). In a more specific embodiment, the invention provides subsets of at least 20, at least 50, at least 100, or at least 150 of the set of 256 markers, that can classify a tumor as ER(+) or ER(-) and BRCA1-related or sporadic. In another embodiment, the invention provides 165 markers that can distinguish between ER(+) and ER(-), and also between patients with good versus poor prognosis (i.e., classify a tumor as either ER(-) or ER(+) and as having been removed from a patient with a good prognosis or a poor prognosis). In a more specific embodiment, the invention further provides subsets of at least 20, 50, 100 or 125 of the full set of 165 markers, which also classify a tumor as either ER(-) or ER(+) and as having been removed from a patient with a good prognosis or a poor prognosis. The invention further provides a set of twelve markers that can distinguish between BRCA1 tumors and sporadic tumors, and between patients with good versus poor prognosis. Finally, the invention provides eleven markers capable of differentiating all three statuses. Conversely, the invention provides 2,050 of the 2,460 ER-status markers that can determine only ER status, 173 of the 430 BRCA1 v. sporadic markers that can determine only BRCA1 v. sporadic status, and 65 of the 231 prognosis markers that can only determine prognosis. In more specific embodiments, the invention also provides for subsets of at least 20, 50, 100, 200, 500, 1,000, 1,500 or 2,000 of the 2,050 ER-status markers that also determine only ER status. The invention also provides subsets of at least 20, 50, 100 or 150 of the 173 markers that also determine only BRCA1 v. sporadic status. The invention further provides subsets of at least 20, 30, 40, or 50 of the 65 prognostic markers that also determine only prognostic status.
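The marker counts given above are mutually consistent under inclusion-exclusion. The following is a minimal sketch of that arithmetic, assuming that the eleven markers common to all three sets are included in each of the pairwise overlap counts (256, 165 and twelve); the variable and function names are illustrative only and do not appear in the specification.

```python
# Set sizes and overlaps as stated in the text.
er_total, brca1_total, prognosis_total = 2460, 430, 231
er_and_brca1 = 256        # ER status and BRCA1 v. sporadic
er_and_prognosis = 165    # ER status and prognosis
brca1_and_prognosis = 12  # BRCA1 v. sporadic and prognosis
all_three = 11            # assumed to be counted in each pairwise overlap


def unique_count(total, overlap_a, overlap_b, triple):
    """Markers found in one set only: subtract both pairwise overlaps,
    then add back the three-way overlap, which was subtracted twice."""
    return total - overlap_a - overlap_b + triple


er_only = unique_count(er_total, er_and_brca1, er_and_prognosis, all_three)
brca1_only = unique_count(brca1_total, er_and_brca1, brca1_and_prognosis, all_three)
prognosis_only = unique_count(
    prognosis_total, er_and_prognosis, brca1_and_prognosis, all_three
)
print(er_only, brca1_only, prognosis_only)  # 2050 173 65
```

Under this assumption the computed values match the 2,050, 173 and 65 set-specific markers recited above.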

Any of the sets of markers provided above may be used alone or in combination with markers outside the set. For example, markers that distinguish ER-status may be used in combination with the BRCA1 vs. sporadic markers, or with the prognostic markers, or both. Any of the marker sets provided above may also be used in combination with other markers for breast cancer, or for any other clinical or physiological condition.

The relationship between the marker sets is diagramed in FIG. 1.

5.3.2 IDENTIFICATION OF MARKERS

The present invention provides sets of markers for the identification of conditions or indications associated with breast cancer. Generally, the marker sets were identified by determining which of ~25,000 human markers had expression patterns that correlated with the conditions or indications. In one embodiment, the method for identifying marker sets is as follows.

After extraction and labeling of target polynucleotides, the expression of all markers (genes) in a sample X is compared to the expression of all markers in a standard or control. In one embodiment, the standard or control comprises target polynucleotide molecules derived from a sample from a normal individual (i.e., an individual not afflicted with breast cancer). In a preferred embodiment, the standard or control is a pool of target polynucleotide molecules. The pool may be derived from collected samples from a number of normal individuals. In a preferred embodiment, the pool comprises samples taken from a number of individuals having sporadic-type tumors. In another preferred embodiment, the pool comprises an artificially-generated population of nucleic acids designed to approximate the level of nucleic acid derived from each marker found in a pool of marker-derived nucleic acids derived from tumor samples. In yet another embodiment, the pool is derived from normal or breast cancer cell lines or cell line samples.

The comparison may be accomplished by any means known in the art. For example, expression levels of various markers may be assessed by separation of target polynucleotide molecules (e.g., RNA or cDNA) derived from the markers in agarose or polyacrylamide gels, followed by hybridization with marker-specific oligonucleotide probes. Alternatively, the comparison may be accomplished by the labeling of target polynucleotide molecules followed by separation on a sequencing gel. Polynucleotide samples are placed on the gel such that patient and control or standard polynucleotides are in adjacent lanes. Comparison of expression levels is accomplished visually or by means of a densitometer. In a preferred embodiment, the expression of all markers is assessed simultaneously by hybridization to a microarray. In each approach, markers meeting certain criteria are identified as associated with breast cancer.

A marker is selected based upon significant difference of expression in a sample as compared to a standard or control condition. Selection may be made based upon either significant up- or down-regulation of the marker in the patient sample. Selection may also be made by calculation of the statistical significance (i.e., the p-value) of the correlation between the expression of the marker and the condition or indication. Preferably, both selection criteria are used. Thus, in one embodiment of the present invention, markers associated with breast cancer are selected where the markers show both more than two-fold change (increase or decrease) in expression as compared to a standard, and the p-value for the correlation between the existence of breast cancer and the change in marker expression is no more than 0.01 (i.e., is statistically significant).
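The combined selection criterion (more than two-fold change and a p-value of no more than 0.01) may be illustrated as follows. This is a minimal sketch only: it assumes that expression is available as a log10 ratio of sample to standard and that a p-value has already been computed by some other means; the function name, marker names and numerical values are hypothetical.

```python
import math


def passes_selection(log10_ratio, p_value, fold_cutoff=2.0, p_cutoff=0.01):
    """True if the marker changes more than `fold_cutoff`-fold in either
    direction and its p-value is no more than `p_cutoff`."""
    return abs(log10_ratio) > math.log10(fold_cutoff) and p_value <= p_cutoff


# Hypothetical measurements: (marker, log10(sample/standard), p-value).
measurements = [
    ("marker_A", 0.45, 0.004),   # about 2.8-fold up, significant
    ("marker_B", -0.50, 0.008),  # about 3.2-fold down, significant
    ("marker_C", 0.10, 0.001),   # significant, but only about 1.3-fold
    ("marker_D", 0.60, 0.200),   # about 4-fold, but not significant
]
selected = [name for name, ratio, p in measurements if passes_selection(ratio, p)]
print(selected)  # ['marker_A', 'marker_B']
```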

The expression of the identified breast cancer-related markers is then used to identify markers that can differentiate tumors into clinical types. In a specific embodiment using a number of tumor samples, markers are identified by calculation of correlation coefficients between the clinical category or clinical parameter(s) and the linear, logarithmic or any transform of the expression ratio across all samples for each individual gene. Specifically, the correlation coefficient is calculated as

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\|\vec{c}\| \cdot \|\vec{r}\|}$$ Equation (2)

where $\vec{c}$ represents the clinical parameters or categories and $\vec{r}$ represents the linear, logarithmic or any transform of the ratio of expression between sample and control. Markers for which the coefficient of correlation exceeds a cutoff are identified as breast cancer-related markers specific for a particular clinical type. Such a cutoff or threshold corresponds to a certain significance of discriminating genes obtained by Monte Carlo simulations. The threshold depends upon the number of samples used; the threshold can be calculated as $3 \times 1/\sqrt{n-3}$, where $1/\sqrt{n-3}$ is the distribution width and $n$ = the number of samples. In a specific embodiment, markers are chosen if the correlation coefficient is greater than about 0.3 or less than about -0.3.
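The correlation coefficient of Equation (2) and the sample-size-dependent threshold may be sketched as follows. The sketch assumes that both vectors are mean-centered before the dot product is taken, so that the result is the ordinary Pearson correlation coefficient; the text does not state this explicitly. The function names and the data are hypothetical.

```python
import math


def correlation(c, r):
    """Equation (2): dot product of c and r divided by the product of their
    norms. Both vectors are mean-centered first (an assumption)."""
    mean_c, mean_r = sum(c) / len(c), sum(r) / len(r)
    c0 = [x - mean_c for x in c]
    r0 = [y - mean_r for y in r]
    dot = sum(x * y for x, y in zip(c0, r0))
    norms = math.sqrt(sum(x * x for x in c0)) * math.sqrt(sum(y * y for y in r0))
    return dot / norms


def threshold(n):
    """Cutoff of three distribution widths: 3 * 1/sqrt(n - 3)."""
    return 3.0 / math.sqrt(n - 3)


# Hypothetical data: clinical category (1 or 0) and log ratio of one gene
# across eight samples.
category = [1, 1, 1, 1, 0, 0, 0, 0]
log_ratio = [0.8, 0.6, 0.9, 0.7, -0.5, -0.7, -0.4, -0.6]
rho = correlation(category, log_ratio)
is_marker = abs(rho) > 0.3  # fixed cutoff of the specific embodiment
```

For a hypothetical set of n = 103 samples, `threshold(103)` evaluates to 0.3, which is in line with the cutoff of about 0.3 recited above.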

Next, the significance of the correlation is calculated. This significance may be calculated by any statistical means by which such significance is calculated. In a specific example, a set of correlation data is generated using a Monte-Carlo technique to randomize the association between the expression difference of a particular marker and the clinical category. The frequency distribution of markers satisfying the criteria through calculation of correlation coefficients is compared to the number of markers satisfying the criteria in the data generated through the Monte-Carlo technique. The frequency distribution of markers satisfying the criteria in the Monte-Carlo runs is used to determine whether the number of markers selected by correlation with clinical data is significant. See Example 4.
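One way to carry out such a Monte-Carlo randomization is sketched below. The clinical categories are shuffled among the samples, the number of markers exceeding the correlation cutoff is recounted in each run, and the fraction of runs selecting at least as many markers as the unshuffled data is reported. The number of runs, the random seed, the function names and the data are all assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0


def count_passing(categories, profiles, cutoff):
    """Number of markers whose |correlation| with the categories exceeds cutoff."""
    return sum(1 for r in profiles if abs(pearson(categories, r)) > cutoff)


def monte_carlo_p(categories, profiles, cutoff=0.3, runs=1000, seed=0):
    """Observed marker count, and the fraction of shuffled runs that select
    at least as many markers as the real category assignment."""
    rng = random.Random(seed)
    observed = count_passing(categories, profiles, cutoff)
    shuffled = list(categories)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)
        if count_passing(shuffled, profiles, cutoff) >= observed:
            hits += 1
    return observed, hits / runs


# Hypothetical data: 10 samples (5 per category), one row of log ratios per marker.
categories = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
profiles = [
    [0.9, 0.7, 0.8, 0.6, 0.9, -0.5, -0.7, -0.6, -0.8, -0.4],
    [0.5, 0.4, 0.6, 0.3, 0.5, -0.2, -0.4, -0.3, -0.1, -0.5],
    [0.1, -0.2, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1, -0.1, 0.0],
]
observed, p_value = monte_carlo_p(categories, profiles)
```

A small returned fraction indicates that a marker count as large as the observed one is rarely obtained by chance.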

Once a marker set is identified, the markers may be rank-ordered in order of significance of discrimination. One means of rank ordering is by the amplitude of correlation between the change in gene expression of the marker and the specific condition being discriminated. Another, preferred, means is to use a statistical metric. In a specific embodiment, the metric is a Fisher-like statistic:

$$t = \frac{\langle x_1 \rangle - \langle x_2 \rangle}{\sqrt{\left[\sigma_1^2 (n_1 - 1) + \sigma_2^2 (n_2 - 1)\right] / (n_1 + n_2 - 1) / (1/n_1 + 1/n_2)}}$$ Equation (3)

In this equation, $\langle x_1 \rangle$ is the error-weighted average of the log ratio of transcript expression measurements within a first diagnostic group (e.g., ER(-)), $\langle x_2 \rangle$ is the error-weighted average of log ratio within a second, related diagnostic group (e.g., ER(+)), $\sigma_1^2$ is the variance of the log ratio within the ER(-) group and $n_1$ is the number of samples for which valid measurements of log ratios are available. $\sigma_2^2$ is the variance of log ratio within the second diagnostic group (e.g., ER(+)), and $n_2$ is the number of samples for which valid measurements of log ratios are available. The t-value represents the variance-compensated difference between two means.
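Rank ordering by the Fisher-like statistic may be sketched as follows. The denominator is implemented literally as Equation (3) is read here, namely the square root of [σ₁²(n₁ - 1) + σ₂²(n₂ - 1)]/(n₁ + n₂ - 1)/(1/n₁ + 1/n₂); this reading is an assumption. Simple unweighted means and sample variances stand in for the error-weighted quantities, which is a further assumption, and all names and values are hypothetical.

```python
import math
import statistics


def fisher_like_t(group1, group2):
    """Difference of group means divided by
    sqrt([s1^2 (n1-1) + s2^2 (n2-1)] / (n1 + n2 - 1) / (1/n1 + 1/n2)),
    using unweighted means and sample variances (assumptions)."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    pooled = (var1 * (n1 - 1) + var2 * (n2 - 1)) / (n1 + n2 - 1)
    return (mean1 - mean2) / math.sqrt(pooled / (1.0 / n1 + 1.0 / n2))


# Hypothetical log ratios per marker: (ER(-) group, ER(+) group).
markers = {
    "marker_A": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "marker_B": ([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]),
}
t_values = {name: fisher_like_t(g1, g2) for name, (g1, g2) in markers.items()}
# Rank by absolute t-value, most discriminating marker first.
ranked = sorted(t_values, key=lambda name: abs(t_values[name]), reverse=True)
```

Because only the ordering of the absolute values is used, the ranking is unaffected by any constant factor applied to the statistic when the group sizes are the same for every marker.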

The rank-ordered marker set may be used to optimize the number of markers in the set used for discrimination. This is accomplished generally in a "leave one out" method as follows. In a first run, a subset, for example 5, of the markers from the top of the ranked list is used to generate a template, where out of X samples, X-1 are used to generate the template, and the status of the remaining sample is predicted. This process is repeated for every sample until every one of the X samples is predicted once. In a second run, additional markers, for example 5, are added, so that a template is now generated from 10 markers, and the outcome of the remaining sample is predicted. This process is repeated until the entire set of markers is used to generate the template. For each of the runs, type 1 errors (false negative) and type 2 errors (false positive) are counted; the optimal number of markers is that number where the type 1 error rate, or type 2 error rate, or preferably the total of type 1 and type 2 error rate is lowest.
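The "leave one out" optimization may be sketched as follows. In this sketch the template for each class is taken to be the mean profile of the retained samples of that class, the held-out sample is assigned to the class whose template it correlates with most strongly, and only the total number of misclassifications (type 1 plus type 2) is counted. These choices, the function names and the data are assumptions; each class is assumed to contain at least two samples.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den else 0.0


def mean_profile(rows):
    """Marker-by-marker average of several expression profiles."""
    return [sum(col) / len(col) for col in zip(*rows)]


def loo_errors(samples, labels, k):
    """Total leave-one-out misclassifications using the top k ranked markers."""
    errors = 0
    for i, held_out in enumerate(samples):
        templates = {}
        for label in sorted(set(labels)):
            rows = [s[:k] for j, s in enumerate(samples)
                    if j != i and labels[j] == label]
            templates[label] = mean_profile(rows)
        predicted = max(templates, key=lambda lab: pearson(held_out[:k], templates[lab]))
        if predicted != labels[i]:
            errors += 1
    return errors


def optimal_marker_count(samples, labels, step=5):
    """Add `step` markers per run; return the count with the fewest errors."""
    n_markers = len(samples[0])
    counts = list(range(step, n_markers + 1, step))
    if not counts or counts[-1] != n_markers:
        counts.append(n_markers)
    errors = {k: loo_errors(samples, labels, k) for k in counts}
    return min(counts, key=lambda k: errors[k]), errors


# Hypothetical data: 6 samples, 10 markers already in rank order.
pattern = [1.0, -1.0, 1.0, -1.0, 1.0]
samples = [[p + e for p in pattern] + [0.0] * 5 for e in (0.0, 0.1, 0.2)]
samples += [[-p + e for p in pattern] + [0.0] * 5 for e in (0.0, 0.1, 0.2)]
labels = ["good"] * 3 + ["poor"] * 3
best, errors = optimal_marker_count(samples, labels)
```

When several marker counts give the same number of errors, this sketch returns the smallest such count.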

For prognostic markers, validation of the marker set may be accomplished by an additional statistic, a survival model. This statistic generates the probability of tumor distant metastases as a function of time since initial diagnosis. A number of models may be used, including Weibull, normal, log-normal, log logistic, log-exponential, or log-Rayleigh (Chapter 12 "Life Testing", S-PLUS 2000 GUIDE TO STATISTICS, Vol. 2, p. 368 (2000)). For the "normal" model, the probability of distant metastases P at time t is calculated as

$$P = \alpha \times \exp(-t^2/\tau^2)$$ Equation (4)

where $\alpha$ is fixed and equal to 1, and $\tau$ is a parameter to be fitted and measures the "expected lifetime".
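Evaluation and fitting of the "normal" model of Equation (4) may be sketched as follows, with the first factor fixed at 1 and the expected-lifetime parameter chosen from a list of candidate values by least squares. The grid search, the function names and the data are assumptions made for illustration; any standard fitting procedure could be used instead.

```python
import math


def normal_model(t, tau, alpha=1.0):
    """Equation (4): P = alpha * exp(-t^2 / tau^2), with alpha fixed at 1."""
    return alpha * math.exp(-(t * t) / (tau * tau))


def fit_tau(times, observed, candidates):
    """Return the candidate tau with the smallest sum of squared errors."""
    def sse(tau):
        return sum((normal_model(t, tau) - p) ** 2 for t, p in zip(times, observed))
    return min(candidates, key=sse)


# Hypothetical observations generated from tau = 10 years.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
observed = [normal_model(t, 10.0) for t in times]
tau_hat = fit_tau(times, observed, candidates=[5.0, 7.5, 10.0, 12.5])
print(tau_hat)  # 10.0
```

With the first factor equal to 1, the model gives P = 1 at t = 0 and P = exp(-1) at t equal to the fitted parameter.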

It will be apparent to those skilled in the art that the above methods, in particular the statistical methods, described above, are not limited to the identification of markers associated with breast cancer, but may be used to identify sets of marker genes associated with any phenotype. The phenotype can be the presence or absence of a disease such as cancer, or the presence or absence of any identifying clinical condition associated with that cancer. In the disease context, the phenotype may be a prognosis such as a survival time, probability of distant metastases of a disease condition, or likelihood of a particular response to a therapeutic or prophylactic regimen. The phenotype need not be cancer, or a disease; the phenotype may be a nominal characteristic associated with a healthy individual.

5.3.3 SAMPLE COLLECTION

In the present invention, target polynucleotide molecules are extracted from a sample taken from an individual afflicted with breast cancer. The sample may be collected in any clinically acceptable manner, but must be collected such that marker-derived polynucleotides (i.e., RNA) are preserved. mRNA or nucleic acids derived therefrom (i.e., cDNA or amplified DNA) are preferably labeled distinguishably from standard or control polynucleotide molecules, and both are simultaneously or independently hybridized to a microarray comprising some or all of the markers or marker sets or subsets described above. Alternatively, mRNA or nucleic acids derived therefrom may be labeled with the same label as the standard or control polynucleotide molecules, wherein the intensity of hybridization of each at a particular probe is compared. A sample may comprise any clinically relevant tissue sample, such as a tumor biopsy or fine needle aspirate, or a sample of bodily fluid, such as blood, plasma, serum, lymph, ascitic fluid, cystic fluid, urine or nipple exudate. The sample may be taken from a human, or, in a veterinary context, from non-human animals such as ruminants, horses, swine or sheep, or from domestic companion animals such as felines and canines.

Methods for preparing total and poly(A)+ RNA are well known and are described generally in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989) and Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994).

RNA may be isolated from eukaryotic cells by procedures that involve lysis of the cells and denaturation of the proteins contained therein. Cells of interest include wild-type cells (i.e., non-cancerous), drug-exposed wild-type cells, tumor- or tumor-derived cells, modified cells, normal or tumor cell line cells, and drug-exposed modified cells.

Additional steps may be employed to remove DNA. Cell lysis may be accomplished with a nonionic detergent, followed by microcentrifugation to remove the nuclei and hence the bulk of the cellular DNA. In one embodiment, RNA is extracted from cells of the various types of interest using guanidinium thiocyanate lysis followed by CsCl centrifugation to separate the RNA from DNA (Chirgwin et al., Biochemistry 18:5294-5299 (1979)). Poly(A)+ RNA is selected by selection with oligo-dT cellulose (see Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)). Alternatively, separation of RNA from DNA can be accomplished by organic extraction, for example, with hot phenol or phenol/chloroform/isoamyl alcohol.

If desired, RNAse inhibitors may be added to the lysis buffer. Likewise, for certain cell types, it may be desirable to add a protein denaturation/digestion step to the protocol.

For many applications, it is desirable to preferentially enrich mRNA with respect to other cellular RNAs, such as transfer RNA (tRNA) and ribosomal RNA (rRNA). Most mRNAs contain a poly(A) tail at their 3' end. This allows them to be enriched by affinity chromatography, for example, using oligo(dT) or poly(U) coupled to a solid support, such as cellulose or Sephadex™ (see Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994)). Once bound, poly(A)+ mRNA is eluted from the affinity column using 2 mM EDTA/0.1% SDS.

The sample of RNA can comprise a plurality of different mRNA molecules, each different mRNA molecule having a different nucleotide sequence. In a specific embodiment, the mRNA molecules in the RNA sample comprise at least 100 different nucleotide sequences. More preferably, the mRNA molecules of the RNA sample comprise mRNA molecules corresponding to each of the marker genes. In another specific embodiment, the RNA sample is a mammalian RNA sample.

In a specific embodiment, total RNA or mRNA from cells is used in the methods of the invention. The source of the RNA can be cells of a plant or animal, human, mammal, primate, non-human animal, dog, cat, mouse, rat, bird, yeast, eukaryote, prokaryote, etc. In specific embodiments, the method of the invention is used with a sample containing total mRNA or total RNA from 1 x 10⁶ cells or less. In another embodiment, proteins can be isolated from the foregoing sources, by methods known in the art, for use in expression analysis at the protein level.

Probes to the homologs of the marker sequences disclosed herein can be employed preferably wherein non-human nucleic acid is being assayed.

5.4 METHODS OF USING BREAST CANCER MARKER SETS

5.4.1 DIAGNOSTIC METHODS

The present invention provides for methods of using the marker sets to analyze a sample from an individual so as to determine the individual's tumor type or subtype at a molecular level, whether a tumor is of the ER(+) or ER(-) type, and whether the tumor is BRCA1-associated or sporadic. The individual need not actually be afflicted with breast cancer. Essentially, the expression of specific marker genes in the individual, or a sample taken therefrom, is compared to a standard or control. For example, assume two breast cancer-related conditions, X and Y. One can compare the level of expression of breast cancer prognostic markers for condition X in an individual to the level of the marker-derived polynucleotides in a control, wherein the level represents the level of expression exhibited by samples having condition X. In this instance, if the expression of the markers in the individual's sample is substantially (i.e., statistically) different from that of the control, then the individual does not have condition X. Where, as here, the choice is bimodal (i.e., a sample is either X or Y), the individual can additionally be said to have condition Y. Of course, the comparison to a control representing condition Y can also be performed. Preferably both are performed simultaneously, such that each control acts as both a positive and a negative control. The distinguishing result may thus either be a demonstrable difference from the expression levels (i.e., the amount of marker-derived RNA, or polynucleotides derived therefrom) represented by the control, or no significant difference.

Thus, in one embodiment, the method of determining a particular tumor-related status of an individual comprises the steps of (1) hybridizing labeled target polynucleotides from an individual to a microarray containing one of the above marker sets; (2) hybridizing standard or control polynucleotide molecules to the microarray, wherein the standard or control molecules are differentially labeled from the target molecules; and (3) determining the difference in transcript levels, or lack thereof, between the target and standard or control, wherein the difference, or lack thereof, determines the individual's tumor-related status. In a more specific embodiment, the standard or control molecules comprise marker-derived polynucleotides from a pool of samples from normal individuals, or a pool of tumor samples from individuals having sporadic-type tumors. In a preferred embodiment, the standard or control is an artificially-generated pool of marker-derived polynucleotides, which pool is designed to mimic the level of marker expression exhibited by clinical samples of normal or breast cancer tumor tissue having a particular clinical indication (i.e., cancerous or non-cancerous; ER(+) or ER(-) tumor; BRCA1- or sporadic type tumor). In another specific embodiment, the control molecules comprise a pool derived from normal or breast cancer cell lines.

The present invention provides sets of markers useful for distinguishing ER(+) from ER(-) tumor types. Thus, in one embodiment of the above method, the level of polynucleotides (i.e., mRNA or polynucleotides derived therefrom) in a sample from an individual, expressed from the markers provided in Table 1, is compared to the level of expression of the same markers from a control, wherein the control comprises marker-related polynucleotides derived from ER(+) samples, ER(-) samples, or both. Preferably, the comparison is to both ER(+) and ER(-), and preferably the comparison is to polynucleotide pools from a number of ER(+) and ER(-) samples, respectively. Where the individual's marker expression most closely resembles or correlates with the ER(+) control, and does not resemble or correlate with the ER(-) control, the individual is classified as ER(+). Where the pool is not pure ER(+) or ER(-) (for example, where a sporadic pool is used), a set of experiments should be performed in which nucleic acids from individuals with known ER status are hybridized against the pool, in order to define the expression templates for the ER(+) and ER(-) groups. Nucleic acids from each individual with unknown ER status are hybridized against the same pool and the expression profile is compared to the template(s) to determine the individual's ER status.

The present invention provides sets of markers useful for distinguishing BRCA1-related tumors from sporadic tumors. Thus, the method can be performed substantially as for the ER(+/-) determination, with the exception that the markers are those listed in Tables 3 and 4, and the control markers are a pool of marker-derived polynucleotides from BRCA1 tumor samples, and a pool of marker-derived polynucleotides from sporadic tumors. A patient is determined to have a BRCA1 germline mutation where the expression of the individual's marker-derived polynucleotides most closely resembles, or is most closely correlated with, that of the BRCA1 control. Where the control is not pure BRCA1 or sporadic, two templates can be defined in a manner similar to that for ER status, as described above.

For the above two embodiments of the method, the full set of markers may be used (i.e., the complete set of markers for Tables 1 or 3). In other embodiments, subsets of the markers may be used. In a preferred embodiment, the preferred markers listed in Tables 2 or 4 are used.

The similarity between the marker expression profile of an individual and that of a control can be assessed in a number of ways. In the simplest case, the profiles can be compared visually in a printout of expression difference data. Alternatively, the similarity can be calculated mathematically.

In one embodiment, the similarity between two patients x and y, or patient x and a template y, expressed as a similarity value, can be calculated using the following equation:

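$$S = 1 - \left[ \frac{\sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right) \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)}{\sqrt{\sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right)^2 \sum_{i=1}^{N} \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)^2}} \right] \qquad \text{Equation (5)}$$

In this equation, $x$ and $y$ are two patients with components of log ratio $x_i$ and $y_i$, $i = 1, 2, \ldots, N = 4{,}986$. Associated with every value $x_i$ is error $\sigma_{x_i}$. The smaller the value $\sigma_{x_i}$, the more reliable the measurement $x_i$.

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i / \sigma_{x_i}^2}{\sum_{i=1}^{N} 1 / \sigma_{x_i}^2}$$

is the error-weighted arithmetic mean.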
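The similarity value can be illustrated with a minimal sketch. It assumes that Equation (5) is one minus an error-weighted correlation of the two log-ratio profiles, with each profile centered on its error-weighted arithmetic mean. The function names and argument layout are hypothetical and are chosen only for illustration.

```python
import math


def error_weighted_mean(values, errors):
    """Mean of `values` weighted by 1 / error**2 (assumed weighting)."""
    weights = [1.0 / (e * e) for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def similarity(x, sigma_x, y, sigma_y):
    """Return S = 1 - (error-weighted correlation of x and y).

    x, y: log-ratio components for two patients (or patient and template).
    sigma_x, sigma_y: measurement errors for each component.
    """
    x_bar = error_weighted_mean(x, sigma_x)
    y_bar = error_weighted_mean(y, sigma_y)
    # Center each component and scale it by its own error.
    zx = [(xi - x_bar) / s for xi, s in zip(x, sigma_x)]
    zy = [(yi - y_bar) / s for yi, s in zip(y, sigma_y)]
    numerator = sum(a * b for a, b in zip(zx, zy))
    denominator = math.sqrt(sum(a * a for a in zx) * sum(b * b for b in zy))
    return 1.0 - numerator / denominator
```

Under this assumed form, identical profiles give a value near 0 and perfectly anti-correlated profiles give a value near 2, so smaller values indicate more similar profiles.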

In a preferred embodiment, templates are developed for sample comparison. The template is defined as the error-weighted log ratio average of the expression difference for the group of marker genes able to differentiate the particular breast cancer-related condition. For example, templates are defined for ER(+) samples and for ER(-) samples. Next, a classifier parameter is calculated. This parameter may be calculated using either expression level differences between the sample and template, or by calculation of a correlation coefficient. Such a coefficient, $P_i$, can be calculated using the following equation:

$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\lVert \vec{z}_i \rVert \cdot \lVert \vec{y} \rVert} \qquad \text{Equation (1)}$$

where $\vec{z}_i$ is the expression template $i$, and $\vec{y}$ is the expression profile of a patient.
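The template comparison can be sketched as follows, assuming that the coefficient of Equation (1) is the dot product of the template and the patient profile divided by the product of their Euclidean norms. The helper names and the dictionary of templates are hypothetical.

```python
import math


def template_correlation(template, profile):
    """Normalized dot product between a template and a patient profile."""
    dot = sum(z * y for z, y in zip(template, profile))
    norm_z = math.sqrt(sum(z * z for z in template))
    norm_y = math.sqrt(sum(y * y for y in profile))
    return dot / (norm_z * norm_y)


def closest_template(profile, templates):
    """Return the name of the template with the highest coefficient.

    templates: dict mapping a label (e.g. "ER(+)") to a list of
    template values over the same marker genes as `profile`.
    """
    return max(
        templates,
        key=lambda name: template_correlation(templates[name], profile),
    )
```

For example, with templates `{"ER(+)": [1.0, -1.0], "ER(-)": [-1.0, 1.0]}`, a profile of `[0.8, -0.5]` is assigned to the ER(+) template because its coefficient with that template is the larger of the two.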

Thus, in a more specific embodiment, the above method of determining a particular tumor-related status of an individual comprises the steps of (1) hybridizing labeled target polynucleotides from an individual to a microarray containing one of the above marker sets; (2) hybridizing standard or control polynucleotide molecules to the microarray, wherein the standard or control molecules are differentially labeled from the target molecules; (3) determining the ratio (or difference) of transcript levels between two channels (individual and control), or simply the transcript levels of the individual; and (4) comparing the results from (3) to the predefined templates, wherein said determining is accomplished by means of the statistic of Equation 1 or Equation 5, and wherein the difference, or lack thereof, determines the individual's tumor-related status.

5.4.2 PROGNOSTIC METHODS

The present invention provides sets of markers useful for classifying patients with breast cancer into different prognostic categories. For example, the invention further provides a method for using these markers to determine whether an individual afflicted with breast cancer will have a good or poor clinical prognosis. The present invention further provides a method of further classifying "good prognosis" patients into two groups: those having a "very good prognosis" and those having an "intermediate prognosis." For each of the above classifications, the invention further provides recommended therapeutic regimens. The method can use the complete set of markers listed in Table 5.

However, subsets of the markers listed in Table 5 may also be used. In a preferred embodiment, the subset of 70 markers listed in Table 6 is used. At least 5, 10, 15, 20, 25, 30, 40, 50, 60, or all 70 of the markers in Table 6 may be used.

Classification of a sample as "good prognosis" or "poor prognosis" is accomplished substantially as for the diagnostic markers described above, wherein a template is generated to which the marker expression levels in the sample are compared.

Thus, in one embodiment of the above method, the level of polynucleotides (i.e., mRNA or polynucleotides derived therefrom) in a sample from an individual breast cancer patient, expressed from the markers provided in Table 5, is compared to the level of expression of the same markers from a control, wherein the control comprises marker-related polynucleotides derived from breast cancer tumor samples taken from breast cancer patients clinically determined to have a good prognosis ("good prognosis" control), breast cancer patients clinically determined to have a poor prognosis ("poor prognosis" control), or both. The comparison may be to both good prognosis and poor prognosis controls, and preferably the comparison is to polynucleotide pools from a number of good prognosis and poor prognosis samples, respectively. Where the individual's marker expression most closely resembles or correlates with the good prognosis control, and does not resemble or correlate with the poor prognosis control, the individual is classified as having a good prognosis. Where the pool is not pure "good prognosis" or "poor prognosis," a set of experiments should be performed in which nucleic acids from samples from individuals with known outcomes are hybridized against the pool to define the expression templates for the good prognosis and poor prognosis groups. Nucleic acids from each individual with unknown outcome are hybridized against the same pool and the resulting expression profile is compared to the templates to predict that individual's outcome.

The control or standard may be presented in a number of different formats. For example, the control, or template, to which the expression of marker genes in a breast cancer tumor sample is compared may be the average absolute level of expression of each of the genes in a pool of marker-derived nucleic acids pooled from breast cancer tumor samples obtained from a plurality of breast cancer patients. In this case, the difference between the absolute level of expression of these genes in the control and in a sample from a breast cancer patient provides the degree of similarity or dissimilarity of the level of expression in the patient sample and the control. The absolute level of expression may be measured by the intensity of the hybridization of the nucleic acids to an array. In other embodiments, the values for the expression levels of the markers in both the patient sample and control are transformed (see Section 5.4.3). For example, the expression level value for the patient, and the average expression level value for the pool, for each of the marker genes selected, may be transformed by taking the logarithm of the value.

Moreover, the expression level values may be normalized by, for example, dividing by the median hybridization intensity of all of the samples that make up the pool. The control may be derived from hybridization data obtained simultaneously with the patient sample expression data, or may constitute a set of numerical values stored on a computer, or on a computer-readable medium.

In one embodiment, the invention provides for a method of determining whether an individual afflicted with breast cancer will likely experience a relapse within five years of initial diagnosis (i.e., whether an individual has a poor prognosis) comprising (1) comparing the level of expression of the markers listed in Table 5 in a sample taken from the individual to the level of the same markers in a standard or control, where the standard or control levels represent those found in an individual with a poor prognosis; and (2) determining whether the level of the marker-related polynucleotides in the sample from the individual is significantly different from that of the control, wherein if no substantial difference is found, the patient has a poor prognosis, and if a substantial difference is found, the patient has a good prognosis. Persons of skill in the art will readily see that the markers associated with good prognosis can also be used as controls. In a more specific embodiment, both controls are run.

Poor prognosis of breast cancer may indicate that a tumor is relatively aggressive, while good prognosis may indicate that a tumor is relatively nonaggressive. Therefore, the invention provides for a method of determining a course of treatment of a breast cancer patient, comprising determining whether the level of expression of the 231 markers of Table 5, or a subset thereof, correlates with the level of these markers in a sample representing a good prognosis expression pattern or a poor prognosis pattern; and determining a course of treatment, wherein if the expression correlates with the poor prognosis pattern, the tumor is treated as an aggressive tumor.

Patients having an expression profile correlating with the good prognosis profile may be further divided into "very good prognosis" and "intermediate prognosis" groups. In the original 78 samples used to determine the 70 optimal prognostic marker genes, patients whose expression profile correlated with (i.e., had a correlation coefficient of 0.4 or greater) the average "good prognosis" expression profile were classified as having a "good prognosis." It was subsequently found that tumors with an expression profile having a coefficient of correlation to the average "good prognosis" expression profile greater than 0.636 developed no distant metastases. These patients may receive a different therapeutic regimen than patients whose tumors have a "good prognosis" expression profile that correlates less strongly to the average "good prognosis" expression profile. Accordingly, patients were classified as having a "very good prognosis" expression profile if the correlation coefficient exceeded 0.636, and an "intermediate prognosis" if their expression profile correlation coefficient was 0.4 or greater but less than or equal to 0.636. The data for the 70 genes listed in Table 6 for these 78 patients are listed in Table 7.

This methodology may be generalized to situations in which data from other groups of patients is used, where a group of patients is to provide clinical and expression data to be used for classification of subsequent breast cancer patients. A group of patients is selected for which clinical and follow-up data are available for at least five years after initial diagnosis. Preferably the patients in the group are selected as a consecutive series to reduce or eliminate selection bias. Breast cancer tumor samples are taken from each patient, and marker-related polynucleotides are generated. The expression levels of each of the marker genes listed in Table 5 or a subset thereof, preferably at least five of the marker genes listed in Table 6, are determined for each tumor sample (i.e., for each patient) to generate a patient expression profile. Marker-derived polynucleotides from patients within the group clinically determined to have a good prognosis (i.e., no distant metastases within five years of initial diagnosis) are pooled and mean expression levels for each of the prognosis-related marker genes are determined to obtain a control expression profile. Patients are then rank ordered in descending order of similarity of patient expression profiles to the control expression profile to produce a rank-ordered list of patients, where the similarity is a value expressed by a single similarity metric such as a correlation coefficient. A first threshold similarity value is then selected, which divides the group of patients into those predicted to have a good prognosis and those predicted to have a poor prognosis. This first threshold similarity value may be the similarity value that most accurately predicts clinical outcomes (i.e., results in an expression profile classification that results in the fewest misclassifications when compared to actual clinical outcomes), or a similarity value that results in a particular number or percentage of false negatives in the group, where a false negative is an expression-based good prognosis prediction for a breast cancer patient who actually develops a distant metastasis within the five-year period after initial diagnosis. A second threshold similarity value is then selected which divides the "good prognosis" group into two groups. This threshold similarity value is determined empirically as the similarity value for the patient highest on the rank-ordered list of patients who actually develops a distant metastasis within the five-year period.
This second threshold similarity value divides the "good prognosis" group into a group of patients having a "very good prognosis," i.e., those having similarity values equal to or higher than the second threshold similarity value, and an "intermediate prognosis" group, i.e., those having a similarity value equal to or greater than the first threshold similarity value, but less than the second threshold similarity value. Patients whose similarity values are less than the first threshold similarity value are classified as having a "poor prognosis." Subsequent patients may be similarly classified by calculating a similarity value for the patient, where the control is the "good prognosis" template or expression profile, and comparison of this similarity metric to the similarity metrics obtained above.
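The selection of the two threshold similarity values from a group of patients with known outcomes can be sketched as follows. This is a minimal illustration under stated assumptions: each patient is represented as a hypothetical `(similarity, metastasized)` pair, the first threshold is chosen as the candidate value giving the fewest misclassifications (ties resolved toward the lowest candidate, which is an arbitrary choice made here), and the second threshold is the similarity value of the highest-ranked patient who developed a distant metastasis.

```python
def second_threshold(patients):
    """Similarity of the highest-ranked patient with a distant metastasis.

    patients: list of (similarity, metastasized) pairs, where
    metastasized is True if a distant metastasis occurred within
    five years of initial diagnosis.
    """
    ranked = sorted(patients, key=lambda p: p[0], reverse=True)
    for similarity, metastasized in ranked:
        if metastasized:
            return similarity
    return None  # no patient in the group developed a metastasis


def first_threshold_fewest_errors(patients):
    """Candidate threshold that yields the fewest misclassifications.

    A patient is predicted "good prognosis" when similarity >= threshold.
    A prediction is wrong when predicted-good equals metastasized.
    """
    best_threshold, best_errors = None, None
    for candidate in sorted({s for s, _ in patients}):
        errors = sum(1 for s, m in patients if (s >= candidate) == m)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold
```

The candidate thresholds are restricted to the observed similarity values, since the number of misclassifications can only change at those values.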

Thus, in one embodiment, the invention provides a method for classifying a breast cancer patient according to prognosis, comprising comparing the levels of expression of at least five of the genes for which markers are listed in Table 5 in a cell sample taken from said breast cancer patient to control levels of expression of said at least five genes; and classifying said breast cancer patient according to prognosis of his or her breast cancer based on the similarity between said levels of expression in said cell sample and said control levels. In a more specific embodiment, the second step of this method comprises determining whether said similarity exceeds one or more predetermined threshold values of similarity. In another more specific embodiment of this method, said control levels are the mean levels of expression of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have no distant metastases within five years of initial diagnosis. In another more specific embodiment of this method, said control levels comprise the expression levels of said genes in breast cancer patients who have had no distant metastases within five years of initial diagnosis. In yet another more specific embodiment of this method, said control levels comprise, for each of said at least five of the genes for which markers are listed in Table 5, mean log intensity values stored on a computer. In yet another more specific embodiment of this method, said control levels comprise, for each of said at least five of the genes for which markers are listed in Table 6, mean log intensity values stored on a computer. In another more specific embodiment of this method, said control levels comprise, for each of said at least five genes listed in Table 6, the mean log intensity values that are listed in Table 7. The set of mean log intensity values listed in this table may be used as a "good prognosis" template for any of the prognostic methods described herein. The above method may also compare the level of expression of at least ten, 20, 30, 40, 50, 75, 100 or more genes for which markers are listed in Table 5, or may use the 70 preferred genes for which markers are listed in Table 6.

The present invention also provides for the classification of a breast cancer patient into one of three prognostic categories comprising (a) determining the similarity between the level of expression of at least five of the genes for which markers are listed in Table 5 to control levels of expression to obtain a patient similarity value; (b) providing a first threshold similarity value that differentiates persons having a good prognosis from those having a poor prognosis, and providing a second threshold similarity value, where said second threshold similarity value indicates a higher degree of similarity of the expression of said genes to said control than said first similarity value; and (c) classifying the breast cancer patient into a first prognostic category if the patient similarity value exceeds the first and second threshold similarity values, a second prognostic category if the patient similarity value equals or exceeds the first but not the second threshold similarity value, and a third prognostic category if the patient similarity value is less than the first threshold similarity value. In a more specific embodiment, the levels of expression of each of said at least five genes are determined first. As above, the control comprises marker-related polynucleotides derived from breast cancer tumor samples taken from breast cancer patients clinically determined to have a good prognosis ("good prognosis" control), breast cancer patients clinically determined to have a poor prognosis ("poor prognosis" control), or both. In a preferred embodiment, the control is a "good prognosis" control or template, i.e., a control or template comprising the mean levels of expression of said genes in breast cancer patients who have had no distant metastases within five years of initial diagnosis. In another more specific embodiment, said control levels comprise a set of values, for example mean log intensity values, preferably normalized, stored on a computer. In a more specific embodiment, said control or template is the set of mean log intensity values shown in Table 7. In another specific embodiment, said determining in step (a) may be accomplished by a method comprising determining the difference between the absolute expression level of each of said genes and the average expression level of the same genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis. In another specific embodiment, said determining in step (a) may be accomplished by a method comprising determining the degree of similarity between the level of expression of each of said genes in a breast cancer tumor sample taken from a breast cancer patient and the level of expression of the same genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis.

In a specific embodiment of the above method, said first threshold similarity value and said second threshold similarity value are selected by a method comprising (a) rank ordering in descending order said tumor samples that compose said pool of tumor samples by the degree of similarity between the level of expression of said genes in each of said tumor samples to the mean level of expression of the same genes of the remaining tumor samples that compose said pool to obtain a rank-ordered list, said degree of similarity being expressed as a similarity value; (b) determining an acceptable number of false negatives in said classifying, wherein said false negatives are breast cancer patients for whom the expression levels of said at least five of the genes for which markers are listed in Table 5 in said cell sample predict that said patient will have no distant metastases within the first five years after initial diagnosis, but who has had a distant metastasis within the first five years after initial diagnosis; (c) determining a similarity value above which in said rank ordered list fewer than said acceptable number of tumor samples are false negatives; (d) selecting said similarity value determined in step (c) as said first threshold similarity value; and (e) selecting a second similarity value, greater than said first similarity value, as said second threshold similarity value. In an even more specific embodiment of this method, said second threshold similarity value is selected in step (e) by a method comprising determining which of said tumor samples, taken from patients having a distant metastasis within five years of initial diagnosis, in said rank ordered list has the greatest similarity value, and selecting said greatest similarity value as said second threshold similarity value. In even more specific embodiments, said first and second threshold similarity values are correlation coefficients, and said first threshold similarity value is 0.4 and said second threshold similarity value is greater than 0.4. In another even more specific embodiment, using the template data provided in Table 7, said first and second threshold similarity values are correlation coefficients, and said second threshold similarity value is 0.636. In another specific embodiment, said first similarity value is a similarity value above which at most 10% false negatives are predicted in a training set of tumors, and said second correlation coefficient is a coefficient above which at most 5% false negatives are predicted in said training set of tumors. In another specific embodiment, said first correlation coefficient is a coefficient above which 10% false negatives are predicted in a training set of tumors, and said second correlation coefficient is a coefficient above which no false negatives are predicted in said training set of tumors.
In the above and other embodiments, "false negatives" are patients classified by the expression of the marker genes as having a good prognosis, or who are predicted by such expression to have a good prognosis, but who actually do develop distant metastases within five years. In a specific embodiment of the above methods, the first, second and third prognostic categories are "very good prognosis," "intermediate prognosis," and "poor prognosis," respectively. Patients classified into the first prognostic category ("very good prognosis") are likely not to have a distant metastasis within five years of initial diagnosis. Patients classified as having an "intermediate prognosis" are also unlikely to have a distant metastasis within five years of initial diagnosis, but may be recommended to undergo a different therapeutic regimen than patients having a "very good prognosis" marker gene expression profile (see below). Patients classified into the third prognostic category ("poor prognosis") are likely to have a distant metastasis within five years of initial diagnosis.
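Steps (a) through (d) can be sketched as follows. This is an illustration under assumptions rather than the procedure of the specification: the leave-one-out template is taken to be the arithmetic mean of the remaining samples, the acceptable number of false negatives is treated as an inclusive maximum (the text variously says "fewer than" and "at most"), and similarity values are assumed to be distinct. All function and parameter names are hypothetical.

```python
def leave_one_out_template(profiles, index):
    """Mean expression of every sample in the pool except profiles[index]."""
    others = [p for i, p in enumerate(profiles) if i != index]
    return [sum(column) / len(others) for column in zip(*others)]


def threshold_for_false_negatives(patients, max_false_negatives):
    """Lowest similarity that keeps false negatives within the allowed count.

    patients: list of (similarity, metastasized) pairs.
    Walks down the rank-ordered list and stops as soon as including the
    next patient would exceed the allowed number of false negatives.
    """
    ranked = sorted(patients, key=lambda p: p[0], reverse=True)
    threshold = None
    false_negatives = 0
    for similarity, metastasized in ranked:
        if metastasized:
            false_negatives += 1
        if false_negatives > max_false_negatives:
            break
        threshold = similarity
    return threshold
```

Calling the second function with an allowed count of zero returns the lowest similarity above every patient who developed a metastasis, which corresponds to the embodiment in which no false negatives are predicted above the second threshold.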

In a more specific embodiment, the similarity value is the degree of difference between the absolute (i.e., untransformed) level of expression of each of the genes in a tumor sample taken from a breast cancer patient and the mean absolute level of expression of the same genes in a control. In another more specific embodiment, the similarity value is calculated using expression level data that is transformed (see Section 5.4.3). In another more specific embodiment, the similarity value is expressed as a similarity metric, such as a correlation coefficient, representing the similarity between the level of expression of the marker genes in the tumor sample and the mean level of expression of the same genes in a plurality of breast cancer tumor samples taken from breast cancer patients.

In another specific embodiment, said first and second similarity values are derived from control expression data obtained in the same hybridization experiment as that in which the patient expression level data is obtained. In another specific embodiment, said first and second similarity values are derived from an existing set of expression data. In a more specific embodiment, said first and second correlation coefficients are derived from a mathematical sample pool (see Section 5.4.3; Example 9). For example, the expression of marker genes in new tumor samples may be compared to the pre-existing template determined for these genes for the 78 patients in the initial study; the template, or average expression levels of each of the seventy genes, can be used as a reference or control for any tumor sample. Preferably, the comparison is made to a template comprising the average expression level of at least five of the 70 genes listed in Table 6 for the 44 out of 78 patients clinically determined to have a good prognosis. The coefficient of correlation of the level of expression of these genes in the tumor sample to the 44 "good prognosis" patient template is then determined to produce a tumor correlation coefficient. For this control patient set, two similarity values have been derived: a first correlation coefficient of 0.4 and a second correlation coefficient of 0.636, derived using the 70 marker gene set listed in Table 6. New breast cancer patients whose coefficients of correlation of the expression of these marker genes with the 44-patient "good prognosis" template equal or exceed 0.636 are classified as having a "very good prognosis"; those having a coefficient of correlation of between 0.4 and 0.635 are classified as having an "intermediate prognosis"; and those having a correlation coefficient of 0.39 or less are classified as having a "poor prognosis."
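The three-way classification against the 44-patient "good prognosis" template can be sketched as a simple threshold comparison. The cutoffs 0.4 and 0.636 come from the text; treating the ranges as continuous (so that, for instance, a coefficient of 0.395 falls below the first threshold) is an assumption made here, as are the function and constant names.

```python
FIRST_THRESHOLD = 0.4     # separates "poor" from "good" prognosis
SECOND_THRESHOLD = 0.636  # separates "intermediate" from "very good"


def prognosis_category(correlation,
                       first=FIRST_THRESHOLD,
                       second=SECOND_THRESHOLD):
    """Map a tumor correlation coefficient to a prognostic category."""
    if correlation >= second:
        return "very good prognosis"
    if correlation >= first:
        return "intermediate prognosis"
    return "poor prognosis"
```

The thresholds are exposed as parameters so that values derived from a different training group of patients can be substituted without changing the logic.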

Because the above methods may utilize arrays to which fluorescently-labeled marker-derived target nucleic acids are hybridized, the invention also provides a method of classifying a breast cancer patient according to prognosis comprising the steps of (a) contacting first nucleic acids derived from a tumor sample taken from said breast cancer patient, and second nucleic acids derived from two or more tumor samples from breast cancer patients who have had no distant metastases within five years of initial diagnosis, with an array under conditions such that hybridization can occur, detecting at each of a plurality of discrete loci on said array a first fluorescent emission signal from said first nucleic acids and a second fluorescent emission signal from said second nucleic acids that are bound to said array under said conditions, wherein said array comprises at least five of the genes for which markers are listed in Table 5 and wherein at least 50% of the probes on said array are listed in Table 5; (b) calculating the similarity between said first fluorescent emission signals and said second fluorescent emission signals across said at least five genes; and (c) classifying said breast cancer patient according to prognosis of his or her breast cancer based on the similarity between said first fluorescent emission signals and said second fluorescent emission signals across said at least five genes.

Once patients have been classified as having a "very good prognosis," "intermediate prognosis" or "poor prognosis," this information can be combined with the patient's clinical data to determine an appropriate treatment regimen. In one embodiment, the patient's lymph node metastasis status (i.e., whether the patient is pN+ or pN0) is determined. Patients who are pN0 and have a "very good prognosis" or "intermediate" expression profile may be treated without adjuvant chemotherapy. All other patients should be treated with adjuvant chemotherapy. In a more specific embodiment, the patient's estrogen receptor status is also identified (i.e., whether the patient is ER(+) or ER(-)). Here, patients classified as having an "intermediate prognosis" or "poor prognosis" who are ER(+) are assigned a therapeutic regimen that additionally comprises adjuvant hormonal therapy.

Thus, the invention provides for a method of assigning a therapeutic regimen to a breast cancer patient, comprising (a) classifying said patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on the basis of the levels of expression of at least five of the genes for which markers are listed in Table 5; and (b) assigning said patient a therapeutic regimen, said therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and expression profile. In another embodiment, the invention provides a method for assigning a therapeutic regimen for a breast cancer patient, comprising determining the lymph node status for said patient; determining the level of expression of at least five of the genes listed in Table 5 in a tumor sample from said patient, thereby generating an expression profile; classifying said patient as having a "poor prognosis," "intermediate prognosis" or "very good prognosis" on the basis of said expression profile; and assigning the patient a therapeutic regimen, said therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or a therapeutic regimen comprising chemotherapy if said patient has any other combination of lymph node status and expression profile. In a more specific embodiment of the above methods, the ER status of the patient is additionally determined, and if the breast cancer patient is ER(+) and has an intermediate or poor prognosis, the therapeutic regimen additionally comprises hormonal therapy. Because in the training set of 78 breast cancer patients it was determined that the great majority of intermediate prognosis patients were also ER(+) (see Example 10), another more specific embodiment is to determine the lymph node status and expression profiles, and to assign intermediate prognosis patients adjuvant hormonal therapy (whether or not ER status has been determined). In another specific embodiment, the breast cancer patient is 52 years of age or younger. In another specific embodiment, the breast cancer patient is premenopausal. In another specific embodiment, the breast cancer patient has stage I or stage II breast cancer.
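The regimen assignment described above amounts to a small decision rule, sketched here for illustration only. The sketch assumes that the "good prognosis" category exempt from chemotherapy corresponds to the "very good prognosis" class, and that ER status is optional input; the function name, argument names, and returned labels are hypothetical.

```python
def assign_regimen(category, lymph_node_negative, er_positive=None):
    """Return the list of adjuvant therapies implied by the stated rule.

    category: "very good prognosis", "intermediate prognosis",
              or "poor prognosis".
    lymph_node_negative: True for pN0, False for pN+.
    er_positive: True, False, or None if ER status was not determined.
    """
    regimen = []
    favorable = category in ("very good prognosis", "intermediate prognosis")
    # Chemotherapy for every combination except node-negative + favorable.
    if not (lymph_node_negative and favorable):
        regimen.append("adjuvant chemotherapy")
    # Hormonal therapy is added for ER(+) intermediate or poor prognosis.
    if er_positive and category in ("intermediate prognosis", "poor prognosis"):
        regimen.append("adjuvant hormonal therapy")
    return regimen
```

An empty list corresponds to a regimen without adjuvant chemotherapy or hormonal therapy, as for a node-negative patient with a "very good prognosis" profile.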

The use of marker sets is not restricted to the prognosis of breast cancer-related conditions, and may be applied in a variety of phenotypes or conditions, clinical or experimental, in which gene expression plays a role. Where a set of markers has been identified that corresponds to two or more phenotypes, the marker set can be used to distinguish these phenotypes. For example, the phenotypes may be the diagnosis and/or prognosis of clinical states or phenotypes associated with other cancers, other disease conditions, or other physiological conditions, wherein the expression level data is derived from a set of genes correlated with the particular physiological or disease condition. Further, the expression of markers specific to other types of cancer may be used to differentiate patients or patient populations for those cancers for which different therapeutic regimens are indicated.

5.4.3 IMPROVING SENSITIVITY TO EXPRESSION LEVEL DIFFERENCES

In using the markers disclosed herein, and, indeed, using any sets of markers to differentiate an individual having one phenotype from another individual having a second phenotype, one can compare the absolute expression of each of the markers in a sample to a control; for example, the control can be the average level of expression of each of the markers, respectively, in a pool of individuals. To increase the sensitivity of the comparison, however, the expression level values are preferably transformed in a number of ways.

For example, the expression level of each of the markers can be normalized by the average expression level of all markers the expression level of which is determined, or by the average expression level of a set of control genes. Thus, in one embodiment, the markers are represented by probes on a microarray, and the expression level of each of the markers is normalized by the mean or median expression level across all of the genes represented on the microarray, including any non-marker genes. In a specific embodiment, the normalization is carried out by dividing by the median or mean level of expression of all of the genes on the microarray. In another embodiment, the expression levels of the markers are normalized by the mean or median level of expression of a set of control markers. In a specific embodiment, the control markers comprise a set of housekeeping genes. In another specific embodiment, the normalization is accomplished by dividing by the median or mean expression level of the control genes.

The sensitivity of a marker-based assay will also be increased if the expression levels of individual markers are compared to the expression of the same markers in a pool of samples. Preferably, the comparison is to the mean or median expression level of each of the marker genes in the pool of samples. Such a comparison may be accomplished, for example, by dividing the expression level of each of the markers in the sample by the mean or median expression level of the pool for each of the markers. This has the effect of accentuating the relative differences in expression between markers in the sample and markers in the pool as a whole, making comparisons more sensitive and more likely to produce meaningful results than the use of absolute expression levels alone. The expression level data may be transformed in any convenient way; preferably, the expression level data for all markers is log transformed before means or medians are taken.
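The normalization and pool comparison described above may be sketched as follows. This is an illustrative example only: the function names `normalize` and `log_ratio_to_pool`, the use of dictionaries keyed by gene name, and the choice of the median and of a base-10 logarithm are assumptions made for the sketch rather than requirements of the method.

```python
import math
from statistics import median


def normalize(levels, reference=None):
    """Divide each expression level by the median of a reference set.

    levels maps gene name -> raw expression level. If reference is None,
    the median across all genes in `levels` is used; otherwise `reference`
    is an iterable of control (e.g., housekeeping) gene levels.
    """
    ref = list(levels.values()) if reference is None else list(reference)
    m = median(ref)
    return {gene: value / m for gene, value in levels.items()}


def log_ratio_to_pool(sample, pool_samples):
    """Compare a sample to a pool, gene by gene, on a log scale.

    The levels are log transformed before the pool mean is taken, so each
    result is log10(sample level) minus the mean of log10(pool levels).
    """
    ratios = {}
    for gene, value in sample.items():
        pool_logs = [math.log10(p[gene]) for p in pool_samples]
        ratios[gene] = math.log10(value) - sum(pool_logs) / len(pool_logs)
    return ratios
```

A positive log ratio for a marker indicates higher expression in the sample than in the pool, and a negative log ratio indicates lower expression.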

In performing comparisons to a pool, two approaches may be used. First, the expression levels of the markers in the sample may be compared to the expression level of those markers in the pool, where nucleic acid derived from the sample and nucleic acid derived from the pool are hybridized during the course of a single experiment. Such an approach requires that new pool nucleic acid be generated for each comparison or limited numbers of comparisons, and is therefore limited by the amount of nucleic acid available. Alternatively, and preferably, the expression levels in a pool, whether normalized and/or transformed or not, are stored on a computer, or on computer-readable media, to be used in comparisons to the individual expression level data from the sample (i.e., single-channel data).

Thus, the current invention provides the following method of classifying a first cell or organism as having one of at least two different phenotypes, where the different phenotypes comprise a first phenotype and a second phenotype. The level of expression of each of a plurality of genes in a first sample from the first cell or organism is compared to the level of expression of each of said genes, respectively, in a pooled sample from a plurality of cells or organisms, the plurality of cells or organisms comprising different cells or organisms exhibiting said at least two different phenotypes, respectively, to produce a first compared value. The first compared value is then compared to a second compared value, wherein said second compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having said first phenotype to the level of expression of each of said genes, respectively, in the pooled sample. The first compared value is then compared to a third compared value, wherein said third compared value is the product of a method comprising comparing the level of expression of each of the genes in a sample from a cell or organism characterized as having the second phenotype to the level of expression of each of the genes, respectively, in the pooled sample. Optionally, the first compared value can be compared to additional compared values, respectively, where each additional compared value is the product of a method comprising comparing the level of expression of each of said genes in a sample from a cell or organism characterized as having a phenotype different from said first and second phenotypes but included among the at least two different phenotypes, to the level of expression of each of said genes, respectively, in said pooled sample. Finally, a determination is made as to which of said second, third, and, if present, one or more additional compared values, said first compared value is most similar, wherein the first cell or organism is determined to have the phenotype of the cell or organism used to produce said compared value most similar to said first compared value.
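The classification method just described may be illustrated by the following minimal sketch. The names `log_ratios`, `distance`, and `classify` are hypothetical, the compared values are taken to be per-gene log ratios against the pooled sample, and "most similar" is assumed here to mean the smallest Euclidean distance; other similarity measures could equally be used.

```python
import math


def log_ratios(sample, pool):
    """Compared value: per-gene log10 ratio of sample level to pool level."""
    return [math.log10(s / p) for s, p in zip(sample, pool)]


def distance(a, b):
    """Euclidean distance between two compared values (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(first_sample, pool, phenotype_samples):
    """Assign the phenotype whose compared value is most similar.

    phenotype_samples maps phenotype name -> expression levels of a sample
    characterized as having that phenotype, in the same gene order as
    first_sample and pool. Two or more phenotypes may be supplied.
    """
    first = log_ratios(first_sample, pool)
    compared = {
        name: log_ratios(levels, pool)
        for name, levels in phenotype_samples.items()
    }
    return min(compared, key=lambda name: distance(first, compared[name]))
```

Because the pooled levels and the phenotype-specific levels enter only as stored numbers, a function of this kind can operate on values retrieved from a computer-readable medium rather than on a fresh hybridization of pool nucleic acid.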

In a specific embodiment of this method, the compared values are each ratios of the levels of expression of each of said genes. In another specific embodiment, the levels of expression of each of the genes in the pooled sample are normalized prior to any of the comparing steps. In a more specific embodiment, the normalization of the levels of expression is carried out by dividing by the median or mean level of the expression of each of the genes or dividing by the mean or median level of expression of one or more housekeeping genes in the pooled sample from said cell or organism. In another specific embodiment, the normalized levels of expression are subjected to a log transform, and the comparing steps comprise subtracting the log transform from the log of the levels of expression of each of the genes in the sample. In another specific embodiment, the two or more different phenotypes are different stages of a disease or disorder. In still another specific embodiment, the two or more different phenotypes are different prognoses of a disease or disorder. In yet another specific embodiment, the levels of expression of each of the genes, respectively, in the pooled sample or said levels of expression of each of said genes in a sample from the cell or organism characterized as having the first phenotype, second phenotype, or said phenotype different from said first and second phenotypes, respectively, are stored on a computer or on a computer-readable medium.

In another specific embodiment, the two phenotypes are ER(+) and ER(-) status. In another specific embodiment, the two phenotypes are BRCA1 and sporadic tumor-type status. In yet another specific embodiment, the two phenotypes are good prognosis and poor prognosis.

In another specific embodiment, the comparison is made between the expression of each of the genes in the sample and the expression of the same genes in a pool representing only one of two or more phenotypes. In the context of prognosis-correlated genes, for example, one can compare the expression levels of prognosis-related genes in a sample to the average level of the expression of the same genes in a "good prognosis" pool of samples (as opposed to a pool of samples that include samples from patients having poor prognoses and good prognoses). Thus, in this method, a sample is classified as having a good prognosis if the level of expression of prognosis-correlated genes exceeds a chosen coefficient of correlation to the average "good prognosis" expression profile (i.e., the level of expression of prognosis-correlated genes in a pool of samples from patients having a "good prognosis"). Patients whose expression levels correlate more poorly with the "good prognosis" expression profile (i.e., whose correlation coefficient fails to exceed the chosen coefficient) are classified as having a poor prognosis. The method can be applied to subdivisions of these prognostic classes. For example, in a specific embodiment, the phenotype is good prognosis and said determination comprises (1) determining the coefficient of correlation between the expression of said plurality of genes in the sample and of the same genes in said pooled sample; (2) selecting a first correlation coefficient value between 0.4 and +1 and a second correlation coefficient value between 0.4 and +1, wherein said second value is larger than said first value; and (3) classifying said sample as "very good prognosis" if said coefficient of correlation equals or is greater than said second correlation coefficient value, "intermediate prognosis" if said coefficient of correlation equals or exceeds said first correlation coefficient value, and is less than said second correlation coefficient value, or "poor prognosis" if said coefficient of correlation is less than said first correlation coefficient value.
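The three-class determination of steps (1) to (3) may be sketched as follows. The sketch assumes a Pearson correlation coefficient and uses threshold values of 0.4 and 0.6 purely as placeholders within the recited range of 0.4 to +1; the function names `pearson` and `classify_prognosis` and the default thresholds are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def classify_prognosis(sample, good_template, first=0.4, second=0.6):
    """Classify a sample by its correlation to a "good prognosis" template.

    sample and good_template hold expression values (e.g., log ratios) for
    the same prognosis-correlated genes in the same order. first and second
    are the two selected correlation coefficient values; the defaults are
    placeholders only.
    """
    if not (0.4 <= first < second <= 1.0):
        raise ValueError("require 0.4 <= first < second <= 1")
    r = pearson(sample, good_template)
    if r >= second:
        return "very good prognosis"
    if r >= first:
        return "intermediate prognosis"
    return "poor prognosis"
```

Setting only a single threshold (i.e., ignoring the second value) recovers the two-class good/poor determination described at the start of this embodiment.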

Of course, single-channel data may also be used without specific comparison to a mathematical sample pool. For example, a sample may be classified as having a first or a second phenotype, wherein the first and second phenotypes are related, by calculating the similarity between the expression of at least 5 markers in the sample, where the markers are correlated with the first or second phenotype, to the expression of the same markers in a first phenotype template and a second phenotype template, by (a) labeling nucleic acids derived from a sample with a fluorophore to obtain a pool of fluorophore-labeled nucleic acids; (b) contacting said fluorophore-labeled nucleic acid with a microarray under conditions such that hybridization can occur, and detecting at each of a plurality of discrete loci on the microarray a fluorescent emission signal from said fluorophore-labeled nucleic acid that is bound to said microarray under said conditions; and (c) determining the similarity of marker gene expression in the individual sample to the first and second templates, wherein if said expression is more similar to the first template, the sample is classified as having the first phenotype, and if said expression is more similar to the second template, the sample is classified as having the second phenotype.

5.5 DETERMINATION OF MARKER GENE EXPRESSION LEVELS

5.5.1 METHODS

The expression levels of the marker genes in a sample may be determined by any means known in the art. The expression level may be determined by isolating and determining the level (i.e., amount) of nucleic acid transcribed from each marker gene. Alternatively, or additionally, the level of specific proteins translated from mRNA transcribed from a marker gene may be determined.

Determination of the level of expression of specific marker genes can be accomplished by determining the amount of mRNA, or polynucleotides derived therefrom, present in a sample. Any method for determining RNA levels can be used. For example, RNA is isolated from a sample and separated on an agarose gel. The separated RNA is then transferred to a solid support, such as a filter. Nucleic acid probes representing one or more markers are then hybridized to the filter by northern hybridization, and the amount of marker-derived RNA is determined. Such determination can be visual, or machine-aided, for example, by use of a densitometer. Another method of determining RNA levels is by use of a dot-blot or a slot-blot. In this method, RNA, or nucleic acid derived therefrom, from a sample is labeled. The RNA or nucleic acid derived therefrom is then hybridized to a filter containing oligonucleotides derived from one or more marker genes, wherein the oligonucleotides are placed upon the filter at discrete, easily-identifiable locations. Hybridization, or lack thereof, of the labeled RNA to the filter-bound oligonucleotides is determined visually or by densitometer. Polynucleotides can be labeled using a radiolabel or a fluorescent (i.e., visible) label.

These examples are not intended to be limiting; other methods of determining RNA abundance are known in the art.

The level of expression of particular marker genes may also be assessed by determining the level of the specific protein expressed from the marker genes. This can be accomplished, for example, by separation of proteins from a sample on a polyacrylamide gel, followed by identification of specific marker-derived proteins using antibodies in a western blot. Alternatively, proteins can be separated by two-dimensional gel electrophoresis systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves isoelectric focusing along a first dimension followed by SDS-PAGE electrophoresis along a second dimension. See, e.g., Hames et al., 1990, GEL ELECTROPHORESIS OF PROTEINS: A PRACTICAL APPROACH, IRL Press, New York; Shevchenko et al., Proc. Nat'l Acad. Sci. USA 93:1440-1445 (1996); Sagliocco et al., Yeast 12:1519-1533 (1996); Lander, Science 274:536-539 (1996). The resulting electropherograms can be analyzed by numerous techniques, including mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal and monoclonal antibodies.

Alternatively, marker-derived protein levels can be determined by constructing an antibody microarray in which binding sites comprise immobilized, preferably monoclonal, antibodies specific to a plurality of protein species encoded by the cell genome. Preferably, antibodies are present for a substantial fraction of the marker-derived proteins of interest. Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane, 1988, ANTIBODIES: A LABORATORY MANUAL, Cold Spring Harbor, New York, which is incorporated in its entirety for all purposes). In one embodiment, monoclonal antibodies are raised against synthetic peptide fragments designed based on genomic sequence of the cell. With such an antibody array, proteins from the cell are contacted to the array, and their binding is assayed with assays known in the art. Generally, the expression, and the level of expression, of proteins of diagnostic or prognostic interest can be detected through immunohistochemical staining of tissue slices or sections.

Finally, expression of marker genes in a number of tissue specimens may be characterized using a "tissue array" (Kononen et al., Nat. Med. 4(7):844-7 (1998)). In a tissue array, multiple tissue samples are assessed on the same microarray. The arrays allow in situ detection of RNA and protein levels; consecutive sections allow the analysis of multiple samples simultaneously.

5.5.2 MICROARRAYS

In preferred embodiments, polynucleotide microarrays are used to measure expression so that the expression status of each of the markers above is assessed simultaneously. In a specific embodiment, the invention provides for oligonucleotide or cDNA arrays comprising probes hybridizable to the genes corresponding to each of the marker sets described above (i.e., markers to determine the molecular type or subtype of a tumor; markers to distinguish ER status; markers to distinguish BRCA1 from sporadic tumors; markers to distinguish patients with good versus patients with poor prognosis; markers to distinguish both ER(+) from ER(-), and BRCA1 tumors from sporadic tumors; markers to distinguish ER(+) from ER(-), and patients with good prognosis from patients with poor prognosis; markers to distinguish BRCA1 tumors from sporadic tumors, and patients with good prognosis from patients with poor prognosis; and markers able to distinguish ER(+) from ER(-), BRCA1 tumors from sporadic tumors, and patients with good prognosis from patients with poor prognosis; and markers unique to each status).

The microarrays provided by the present invention may comprise probes hybridizable to the genes corresponding to markers able to distinguish the status of one, two, or all three of the clinical conditions noted above. In particular, the invention provides polynucleotide arrays comprising probes to a subset or subsets of at least 50, 100, 200, 300, 400, 500, 750, 1,000, 1,250, 1,500, 1,750, 2,000 or 2,250 genetic markers, up to the full set of 2,460 markers, which distinguish ER(+) and ER(-) patients or tumors. The invention also provides probes to subsets of at least 20, 30, 40, 50, 75, 100, 150, 200, 250, 300, 350 or 400 markers, up to the full set of 430 markers, which distinguish between tumors containing a BRCA1 mutation and sporadic tumors within an ER(-) group of tumors. The invention also provides probes to subsets of at least 20, 30, 40, 50, 75, 100, 150 or 200 markers, up to the full set of 231 markers, which distinguish between patients with good and poor prognosis within sporadic tumors. In a specific embodiment, the array comprises probes to marker sets or subsets directed to any two of the clinical conditions. In a more specific embodiment, the array comprises probes to marker sets or subsets directed to all three clinical conditions.

In specific embodiments, the invention provides polynucleotide arrays in which the breast cancer-related markers described herein comprise at least 50%, 60%, 70%, 80%, 85%, 90%, 95% or 98% of the probes on said array. In another specific embodiment, the invention provides polynucleotide arrays in which ER status-related markers selected from Table 1 comprise at least 50%, 60%, 70%, 80%, 85%, 90%, 95% or 98% of the probes on said array. In another specific embodiment, the invention provides polynucleotide arrays in which BRCA1/sporadic markers selected from Table 3 comprise at least 50%, 60%, 70%, 80%, 85%, 90%, 95% or 98% of the probes on said array. In another specific embodiment, the invention provides polynucleotide arrays in which prognostic markers selected from Table 5 comprise at least 50%, 60%, 70%, 80%, 85%, 90%, 95% or 98% of the probes on said array.

In yet another specific embodiment, microarrays that are used in the methods disclosed herein optionally comprise markers additional to at least some of the markers listed in Tables 1-6. For example, in a specific embodiment, the microarray is a screening or scanning array as described in Altschuler et al., International Publication WO 02/18646, published March 7, 2002 and Scherer et al., International Publication WO 02/16650, published February 28, 2002. The scanning and screening arrays comprise regularly-spaced, positionally-addressable probes derived from genomic nucleic acid sequence, both expressed and unexpressed. Such arrays may comprise probes corresponding to a subset of, or all of, the markers listed in Tables 1-6, or a subset thereof as described above, and can be used to monitor marker expression in the same way as a microarray containing only markers listed in Tables 1-6.

In yet another specific embodiment, the microarray is a commercially-available cDNA microarray that comprises at least five of the markers listed in Tables 1-6. Preferably, a commercially-available cDNA microarray comprises all of the markers listed in Tables 1-6. However, such a microarray may comprise 5, 10, 15, 25, 50, 100, 150, 250, 500, 1000 or more of the markers in any of Tables 1-6, up to the maximum number of markers in a Table, and may comprise all of the markers in any one of Tables 1-6 and a subset of another of Tables 1-6, or subsets of each as described above. In a specific embodiment of the microarrays used in the methods disclosed herein, the markers that are all or a portion of Tables 1-6 make up at least 50%, 60%, 70%, 80%, 90%, 95% or 98% of the probes on the microarray.

General methods pertaining to the construction of microarrays comprising the marker sets and/or subsets above are described in the following sections.

5.5.2.1 CONSTRUCTION OF MICROARRAYS

Microarrays are prepared by selecting probes which comprise a polynucleotide sequence, and then immobilizing such probes to a solid support or surface. For example, the probes may comprise DNA sequences, RNA sequences, or copolymer sequences of DNA and RNA. The polynucleotide sequences of the probes may also comprise DNA and/or RNA analogues, or combinations thereof. For example, the polynucleotide sequences of the probes may be full or partial fragments of genomic DNA. The polynucleotide sequences of the probes may also be synthesized nucleotide sequences, such as synthetic oligonucleotide sequences. The probe sequences can be synthesized either enzymatically in vivo, enzymatically in vitro (e.g., by PCR), or non-enzymatically in vitro.

The probe or probes used in the methods of the invention are preferably immobilized to a solid support which may be either porous or non-porous. For example, the probes of the invention may be polynucleotide sequences which are attached to a nitrocellulose or nylon membrane or filter covalently at either the 3' or the 5' end of the polynucleotide. Such hybridization probes are well known in the art (see, e.g., Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)). Alternatively, the solid support or surface may be a glass or plastic surface. In a particularly preferred embodiment, hybridization levels are measured to microarrays of probes consisting of a solid phase on the surface of which are immobilized a population of polynucleotides, such as a population of DNA or DNA mimics, or, alternatively, a population of RNA or RNA mimics. The solid phase may be a nonporous or, optionally, a porous material such as a gel.

In preferred embodiments, a microarray comprises a support or surface with an ordered array of binding (e.g., hybridization) sites or "probes" each representing one of the markers described herein. Preferably the microarrays are addressable arrays, and more preferably positionally addressable arrays. More specifically, each probe of the array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position in the array (i.e., on the support or surface). In preferred embodiments, each probe is covalently attached to the solid support at a single site. Microarrays can be made in a number of ways, of which several are described below. However produced, microarrays share certain characteristics. The arrays are reproducible, allowing multiple copies of a given array to be produced and easily compared with each other. Preferably, microarrays are made from materials that are stable under binding (e.g., nucleic acid hybridization) conditions. The microarrays are preferably small, e.g., between 1 cm² and 25 cm², between 12 cm² and 13 cm², or 3 cm². However, larger arrays are also contemplated and may be preferable, e.g., for use in screening arrays. Preferably, a given binding site or unique set of binding sites in the microarray will specifically bind (e.g., hybridize) to the product of a single gene in a cell (e.g., to a specific mRNA, or to a specific cDNA derived therefrom). However, in general, other related or similar sequences will cross hybridize to a given binding site. The microarrays of the present invention include one or more test probes, each of which has a polynucleotide sequence that is complementary to a subsequence of RNA or DNA to be detected. Preferably, the position of each probe on the solid surface is known. Indeed, the microarrays are preferably positionally addressable arrays. Specifically, each probe of the array is preferably located at a known, predetermined position on the solid support such that the identity (i.e., the sequence) of each probe can be determined from its position on the array (i.e., on the support or surface).

According to the invention, the microarray is an array (i.e., a matrix) in which each position represents one of the markers described herein. For example, each position can contain a DNA or DNA analogue based on genomic DNA to which a particular RNA or cDNA transcribed from that genetic marker can specifically hybridize. The DNA or DNA analogue can be, e.g., a synthetic oligomer or a gene fragment. In one embodiment, probes representing each of the markers are present on the array. In a preferred embodiment, the array comprises 550 of the 2,460 ER-status markers, 70 of the BRCA1/sporadic markers, and all 231 of the prognosis markers.

5.5.2.2 PREPARING PROBES FOR MICROARRAYS

As noted above, the "probe" to which a particular polynucleotide molecule specifically hybridizes according to the invention contains a complementary genomic polynucleotide sequence. The probes of the microarray preferably consist of nucleotide sequences of no more than 1,000 nucleotides. In some embodiments, the probes of the array consist of nucleotide sequences of 10 to 1,000 nucleotides. In a preferred embodiment, the nucleotide sequences of the probes are in the range of 10-200 nucleotides in length and are genomic sequences of a species of organism, such that a plurality of different probes is present, with sequences complementary and thus capable of hybridizing to the genome of such a species of organism, sequentially tiled across all or a portion of such genome. In other specific embodiments, the probes are in the range of 10-30 nucleotides in length, in the range of 10-40 nucleotides in length, in the range of 20-50 nucleotides in length, in the range of 40-80 nucleotides in length, in the range of 50-150 nucleotides in length, in the range of 80-120 nucleotides in length, and most preferably are 60 nucleotides in length.
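The sequential tiling of fixed-length probes across a sequence may be illustrated as follows. This is a minimal sketch only: the function name `tile_probes` and its `length` and `step` parameters are hypothetical, and the sketch simply extracts windows from a given sequence without applying any of the probe selection criteria discussed elsewhere herein.

```python
def tile_probes(target, length=60, step=60):
    """Return (start, sequence) pairs for probes tiled across `target`.

    length is the probe length in nucleotides (60 by default, matching the
    most preferred length above); step is the offset between consecutive
    probe starts, so step == length gives end-to-end tiling and a smaller
    step gives overlapping probes. Windows that would run past the end of
    the target are omitted.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    target = target.upper()
    return [
        (start, target[start:start + length])
        for start in range(0, len(target) - length + 1, step)
    ]
```

For a 200-nucleotide target, end-to-end tiling with 60-mers yields three probes starting at positions 0, 60 and 120, while a step of 20 yields eight overlapping probes.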

The probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) corresponding to a portion of an organism's genome. In another embodiment, the probes of the microarray are complementary RNA or RNA mimics. DNA mimics are polymers composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of specific hybridization with RNA. The nucleic acids can be modified at the base moiety, at the sugar moiety, or at the phosphate backbone. Exemplary DNA mimics include, e.g., phosphorothioates.

DNA can be obtained, e.g., by polymerase chain reaction (PCR) amplification of genomic DNA or cloned sequences. PCR primers are preferably chosen based on a known sequence of the genome that will result in amplification of specific fragments of genomic DNA. Computer programs that are well known in the art are useful in the design of primers with the required specificity and optimal amplification properties, such as Oligo version 5.0 (National Biosciences). Typically each probe on the microarray will be between 10 bases and 50,000 bases, usually between 300 bases and 1,000 bases in length. PCR methods are well known in the art, and are described, for example, in Innis et al., eds., PCR PROTOCOLS: A GUIDE TO METHODS AND APPLICATIONS, Academic Press Inc., San Diego, CA (1990). It will be apparent to one skilled in the art that controlled robotic systems are useful for isolating and amplifying nucleic acids.

An alternative, preferred means for generating the polynucleotide probes of the microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using H-phosphonate or phosphoramidite chemistries (Froehler et al., Nucleic Acid Res. 14:5399-5407 (1986); McBride et al., Tetrahedron Lett. 24:246-248 (1983)). Synthetic sequences are typically between about 10 and about 500 bases in length, more typically between about 20 and about 100 bases, and most preferably between about 40 and about 70 bases in length. In some embodiments, synthetic nucleic acids include non-natural bases, such as, but by no means limited to, inosine. As noted above, nucleic acid analogues may be used as binding sites for hybridization. An example of a suitable nucleic acid analogue is peptide nucleic acid (see, e.g., Egholm et al., Nature 363:566-568 (1993); U.S. Patent No. 5,539,083).

Probes are preferably selected using an algorithm that takes into account binding energies, base composition, sequence complexity, cross-hybridization binding energies, and secondary structure (see Friend et al., International Patent Publication WO 01/05935, published January 25, 2001; Hughes et al., Nat. Biotech. 19:342-7 (2001)). A skilled artisan will also appreciate that positive control probes, e.g., probes known to be complementary and hybridizable to sequences in the target polynucleotide molecules, and negative control probes, e.g., probes known to not be complementary and hybridizable to sequences in the target polynucleotide molecules, should be included on the array. In one embodiment, positive controls are synthesized along the perimeter of the array. In another embodiment, positive controls are synthesized in diagonal stripes across the array. In still another embodiment, the reverse complement for each probe is synthesized next to the position of the probe to serve as a negative control. In yet another embodiment, sequences from other species of organism are used as negative controls or as "spike-in" controls.
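The embodiment in which the reverse complement of each probe is placed next to the probe as a negative control may be sketched as follows. The names `reverse_complement` and `with_negative_controls` and the `_rc` suffix are hypothetical, and the sketch assumes DNA probes written with the letters A, C, G and T only.

```python
# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_negative_controls(probes):
    """Interleave each probe with its reverse complement.

    probes is a list of (name, sequence) pairs. The result lists every
    probe immediately followed by a control named "<name>_rc", mirroring
    a layout in which the control is adjacent to its probe.
    """
    layout = []
    for name, seq in probes:
        layout.append((name, seq))
        layout.append((name + "_rc", reverse_complement(seq)))
    return layout
```

Because a labeled target that hybridizes to a probe has the same sequence as that probe's reverse complement, it is not expected to hybridize to the adjacent control, which is what makes the reverse complement suitable as a negative control.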

5.5.2.3 ATTACHING PROBES TO THE SOLID SURFACE

The probes are attached to a solid support or surface, which may be made, e.g., from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous or nonporous material. A preferred method for attaching the nucleic acids to a surface is by printing on glass plates, as is described generally by Schena et al., Science 270:467-470 (1995). This method is especially useful for preparing microarrays of cDNA (See also, DeRisi et al., Nature Genetics 14:457-460 (1996); Shalon et al., Genome Res. 6:639-645 (1996); and Schena et al., Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286 (1995)).

A second preferred method for making microarrays is by making high-density oligonucleotide arrays. Techniques are known for producing arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface using photolithographic techniques for synthesis in situ (see, Fodor et al., 1991, Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al., 1996, Nature Biotechnology 14:1675; U.S. Patent Nos. 5,578,832; 5,556,752; and 5,510,270) or other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et al., Biosensors & Bioelectronics 11:687-690). When these methods are used, oligonucleotides (e.g., 60-mers) of known sequence are synthesized directly on a surface such as a derivatized glass slide. Usually, the array produced is redundant, with several oligonucleotide molecules per RNA.

Other methods for making microarrays, e.g., by masking (Maskos and Southern, 1992, Nuc. Acids. Res. 20:1679-1684), may also be used. In principle, and as noted supra, any type of array, for example, dot blots on a nylon hybridization membrane (see Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)) could be used. However, as will be recognized by those skilled in the art, very small arrays will frequently be preferred because hybridization volumes will be smaller.

In one embodiment, the arrays of the present invention are prepared by synthesizing polynucleotide probes on a support. In such an embodiment, polynucleotide probes are attached to the support covalently at either the 3' or the 5' end of the polynucleotide.

In a particularly preferred embodiment, microarrays of the invention are manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g., using the methods and systems described by Blanchard in U.S. Pat. No. 6,028,189; Blanchard et al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in SYNTHETIC DNA ARRAYS IN GENETIC ENGINEERING, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 111-123. Specifically, the oligonucleotide probes in such microarrays are preferably synthesized in arrays, e.g., on a glass slide, by serially depositing individual nucleotide bases in "microdroplets" of a high surface tension solvent such as propylene carbonate. The microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and are separated from each other on the microarray (e.g., by hydrophobic domains) to form circular surface tension wells which define the locations of the array elements (i.e., the different probes). Microarrays manufactured by this ink-jet method are typically of high density, preferably having a density of at least about 2,500 different probes per 1 cm². The polynucleotide probes are attached to the support covalently at either the 3' or the 5' end of the polynucleotide.

5.5.2.4 TARGET POLYNUCLEOTIDE MOLECULES

The polynucleotide molecules which may be analyzed by the present invention (the "target polynucleotide molecules") may be from any clinically relevant source, but are expressed RNA or a nucleic acid derived therefrom (e.g., cDNA or amplified RNA derived from cDNA that incorporates an RNA polymerase promoter), including naturally occurring nucleic acid molecules, as well as synthetic nucleic acid molecules. In one embodiment, the target polynucleotide molecules comprise RNA, including, but by no means limited to, total cellular RNA, poly(A)⁺ messenger RNA (mRNA) or fraction thereof, cytoplasmic mRNA, or RNA transcribed from cDNA (i.e., cRNA; see, e.g., Linsley & Schelter, U.S. Patent Application No. 09/411,074, filed October 4, 1999, or U.S. Patent Nos. 5,545,522, 5,891,636, or 5,716,785). Methods for preparing total and poly(A)⁺ RNA are well known in the art, and are described generally, e.g., in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989). In one embodiment, RNA is extracted from cells of the various types of interest in this invention using guanidinium thiocyanate lysis followed by CsCl centrifugation (Chirgwin et al., 1979, Biochemistry 18:5294-5299). In another embodiment, total RNA is extracted using a silica gel-based column, commercially available examples of which include RNeasy (Qiagen, Valencia, California) and StrataPrep (Stratagene, La Jolla, California). In an alternative embodiment, which is preferred for S. cerevisiae, RNA is extracted from cells using phenol and chloroform, as described in Ausubel et al., eds., 1989, CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, Vol III, Green Publishing Associates, Inc., John Wiley & Sons, Inc., New York, at pp. 13.12.1-13.12.5. Poly(A)⁺ RNA can be selected, e.g., by selection with oligo-dT cellulose or, alternatively, by oligo-dT primed reverse transcription of total cellular RNA. In one embodiment, RNA can be fragmented by methods known in the art, e.g., by incubation with ZnCl₂, to generate fragments of RNA. In another embodiment, the polynucleotide molecules analyzed by the invention comprise cDNA, or PCR products of amplified RNA or cDNA.

In one embodiment, total RNA, mRNA, or nucleic acids derived therefrom, is isolated from a sample taken from a person afflicted with breast cancer. Target polynucleotide molecules that are poorly expressed in particular cells may be enriched using normalization techniques (Bonaldo et al, 1996, Genome Res. 6:791-806).

As described above, the target polynucleotides are detectably labeled at one or more nucleotides. Any method known in the art may be used to detectably label the target polynucleotides. Preferably, this labeling incorporates the label uniformly along the length of the RNA, and more preferably, the labeling is carried out at a high degree of efficiency. One embodiment for this labeling uses oligo-dT primed reverse transcription to incorporate the label; however, conventional implementations of this method are biased toward generating 3' end fragments. Thus, in a preferred embodiment, random primers (e.g., 9-mers) are used in reverse transcription to uniformly incorporate labeled nucleotides over the full length of the target polynucleotides. Alternatively, random primers may be used in conjunction with PCR methods or T7 promoter-based in vitro transcription methods in order to amplify the target polynucleotides. In a preferred embodiment, the detectable label is a luminescent label. For example, fluorescent labels, bioluminescent labels, chemiluminescent labels, and colorimetric labels may be used in the present invention. In a highly preferred embodiment, the label is a fluorescent label, such as a fluorescein, a phosphor, a rhodamine, or a polymethine dye derivative. Examples of commercially available fluorescent labels include, for example, fluorescent phosphoramidites such as FluorePrime (Amersham Pharmacia, Piscataway, N.J.), Fluoredite (Millipore, Bedford, Mass.), FAM (ABI, Foster City, Calif.), and Cy3 or Cy5 (Amersham Pharmacia, Piscataway, N.J.). In another embodiment, the detectable label is a radiolabeled nucleotide.

In a further preferred embodiment, target polynucleotide molecules from a patient sample are labeled differentially from target polynucleotide molecules of a standard. The standard can comprise target polynucleotide molecules from normal individuals (i.e., those not afflicted with breast cancer). In a highly preferred embodiment, the standard comprises target polynucleotide molecules pooled from samples from normal individuals or tumor samples from individuals having sporadic-type breast tumors. In another embodiment, the target polynucleotide molecules are derived from the same individual, but are taken at different time points, and thus indicate the efficacy of a treatment by a change in expression of the markers, or lack thereof, during and after the course of treatment (i.e., chemotherapy, radiation therapy or cryotherapy), wherein a change in the expression of the markers from a poor prognosis pattern to a good prognosis pattern indicates that the treatment is efficacious. In this embodiment, different timepoints are differentially labeled.

5.5.2.5 HYBRIDIZATION TO MICROARRAYS

Nucleic acid hybridization and wash conditions are chosen so that the target polynucleotide molecules specifically bind or specifically hybridize to the complementary polynucleotide sequences of the array, preferably to a specific array site, wherein its complementary DNA is located.

Arrays containing double-stranded probe DNA situated thereon are preferably subjected to denaturing conditions to render the DNA single-stranded prior to contacting with the target polynucleotide molecules. Arrays containing single-stranded probe DNA (e.g., synthetic oligodeoxyribonucleic acids) may need to be denatured prior to contacting with the target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self complementary sequences.

Optimal hybridization conditions will depend on the length (e.g., oligomer versus polynucleotide greater than 200 bases) and type (e.g., RNA, or DNA) of probe and target nucleic acids. One of skill in the art will appreciate that as the oligonucleotides become shorter, it may become necessary to adjust their length to achieve a relatively uniform melting temperature for satisfactory hybridization results. General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al., MOLECULAR CLONING - A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989), and in Ausubel et al., CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, vol. 2, Current Protocols Publishing, New York (1994). Typical hybridization conditions for the cDNA microarrays of Schena et al. are hybridization in 5 X SSC plus 0.2% SDS at 65 °C for four hours, followed by washes at 25 °C in low stringency wash buffer (1 X SSC plus 0.2% SDS), followed by 10 minutes at 25 °C in higher stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Schena et al., Proc. Natl. Acad. Sci. U.S.A. 93:10614 (1993)). Useful hybridization conditions are also provided in, e.g., Tijessen, 1993, HYBRIDIZATION WITH NUCLEIC ACID PROBES, Elsevier Science Publishers B.V.; and Kricka, 1992, NONISOTOPIC DNA PROBE TECHNIQUES, Academic Press, San Diego, CA.

Particularly preferred hybridization conditions include hybridization at a temperature at or near the mean melting temperature of the probes (e.g., within 5 °C, more preferably within 2 °C) in 1 M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium sarcosine and 30% formamide.

5.5.2.6 SIGNAL DETECTION AND DATA ANALYSIS

When fluorescently labeled probes are used, the fluorescence emissions at each site of a microarray may be, preferably, detected by scanning confocal laser microscopy. In one embodiment, a separate scan, using the appropriate excitation line, is carried out for each of the two fluorophores used. Alternatively, a laser may be used that allows simultaneous specimen illumination at wavelengths specific to the two fluorophores and emissions from the two fluorophores can be analyzed simultaneously (see Shalon et al., 1996, "A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization," Genome Research 6:639-645, which is incorporated by reference in its entirety for all purposes). In a preferred embodiment, the arrays are scanned with a laser fluorescent scanner with a computer controlled X-Y stage and a microscope objective. Sequential excitation of the two fluorophores is achieved with a multi-line, mixed gas laser and the emitted light is split by wavelength and detected with two photomultiplier tubes. Fluorescence laser scanning devices are described in Schena et al., Genome Res. 6:639-645 (1996), and in other references cited herein. Alternatively, the fiber-optic bundle described by Ferguson et al., Nature Biotech. 14:1681-1684 (1996), may be used to monitor mRNA abundance levels at a large number of sites simultaneously.

Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., using a 12 or 16 bit analog to digital board. In one embodiment the scanned image is despeckled using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image gridding program that creates a spreadsheet of the average hybridization at each wavelength at each site. If necessary, an experimentally determined correction for "cross talk" (or overlap) between the channels for the two fluors may be made. For any particular hybridization site on the transcript array, a ratio of the emission of the two fluorophores can be calculated. The ratio is independent of the absolute expression level of the cognate gene, but is useful for genes whose expression is significantly modulated in association with the different breast cancer-related condition.
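By way of illustration only, the per-site ratio calculation with an optional cross-talk correction might be sketched as follows. The linear bleed-through model, the function name `corrected_log_ratio`, and the parameters `crosstalk_5_to_3` and `crosstalk_3_to_5` are assumptions introduced for this sketch; the specification states only that an experimentally determined correction may be made and that a ratio of the two emissions can be calculated.

```python
import math


def corrected_log_ratio(cy5, cy3, crosstalk_5_to_3=0.0, crosstalk_3_to_5=0.0):
    """Return log10(Cy5/Cy3) for one array site after a linear cross-talk correction.

    Assumed model (hypothetical, not taken from the specification):
        observed_cy5 = true_cy5 + crosstalk_3_to_5 * true_cy3
        observed_cy3 = true_cy3 + crosstalk_5_to_3 * true_cy5
    The cross-talk fractions would be determined experimentally.
    """
    det = 1.0 - crosstalk_5_to_3 * crosstalk_3_to_5
    if det <= 0:
        raise ValueError("cross-talk fractions are too large to invert")
    # Invert the 2x2 linear mixing model to recover the unmixed signals.
    true_cy5 = (cy5 - crosstalk_3_to_5 * cy3) / det
    true_cy3 = (cy3 - crosstalk_5_to_3 * cy5) / det
    if true_cy5 <= 0 or true_cy3 <= 0:
        raise ValueError("corrected intensities must be positive")
    return math.log10(true_cy5 / true_cy3)
```

Expressing the ratio on a log scale is a design choice of this sketch: it makes up- and down-regulation symmetric about zero.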

5.6 COMPUTER-FACILITATED ANALYSIS

The present invention further provides for kits comprising the marker sets above. In a preferred embodiment, the kit contains a microarray ready for hybridization to target polynucleotide molecules, plus software for the data analyses described above. The analytic methods described in the previous sections can be implemented by use of the following computer systems and according to the following programs and methods. A computer system comprises internal components linked to external components. The internal components of a typical computer system include a processor element interconnected with a main memory. For example, the computer system can be an Intel 8086-, 80386-, 80486-, Pentium™, or Pentium™-based processor with preferably 32 MB or more of main memory. The computer system may also be a Macintosh or a Macintosh-based system, but may also be a minicomputer or mainframe.

The external components may include mass storage. This mass storage can be one or more hard disks (which are typically packaged together with the processor and memory). Such hard disks are preferably of 1 GB or greater storage capacity. Other external components include a user interface device, which can be a monitor, together with an inputting device, which can be a "mouse", or other graphic input devices, and/or a keyboard. A printing device can also be attached to the computer.

Typically, a computer system is also linked to a network link, which can be part of an Ethernet link to other local computer systems, remote computer systems, or wide area communication networks, such as the Internet. This network link allows the computer system to share data and processing tasks with other computer systems.

Loaded into memory during operation of this system are several software components, which are both standard in the art and special to the instant invention. These software components collectively cause the computer system to function according to the methods of this invention. These software components are typically stored on the mass storage device. A software component comprises the operating system, which is responsible for managing the computer system and its network interconnections. This operating system can be, for example, of the Microsoft Windows® family, such as Windows 3.1, Windows 95, Windows 98, Windows 2000, or Windows NT, or may be of the Macintosh OS family, or may be UNIX or an operating system specific to a minicomputer or mainframe. Another software component represents common languages and functions conveniently present on this system to assist programs implementing the methods specific to this invention. Many high or low level computer languages can be used to program the analytic methods of this invention. Instructions can be interpreted during run-time or compiled. Preferred languages include C/C++, FORTRAN and JAVA. Most preferably, the methods of this invention are programmed in mathematical software packages that allow symbolic entry of equations and high-level specification of processing, including some or all of the algorithms to be used, thereby freeing a user of the need to procedurally program individual equations or algorithms. Such packages include Matlab from Mathworks (Natick, MA), Mathematica® from Wolfram Research (Champaign, IL), or S-Plus® from MathSoft (Cambridge, MA). Specifically, a further software component includes the analytic methods of the invention as programmed in a procedural language or symbolic package.

The software to be included with the kit comprises the data analysis methods of the invention as disclosed herein. In particular, the software may include mathematical routines for marker discovery, including the calculation of similarity values between clinical categories (e.g., ER status) and marker expression. The software may also include mathematical routines for calculating the similarity between sample marker expression and control marker expression, using array-generated fluorescence data, to determine the clinical classification of a sample.

Additionally, the software may also include mathematical routines for determining the prognostic outcome, and recommended therapeutic regimen, for a particular breast cancer patient. Such software would include instructions for the computer system's processor to receive data structures that include the level of expression of five or more of the marker genes listed in Table 5 in a breast cancer tumor sample obtained from the breast cancer patient; the mean level of expression of the same genes in a control or template; and the breast cancer patient's clinical information, including lymph node and ER status. The software may additionally include mathematical routines for transforming the hybridization data and for calculating the similarity between the expression levels for the marker genes in the patient's breast cancer tumor sample and the control or template. In a specific embodiment, the software includes mathematical routines for calculating a similarity metric, such as a coefficient of correlation, representing the similarity between the expression levels for the marker genes in the patient's breast cancer tumor sample and the control or template, and expressing the similarity as that similarity metric. The software would include decisional routines that integrate the patient's clinical and marker gene expression data, and recommend a course of therapy. In one embodiment, for example, the software causes the processor unit to receive expression data for the patient's tumor sample, calculate a metric of similarity of these expression values to the values for the same genes in a template or control, compare this similarity metric to a pre-selected similarity metric threshold or thresholds that differentiate prognostic groups, assign the patient to the prognostic group, and, on the basis of the prognostic group, assign a recommended therapeutic regimen. In a specific example, the software additionally causes the processor unit to receive data structures comprising clinical information about the breast cancer patient. In a more specific example, such clinical information includes the patient's age, stage of breast cancer, estrogen receptor status, and lymph node status.
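The sequence of steps recited above (compute a similarity metric against a template, compare it to thresholds, assign a prognostic group, and derive a regimen) might be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the threshold values are supplied by the caller and are not taken from the specification, and the template is assumed to represent a good-prognosis profile so that greater similarity maps to a better prognosis. The regimen rule follows the embodiment described later in this section (no adjuvant chemotherapy for lymph node negative patients with a very good or intermediate prognosis).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def prognostic_group(similarity, first_threshold, second_threshold):
    """Assign a group; the second threshold denotes greater similarity."""
    if second_threshold < first_threshold:
        raise ValueError("second threshold must not be below the first")
    if similarity > second_threshold:
        return "very good"
    if similarity > first_threshold:
        return "intermediate"
    return "poor"


def recommend_regimen(group, lymph_node_negative):
    """Illustrative decision rule only; not clinical guidance."""
    if lymph_node_negative and group in ("very good", "intermediate"):
        return "no adjuvant chemotherapy"
    return "chemotherapy"
```

A caller would pass the patient's marker expression values and the template values to `pearson`, then feed the result to `prognostic_group` together with two thresholds chosen in advance.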

Where the control is an expression template comprising expression values for marker genes within a group of breast cancer patients, the control can comprise either hybridization data obtained at the same time (i.e., in the same hybridization experiment) as the patient's individual hybridization data, or can be a set of hybridization or marker expression values stored on a computer, or on computer-readable media. If the latter is used, new patient hybridization data for the selected marker genes, obtained from initial or follow-up tumor samples, or suspected tumor samples, can be compared to the stored values for the same genes without the need for additional control hybridizations. However, the software may additionally comprise routines for updating the control data set, i.e., to add information from additional breast cancer patients or to remove existing members of the control data set, and, consequently, for recalculating the average expression level values that comprise the template. In another specific embodiment, said control comprises a set of single-channel mean hybridization intensity values for each of said at least five of said genes, stored on a computer-readable medium.
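A routine for maintaining a stored control data set and recalculating the template after members are added or removed could, for example, take the following form. The class name `ControlTemplate` and its methods are hypothetical and are offered only as a sketch; the specification does not prescribe any particular data structure.

```python
class ControlTemplate:
    """Stores per-patient marker expression values and derives the mean template."""

    def __init__(self, genes):
        self.genes = list(genes)   # marker gene identifiers, in a fixed order
        self._profiles = {}        # patient identifier -> list of expression values

    def add_patient(self, patient_id, values):
        values = [float(v) for v in values]
        if len(values) != len(self.genes):
            raise ValueError("one expression value is required per marker gene")
        self._profiles[patient_id] = values

    def remove_patient(self, patient_id):
        del self._profiles[patient_id]

    def mean_profile(self):
        """Recalculate the average expression level of each gene on demand."""
        if not self._profiles:
            raise ValueError("the control data set is empty")
        n = len(self._profiles)
        return [
            sum(profile[i] for profile in self._profiles.values()) / n
            for i in range(len(self.genes))
        ]
```

Recomputing the mean on each request, rather than caching it, keeps the template consistent with the current membership at the cost of a little extra arithmetic.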

Clinical data relating to a breast cancer patient, and used by the computer program products of the invention, can be contained in a database of clinical data in which information on each patient is maintained in a separate record, which record may contain any information relevant to the patient, the patient's medical history, treatment, prognosis, or participation in a clinical trial or study, including expression profile data generated as part of an initial diagnosis or for tracking the progress of the breast cancer during treatment.

Thus, one embodiment of the invention provides a computer program product for classifying a breast cancer patient according to prognosis, the computer program product for use in conjunction with a computer having a memory and a processor, the computer program product comprising a computer readable storage medium having a computer program mechanism encoded thereon, wherein said computer program product can be loaded into the one or more memory units of a computer and causes the one or more processor units of the computer to execute the steps of (a) receiving a first data structure comprising the level of expression of at least five of the genes for which markers are listed in Table 5 in a cell sample taken from said breast cancer patient; (b) determining the similarity of the level of expression of said at least five genes to control levels of expression of said at least five genes to obtain a patient similarity value; (c) comparing said patient similarity value to selected first and second threshold values of similarity of said level of expression of said genes to said control levels of expression to obtain first and second similarity threshold values, respectively, wherein said second similarity threshold indicates greater similarity to said control levels of expression than does said first similarity threshold; and (d) classifying said breast cancer patient as having a first prognosis if said patient similarity value exceeds said first and said second threshold similarity values, a second prognosis if said patient similarity value exceeds said first threshold similarity value but does not exceed said second threshold similarity value, and a third prognosis if said patient similarity value does not exceed said first threshold similarity value or said second threshold similarity value. In a specific embodiment of said computer program product, said first threshold value of similarity and said second threshold value of similarity are values stored in said computer. 
In another more specific embodiment, said first prognosis is a "very good prognosis," said second prognosis is an "intermediate prognosis," and said third prognosis is a "poor prognosis," and wherein said computer program mechanism may be loaded into the memory and further cause said one or more processor units of said computer to execute the step of assigning said breast cancer patient a therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and expression profile. In another specific embodiment, said computer program mechanism may be loaded into the memory and further cause said one or more processor units of the computer to execute the steps of receiving a data structure comprising clinical data specific to said breast cancer patient. In a more specific embodiment, said clinical data includes the lymph node and estrogen receptor (ER) status of said breast cancer patient. In a more specific embodiment, said single-channel hybridization intensity values are log transformed. The computer implementation of the method, however, may use any desired transformation method. In another specific embodiment, the computer program product causes said processing unit to perform said comparing step (c) by calculating the difference between the level of expression of each of said genes in said cell sample taken from said breast cancer patient and the level of expression of the same genes in said control. In another specific embodiment, the computer program product causes said processing unit to perform said comparing step (c) by calculating the mean log level of expression of each of said genes in said control to obtain a control mean log expression level for each gene, calculating the log expression level for each of said genes in a breast cancer sample from said breast cancer patient to obtain a patient log expression level, and calculating the difference between the patient log expression level and the control mean log expression for each of said genes. In another specific embodiment, the computer program product causes said processing unit to perform said comparing step (c) by calculating similarity between the level of expression of each of said genes in said cell sample taken from said breast cancer patient and the level of expression of the same genes in said control, wherein said similarity is expressed as a similarity value. In a more specific embodiment, said similarity value is a correlation coefficient.
The similarity value may, however, be expressed as any art-known similarity metric.

In an exemplary implementation, to practice the methods of the present invention, a user first loads experimental data into the computer system. These data can be directly entered by the user from a monitor, keyboard, or from other computer systems linked by a network connection, or on removable storage media such as a CD-ROM, floppy disk (not illustrated), tape drive (not illustrated), ZIP® drive (not illustrated) or through the network. Next the user causes execution of expression profile analysis software which performs the methods of the present invention.
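As a hedged illustration of the log-transformed comparison described above for single-channel data, the control mean log expression level and the per-gene difference for a patient might be computed as follows. The function names and the choice of base-10 logarithms are assumptions of this sketch; the specification permits any desired transformation.

```python
import math


def control_mean_log(control_samples):
    """Mean log10 intensity per gene across control samples.

    control_samples: list of samples, each a list of positive single-channel
    intensities given in the same gene order.
    """
    if not control_samples:
        raise ValueError("at least one control sample is required")
    n_genes = len(control_samples[0])
    if any(len(sample) != n_genes for sample in control_samples):
        raise ValueError("all control samples must cover the same genes")
    n = len(control_samples)
    return [
        sum(math.log10(sample[g]) for sample in control_samples) / n
        for g in range(n_genes)
    ]


def log_differences(patient_intensities, control_means):
    """Patient log10 expression minus control mean log10 expression, per gene."""
    if len(patient_intensities) != len(control_means):
        raise ValueError("patient and control must cover the same genes")
    return [math.log10(p) - m for p, m in zip(patient_intensities, control_means)]
```

The resulting vector of differences can then be passed to whichever similarity metric is selected.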

In another exemplary implementation, a user first loads experimental data and/or databases into the computer system. This data is loaded into the memory from the storage media or from a remote computer, preferably from a dynamic geneset database system, through the network. Next the user causes execution of software that performs the steps of the present invention.

Additionally, because the data obtained and analyzed in the software and computer system products of the invention are confidential, the software and/or computer system comprises access controls or access control routines, such as

Alternative computer systems and software for implementing the analytic methods of this invention will be apparent to one of skill in the art and are intended to be comprehended within the accompanying claims. In particular, the accompanying claims are intended to include the alternative program structures for implementing the methods of this invention that will be readily apparent to one of skill in the art.

6. EXAMPLES

Materials And Methods

117 tumor samples from breast cancer patients were collected. RNA samples were then prepared, and each RNA sample was profiled using inkjet-printed microarrays. Marker genes were then identified based on expression patterns; these genes were then used to train classifiers, which used these marker genes to classify tumors into diagnostic and prognostic categories. Finally, these marker genes were used to predict the diagnostic and prognostic outcome for a group of individuals.

1. Sample collection

117 breast cancer patients treated at The Netherlands Cancer Institute / Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands, were selected on the basis of the following clinical criteria (data extracted from the medical records of the NKI/AvL Tumor Register, Biometrics Department).

Group 1 (n=97, 78 for training, 19 for independent tests) was selected on the basis of: (1) primary invasive breast carcinoma <5 cm (T1 or T2); (2) no axillary metastases (N0); (3) age at diagnosis <55 years; (4) calendar year of diagnosis 1983-1996; and (5) no prior malignancies (excluding carcinoma in situ of the cervix or basal cell carcinoma of the skin). All patients were treated by modified radical mastectomy (n=34) or breast conserving treatment (n=64), including axillary lymph node dissection. Breast conserving treatment consisted of excision of the tumor, followed by radiation of the whole breast to a dose of 50 Gy, followed by a boost varying from 15 to 25 Gy. Five patients received adjuvant systemic therapy consisting of chemotherapy (n=3) or hormonal therapy (n=2); all other patients did not receive additional treatment. All patients were followed at least annually for a period of at least 5 years. Patient follow-up information was extracted from the Tumor Registry of the Biometrics Department.

Group 2 (n=20) was selected as: (1) carriers of a germline mutation in BRCA1 or BRCA2; and (2) having primary invasive breast carcinoma. No selection or exclusion was made based on tumor size, lymph node status, age at diagnosis, calendar year of diagnosis, or other malignancies. Germline mutation status was known prior to this research protocol.

Information about the individuals from whom tumor samples were collected includes: year of birth; sex; whether the individual is pre- or post-menopausal; the year of diagnosis; the number of positive lymph nodes and the total number of nodes; whether there was surgery, and if so, whether the surgery was breast-conserving or radical; whether there was radiotherapy, chemotherapy or hormonal therapy. The tumor was graded according to the formula P=TNM, where T is the tumor size (on a scale of 0-5); N is the number of nodes that are positive (on a scale of 0-4); and M is metastases (0 = absent, 1 = present). The tumor was also classified according to stage, tumor type (in situ or invasive; lobular or ductal; grade) and the presence or absence of the estrogen and progesterone receptors. The progression of the cancer was described by (where applicable): distant metastases; year of distant metastases, year of death, year of last follow-up; and BRCA1 genotype.

2. Tumors:

Germline mutation testing of BRCA1 and BRCA2 on DNA isolated from peripheral blood lymphocytes includes mutation screening by a Protein Truncation Test (PTT) of exon 11 of BRCA1 and exon 10 and 11 of BRCA1, deletion PCR of BRCA1 genomic deletion of exon 13 and 22, as well as Denaturing Gradient Gel Electrophoresis (DGGE) of the remaining exons. Aberrant bands were all confirmed by genomic sequencing analyzed on an ABI3700 automatic sequencer and confirmed on an independent DNA sample.

From all patients, tumor material was snap frozen in liquid nitrogen within one hour after surgery. Of the frozen tumor material an H&E (hematoxylin-eosin) stained section was prepared prior to and after cutting slides for RNA isolation. These H&E frozen sections were assessed for the percentage of tumor cells; only samples with >50% tumor cells were selected for further study.

For all tumors, surgical specimens fixed in formaldehyde and embedded in paraffin were evaluated according to standard histopathological procedures. H&E stained paraffin sections were examined to assess tumor type (e.g., ductal or lobular according to the WHO classification); to assess histologic grade according to the method described by Elston and Ellis (grade 1-3); and to assess the presence of lymphangio-invasive growth and the presence of an extensive lymphocytic infiltrate. All histologic factors were independently assessed by two pathologists (MV and JL); consensus on differences was reached by examining the slides together. A representative slide of each tumor was used for immunohistochemical staining with antibodies directed against the estrogen- and progesterone receptor by standard procedures. The staining result was scored as the percentage of positively staining nuclei (0%, 10%, 20%, etc., up to 100%).

3. Amplification, labeling, and hybridization

The production of marker-derived nucleic acids and the hybridization of the nucleic acids to a microarray are outlined in FIG. 2. 30 frozen sections of 30 μm thickness were used for total RNA isolation of each snap frozen tumor specimen. Total RNA was isolated with RNAzol™ B (Campro Scientific, Veenendaal, The Netherlands) according to the manufacturer's protocol, including homogenization of the tissue using a Polytron PT-MR2100 (Merck, Amsterdam, The Netherlands), and finally dissolved in RNase-free H₂O. The quality of the total RNA was assessed by the A260/A280 ratio, which had to be between 1.7 and 2.1, as well as by visual inspection of the RNA on an agarose gel, which should indicate a stronger 28S ribosomal RNA band compared to the 18S ribosomal RNA band. Subsequently, 25 μg of total RNA was DNase treated using the Qiagen RNase-free DNase kit and RNeasy spin columns (Qiagen Inc, GmbH, Germany) according to the manufacturer's protocol. DNase treated total RNA was dissolved in RNase-free H₂O to a final concentration of 0.2 μg/μl.
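The numerical acceptance criterion and the dilution arithmetic in the preceding paragraph can be illustrated with a short sketch. The function names are hypothetical, treating the stated 1.7 to 2.1 range as inclusive is an assumption, and the visual inspection of the agarose gel is not modeled.

```python
def rna_passes_a260_a280(a260, a280, low=1.7, high=2.1):
    """True if the A260/A280 ratio lies within the stated range (assumed inclusive)."""
    if a280 <= 0:
        raise ValueError("A280 must be positive")
    return low <= a260 / a280 <= high


def volume_for_concentration_ul(mass_ug, concentration_ug_per_ul):
    """Volume in microliters that holds mass_ug at the given concentration."""
    if concentration_ug_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return mass_ug / concentration_ug_per_ul
```

Under these assumptions, 25 μg dissolved to 0.2 μg/μl corresponds to a volume of about 125 μl, and a 5 μg aliquot corresponds to about 25 μl.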

5 μg total RNA was used as input for cRNA synthesis. An oligo-dT primer containing a T7 RNA polymerase promoter sequence was used to prime first strand cDNA synthesis, and random primers (pdN6) were used to prime second strand cDNA synthesis by MMLV reverse transcriptase. This reaction yielded a double-stranded cDNA that contained the T7 RNA polymerase (T7RNAP) promoter. The double-stranded cDNA was then transcribed into cRNA by T7RNAP. cRNA was labeled with Cy3 or Cy5 dyes using a two-step process. First, allylamine-derivatized nucleotides were enzymatically incorporated into cRNA products. For cRNA labeling, a 3:1 mixture of 5-(3-Aminoallyl)uridine 5'-triphosphate (Sigma) and UTP was substituted for UTP in the in vitro transcription (IVT) reaction. Allylamine-derivatized cRNA products were then reacted with N-hydroxy succinimide esters of Cy3 or Cy5 (CyDye, Amersham Pharmacia Biotech). 5 μg Cy5-labeled cRNA from one breast cancer patient was mixed with the same amount of Cy3-labeled product from a pool of equal amounts of cRNA from each individual sporadic patient.

Microarray hybridizations were done in duplicate with fluor reversals. Before hybridization, labeled cRNAs were fragmented to an average size of ~50-100 nt by heating at 60°C in the presence of 10 mM ZnCl₂. Fragmented cRNAs were added to hybridization buffer containing 1 M NaCl, 0.5% sodium sarcosine and 50 mM MES, pH 6.5, the stringency of which was regulated by the addition of formamide to a final concentration of 30%. Hybridizations were carried out in a final volume of 3 ml at 40°C on a rotating platform in a hybridization oven (Robbins Scientific) for 48 h. After hybridization, slides were washed and scanned using a confocal laser scanner (Agilent Technologies). Fluorescence intensities on scanned images were quantified, normalized and corrected.

4. Pooling of samples

The reference cRNA pool was formed by pooling equal amounts of cRNA from each individual sporadic patient, for a total of 78 tumors.

5. 25k human microarray

Surface-bound oligonucleotides were synthesized essentially as proposed by Blanchard et al., Biosens. Bioelectron. 6(7):687-690 (1996); see also Hughes et al., Nature Biotech. 19(4):342-347 (2000). Hydrophobic glass surfaces (3 inches by 3 inches) containing exposed hydroxyl groups were used as substrates for nucleotide synthesis. Phosphoramidite monomers were delivered to computer-defined positions on the glass surfaces using ink-jet printer heads. Unreacted monomers were then washed away and the ends of the extended oligonucleotides were deprotected. This cycle of monomer coupling, washing and deprotection was repeated for each desired layer of nucleotide synthesis. Oligonucleotide sequences to be printed were specified by computer files. Microarrays containing approximately 25,000 human gene sequences (Hu25K microarrays) were used for this study. Sequences for the microarrays were selected from RefSeq (a collection of non-redundant mRNA sequences, located on the Internet at nlm.nih.gov/LocusLink refseq.html) and Phil Green EST contigs, a collection of EST contigs assembled by Dr. Phil Green et al. at the University of Washington (Ewing and Green, Nat. Genet. 25(2):232-4 (2000)), available on the Internet at phrap.org/est_assembly/index.html. Each mRNA or EST contig was represented on the Hu25K microarray by a single 60mer oligonucleotide essentially as described in Hughes et al., Nature Biotech. 19(4):342-347, in International Publication WO 01/06013, published January 25, 2001, and in International Publication WO 01/05935, published January 25, 2001, except that the rules for oligo screening were modified to remove oligonucleotides with more than 30% C or with 6 or more contiguous C residues.
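The modified oligo screening rule can be illustrated with a minimal sketch. The function name `passes_c_screen` and the example sequences are hypothetical; only the two thresholds (more than 30% C, or 6 or more contiguous C residues) come from the text, and the remaining screening rules of the cited publications are not reproduced here.

```python
def passes_c_screen(oligo, max_c_fraction=0.30, min_bad_run=6):
    """Return True if an oligo survives both C-content rules.

    An oligo is rejected when more than `max_c_fraction` of its bases are C,
    or when it contains `min_bad_run` or more contiguous C residues.
    """
    seq = oligo.upper()
    if not seq:
        return False
    c_fraction = seq.count("C") / len(seq)
    has_long_c_run = ("C" * min_bad_run) in seq
    return c_fraction <= max_c_fraction and not has_long_c_run


# Hypothetical 60mer candidates, for illustration only.
candidates = [
    "ATGCATGCAT" * 6,        # 20% C, no long C run
    "AT" * 27 + "CCCCCC",    # 10% C, but a run of six C residues
    "CCCA" * 15,             # 75% C
]
kept = [seq for seq in candidates if passes_c_screen(seq)]
```

Only the first candidate is retained; the second fails the contiguous-C rule and the third fails the C-fraction rule.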

Example 1: Differentially regulated gene sets and overall expression patterns of breast cancer tumors

Of the approximately 25,000 sequences represented on the microarray, a group of approximately 5,000 genes that were significantly regulated across the group of samples was selected. A gene was determined to be significantly differentially regulated with cancer of the breast if it showed a more than two-fold change in transcript level as compared to a sporadic tumor pool, and if the p-value for differential regulation (Hughes et al., Cell 102:109-126 (2000)) was less than 0.01, either upwards or downwards, in at least five out of 98 tumor samples. An unsupervised clustering algorithm allowed us to cluster patients based on their similarities measured over this set of ~5,000 significant genes. The similarity between two patients x and y is defined as

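$$S = 1 - \left[ \frac{\sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right) \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)}{\sqrt{\sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_{x_i}} \right)^2 \sum_{i=1}^{N} \left( \frac{y_i - \bar{y}}{\sigma_{y_i}} \right)^2}} \right] \qquad \text{Equation (5)}$$

In Equation (5), x and y are two patients with components of log ratio $x_i$ and $y_i$, $i = 1, \ldots, N$, where $N = 5{,}100$. Associated with every value $x_i$ is an error $\sigma_{x_i}$. The smaller the value $\sigma_{x_i}$, the more reliable the measurement $x_i$.

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i / \sigma_{x_i}^2}{\sum_{i=1}^{N} 1 / \sigma_{x_i}^2}$$

is the error-weighted arithmetic mean. The use of correlation as the similarity metric emphasizes the importance of co-regulation in clustering rather than the amplitude of regulation.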
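The error-weighted measure of Equation (5) can be sketched in a few lines. This is an illustrative reading that assumes Equation (5) is one minus an error-weighted correlation, with each deviation from the error-weighted mean divided by its measurement error; the function names and the four-gene example are hypothetical.

```python
import math


def weighted_mean(values, errors):
    """Error-weighted arithmetic mean: sum(x / sigma^2) / sum(1 / sigma^2)."""
    weights = [1.0 / (s * s) for s in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def equation5_measure(x, sx, y, sy):
    """S = 1 - error-weighted correlation between two log-ratio profiles."""
    x_bar = weighted_mean(x, sx)
    y_bar = weighted_mean(y, sy)
    zx = [(xi - x_bar) / si for xi, si in zip(x, sx)]
    zy = [(yi - y_bar) / si for yi, si in zip(y, sy)]
    numerator = sum(a * b for a, b in zip(zx, zy))
    denominator = math.sqrt(sum(a * a for a in zx) * sum(b * b for b in zy))
    return 1.0 - numerator / denominator


# Hypothetical log ratios and errors for two patients over four genes.
x = [0.5, -0.2, 1.1, -0.7]
errors = [0.1, 0.2, 0.1, 0.3]
same_pattern = [2.0 * v for v in x]    # co-regulated, larger amplitude
opposite_pattern = [-v for v in x]     # anti-regulated
s_same = equation5_measure(x, errors, same_pattern, errors)
s_opposite = equation5_measure(x, errors, opposite_pattern, errors)
```

Doubling the amplitude of a profile leaves S at approximately 0, while inverting it gives S of approximately 2, which illustrates the statement that co-regulation rather than amplitude drives the clustering.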

The set of approximately 5,000 genes can be clustered based on their similarities measured over the group of 98 tumor samples. The similarity between two genes was defined in the same way as in Equation (5) except that now for each gene, there are 98 components of log ratio measurements.

The result of such a two-dimensional clustering is displayed in FIG. 3. Two distinctive patterns emerge from the clustering. The first pattern consists of a group of patients in the lower part of the plot whose regulations are very different from the sporadic pool. The other pattern is made up of a group of patients in the upper part of the plot whose expression is only moderately regulated in comparison with the sporadic pool. These dominant patterns suggest that the tumors can be unambiguously divided into two distinct types based on this set of ~5,000 significant genes.

To help understand these patterns, they were associated with estrogen receptor (ER), progesterone receptor (PR), tumor grade, presence of lymphocytic infiltrate, and angioinvasion (FIG. 3). The lower group in FIG. 3, which features the dominant pattern, consists of 36 patients. Of the 39 ER-negative patients, 34 are clustered together in this group. From FIG. 4, it was observed that the expression of the estrogen receptor alpha gene ESR1 and of a large group of co-regulated genes is consistent with this expression pattern.

From FIG. 3 and FIG. 4, it was concluded that gene expression patterns can be used to classify tumor samples into subgroups of diagnostic interest. Thus, genes co-regulated across 98 tumor samples contain information about the molecular basis of breast cancers. The combination of clinical data and microarray-measured gene abundance of ESR1 demonstrates that the distinct types are related to, or at least are reported by, the ER status.

Example 2: Identification of Genetic Markers Distinguishing Estrogen Receptor (+) From Estrogen Receptor (-) Patients

The results described in this Example allow the identification of expression marker genes that differentiate two major types of tumor cells: the "ER-negative" group and the "ER-positive" group. The differentiation of samples by ER(+) status was accomplished in four steps: (1) identification of a set of candidate marker genes that correlate with ER level; (2) rank-ordering these candidate genes by strength of correlation; (3) optimization of the number of marker genes; and (4) classifying samples based on these marker genes.

1. Selection of candidate discriminating genes

In the first step, a set of candidate discriminating genes was identified based on gene expression data of training samples. Specifically, we calculated the correlation coefficient $\rho$ between the category numbers or ER level $\vec{c}$ and the logarithmic expression ratio $\vec{r}$ across all the samples for each individual gene:

$$\rho = \frac{\vec{c} \cdot \vec{r}}{\lVert \vec{c} \rVert \cdot \lVert \vec{r} \rVert} \qquad \text{Equation (2)}$$

The histogram of the resultant correlation coefficients is shown in FIG. 5A as a gray line. While the amplitude of correlation or anti-correlation is small for the majority of genes, the amplitude for some genes is as great as 0.5. Genes whose expression ratios either correlate or anti-correlate well with the diagnostic category of interest are used as reporter genes for the category.
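The per-gene coefficient of Equation (2) and a threshold-based selection of reporter genes can be sketched as follows. The coding of the categories as +1 and -1, the gene names, and the data are hypothetical; the default threshold of 0.3 is the value used in this Example.

```python
import math


def correlation(c, r):
    """Equation (2): (c . r) / (|c| |r|); returns 0.0 if either norm is zero."""
    dot = sum(ci * ri for ci, ri in zip(c, r))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(ri * ri for ri in r))
    return dot / norm if norm else 0.0


def select_reporters(categories, log_ratios_by_gene, threshold=0.3):
    """Keep genes whose coefficient is above +threshold or below -threshold."""
    selected = {}
    for gene, ratios in log_ratios_by_gene.items():
        rho = correlation(categories, ratios)
        if abs(rho) > threshold:
            selected[gene] = rho
    return selected


# Hypothetical category coding: +1 for one group, -1 for the other.
categories = [1, 1, -1, -1]
log_ratios_by_gene = {
    "geneA": [0.8, 0.6, -0.7, -0.9],   # correlated with the category
    "geneB": [0.1, -0.1, 0.1, -0.1],   # uncorrelated
    "geneC": [-0.5, -0.4, 0.6, 0.5],   # anti-correlated
}
reporters = select_reporters(categories, log_ratios_by_gene)
```

In this toy data, geneA and geneC are kept as correlated and anti-correlated reporters, while geneB is dropped.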

Genes having a correlation coefficient larger than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes") were selected as reporter genes. The threshold of 0.3 was selected based on the correlation distribution for cases where there is no real correlation (one can use permutations to determine this distribution). Statistically, the width of this distribution depends upon the number of samples used in the correlation calculation. The distribution width for control cases (no real correlation) is approximately $1/\sqrt{n-3}$, where n = the number of samples. In our case, n = 98. Therefore, a threshold of 0.3 roughly corresponds to 3σ in the distribution ($3 \times 1/\sqrt{n-3}$).
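As a worked check of the threshold, using only the sample count given above (n = 98):

$$\sigma \approx \frac{1}{\sqrt{n-3}} = \frac{1}{\sqrt{95}} \approx 0.103, \qquad 3\sigma \approx 0.308 \approx 0.3$$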

2,460 genes were found to satisfy this criterion. In order to evaluate the significance of the correlation coefficient of each gene with the ER level, a bootstrap technique was used to generate Monte-Carlo data that randomize the association between the gene expression data of the samples and their categories. The distribution of correlation coefficients obtained from one Monte-Carlo trial is shown as a dashed line in FIG. 5A. To estimate the significance of the 2,460 marker genes as a group, 10,000 Monte-Carlo runs were generated. The collection of 10,000 such Monte-Carlo trials forms the null hypothesis. The number of genes that satisfy the same criterion for Monte-Carlo data varies from run to run. The frequency distribution, from 10,000 Monte-Carlo runs, of the number of genes having correlation coefficients of >0.3 or <-0.3 is displayed in FIG. 5B. Both the mean and the maximum value are much smaller than 2,460. Therefore, the significance of this gene group as the discriminating gene set between ER(+) and ER(-) samples is estimated to be greater than 99.99%.
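The Monte-Carlo control can be sketched as label shuffling: the category vector is randomly reassigned to the samples, and the number of genes passing the same threshold is counted on each run. This is a minimal sketch assuming that "randomizing the association" means permuting the category labels; the function names, the add-one correction in the empirical p-value, and the toy data are assumptions and are not taken from the source.

```python
import math
import random


def correlation(c, r):
    """Equation (2)-style coefficient; returns 0.0 if either norm is zero."""
    dot = sum(ci * ri for ci, ri in zip(c, r))
    norm = math.sqrt(sum(ci * ci for ci in c)) * math.sqrt(sum(ri * ri for ri in r))
    return dot / norm if norm else 0.0


def count_reporters(categories, profiles, threshold=0.3):
    """Number of gene profiles with |coefficient| above the threshold."""
    return sum(1 for r in profiles if abs(correlation(categories, r)) > threshold)


def null_counts(categories, profiles, n_runs=10000, threshold=0.3, seed=0):
    """Counts of passing genes after shuffling the category labels n_runs times."""
    rng = random.Random(seed)
    shuffled = list(categories)
    counts = []
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        counts.append(count_reporters(shuffled, profiles, threshold))
    return counts


def empirical_p_value(observed, counts):
    """Fraction of null runs with at least `observed` genes (add-one corrected)."""
    return (1 + sum(1 for c in counts if c >= observed)) / (1 + len(counts))


# Toy data: 20 samples, three genes that track the category and one that does not.
categories = [1] * 10 + [-1] * 10
profiles = [[float(c) for c in categories] for _ in range(3)] + [[0.0] * 19 + [1.0]]
observed = count_reporters(categories, profiles)
null = null_counts(categories, profiles, n_runs=200, seed=1)
p_value = empirical_p_value(observed, null)
```

With 10,000 runs and no null run reaching the observed count, this estimator gives a p-value just under 1/10,000, in line with the "greater than 99.99%" significance quoted above.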

2. Rank-ordering of candidate discriminating genes

In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. The markers were rank-ordered either by amplitude of correlation, or by using a metric similar to a Fisher statistic:

$$t = \frac{\langle x_1 \rangle - \langle x_2 \rangle}{\sqrt{\left[ \sigma_1^2 (n_1 - 1) + \sigma_2^2 (n_2 - 1) \right] / (n_1 + n_2 - 1) / (1/n_1 + 1/n_2)}} \qquad \text{Equation (3)}$$

In Equation (3), $\langle x_1 \rangle$ is the error-weighted average of log ratio within the ER(-) group, and $\langle x_2 \rangle$ is the error-weighted average of log ratio within the ER(+) group. $\sigma_1^2$ is the variance of log ratio within the ER(-) group and $n_1$ is the number of samples that had valid measurements of log ratios. $\sigma_2^2$ is the variance of log ratio within the ER(+) group and $n_2$ is the number of samples that had valid measurements of log ratios. The t-value in Equation (3) represents the variance-compensated difference between two means. The confidence level of each gene in the candidate list was estimated with respect to a null hypothesis derived from the actual data set using a bootstrap technique; that is, many artificial data sets were generated by randomizing the association between the clinical data and the gene expression data.
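A rank-ordering by a Fisher-like statistic can be sketched as below. The pooled denominator used here is one reading of Equation (3), whose typeset form did not survive cleanly in this text, so it should be treated as an assumption; the use of the plain sample variance (rather than an error-weighted variance) is also an assumption, and all names and data are hypothetical.

```python
import math


def weighted_mean(values, errors):
    """Error-weighted average: sum(x / sigma^2) / sum(1 / sigma^2)."""
    weights = [1.0 / (e * e) for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def sample_variance(values):
    """Unbiased sample variance; needs at least two values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def fisher_like_t(x1, e1, x2, e2):
    """Variance-compensated difference between two group means (assumed form)."""
    n1, n2 = len(x1), len(x2)
    pooled = (sample_variance(x1) * (n1 - 1) + sample_variance(x2) * (n2 - 1)) / (n1 + n2 - 1)
    denominator = math.sqrt(pooled / (1.0 / n1 + 1.0 / n2))
    return (weighted_mean(x1, e1) - weighted_mean(x2, e2)) / denominator


# Hypothetical log ratios (with errors) of one gene in the two groups.
er_negative = [-0.9, -0.7, -0.8]
er_positive = [0.5, 0.6, 0.4, 0.7]
t_value = fisher_like_t(er_negative, [0.1] * 3, er_positive, [0.1] * 4)
```

Candidate genes would then be sorted by the absolute value of this statistic, largest first.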

3. Optimization of the number of marker genes

The leave-one-out method was used for cross validation in order to optimize the discriminating genes. For a set of marker genes from the rank-ordered candidate list, a classifier was trained with 97 samples, and was used to predict the status of the remaining sample. The procedure was repeated for each of the samples in the pool, and the number of cases where the prediction for the one left out was wrong or correct was counted.

The above performance evaluation from leave-one-out cross validation was repeated by successively adding more marker genes from the candidate list. The performance as a function of the number of marker genes is shown in FIG. 6. The error rates for type 1 and type 2 errors varied with the number of marker genes used, but were both minimal when the number of marker genes was around 550. Therefore, this set of 550 genes is considered the optimal set of marker genes that can be used to classify breast cancer tumors into the "ER-negative" group and the "ER-positive" group. FIG. 7 shows the classification of patients as ER(+) or ER(-) based on this 550 marker set. FIG. 8 shows the correlation of each tumor to the ER-negative template versus the correlation of each tumor to the ER-positive template.
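The optimization loop can be sketched as leave-one-out cross validation repeated for a growing number of top-ranked genes. In this minimal sketch the genes are assumed to be already in rank order, the classifier is a simple nearest-template rule using unweighted group means, and only the total error count is tracked rather than type 1 and type 2 errors separately; these simplifications, the function names, and the data are assumptions.

```python
import math


def cosine(a, b):
    """(a . b) / (|a| |b|); returns 0.0 if either norm is zero."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def mean_profile(samples):
    """Per-gene unweighted mean over a list of sample profiles."""
    return [sum(col) / len(col) for col in zip(*samples)]


def loo_errors(samples, labels, n_markers):
    """Leave-one-out misclassifications using the first n_markers ranked genes."""
    errors = 0
    for i, (held_out, truth) in enumerate(zip(samples, labels)):
        train = [(s[:n_markers], l)
                 for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        templates = {g: mean_profile([s for s, l in train if l == g]) for g in (0, 1)}
        y = held_out[:n_markers]
        predicted = max(templates, key=lambda g: cosine(y, templates[g]))
        errors += int(predicted != truth)
    return errors


def best_marker_count(samples, labels, step=1):
    """Error count for each marker-set size, and the smallest size with fewest errors."""
    n_genes = len(samples[0])
    scores = {k: loo_errors(samples, labels, k) for k in range(step, n_genes + 1, step)}
    return min(scores, key=scores.get), scores


# Hypothetical data: six samples, three rank-ordered genes, labels 0 and 1.
samples = [[1.0, 0.9, 0.1], [0.8, 1.1, -0.2], [0.9, 1.0, 0.3],
           [-1.0, -0.8, 0.2], [-0.9, -1.1, -0.1], [-1.1, -0.9, 0.1]]
labels = [0, 0, 0, 1, 1, 1]
best_k, scores = best_marker_count(samples, labels)
```

Note that this sketch ranks genes once, outside the cross-validation loop; whether gene selection was repeated inside each leave-one-out round is not stated in this section.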

4. Classification based on marker genes

In the fourth step, a set of classifier parameters was calculated for each type of training data set based on either of the above ranking methods. A template for the ER(-) group (called $\vec{z}_1$) was generated using the error-weighted log ratio average of the selected group of genes. Similarly, a template for the ER(+) group (called $\vec{z}_2$) was generated using the error-weighted log ratio average of the selected group of genes. Two classifier parameters ($P_1$ and $P_2$) were defined based on either correlation or distance. $P_1$ measures the similarity between one sample $\vec{y}$ and the ER(-) template $\vec{z}_1$ over this selected group of genes. $P_2$ measures the similarity between one sample $\vec{y}$ and the ER(+) template $\vec{z}_2$ over this selected group of genes. The correlation $P_i$ is defined as:

$$P_i = \frac{\vec{z}_i \cdot \vec{y}}{\lVert \vec{z}_i \rVert \cdot \lVert \vec{y} \rVert} \qquad \text{Equation (1)}$$
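The template construction and the two classifier parameters can be sketched as follows. The decision rule (assign the sample to the template with the larger P) is an assumption, since this passage defines $P_1$ and $P_2$ but does not spell out the rule; function names and data are hypothetical.

```python
import math


def error_weighted_template(profiles, errors):
    """Per-gene error-weighted mean log ratio over the samples of one group."""
    template = []
    for gene_values, gene_errors in zip(zip(*profiles), zip(*errors)):
        weights = [1.0 / (e * e) for e in gene_errors]
        template.append(sum(w * v for w, v in zip(weights, gene_values)) / sum(weights))
    return template


def similarity(z, y):
    """Equation (1): P = (z . y) / (|z| |y|); returns 0.0 if either norm is zero."""
    dot = sum(zi * yi for zi, yi in zip(z, y))
    norm = math.sqrt(sum(zi * zi for zi in z)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0


def classify(y, z1, z2):
    """Return (label, P1, P2); the label rule P1 > P2 is an assumed convention."""
    p1, p2 = similarity(z1, y), similarity(z2, y)
    return ("template 1" if p1 > p2 else "template 2"), p1, p2


# Hypothetical two-gene example: rows are samples, columns are genes.
group1_profiles = [[1.0, -0.5], [0.8, -0.7]]
group1_errors = [[0.1, 0.1], [0.2, 0.1]]
z1 = error_weighted_template(group1_profiles, group1_errors)
z2 = [-0.9, 0.6]   # hypothetical template for the other group
label, p1, p2 = classify([0.7, -0.4], z1, z2)
```

The first gene of `z1` comes out at 0.96 rather than the plain mean 0.9, because the measurement with the smaller error carries four times the weight.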

A "leave-one-out" method was used to cross-validate the classifier built based on the marker genes. In this method, one sample was reserved for cross validation each time the classifier was trained. For the set of 550 optimal marker genes, the classifier was trained with 97 of the 98 samples, and the status of the remaining sample was predicted. This procedure was performed with each of the 98 patients. The number of cases where the prediction was wrong or correct was counted. It was further determined that subsets of as few as ~50 of the 2,460 genes are able to classify tumors as ER(+) or ER(-) nearly as well as the total set. In a small number of cases, there was disagreement between classification by the 550 marker set and a clinical classification. In comparing the microarray-measured log ratio of expression for ESR1 to the clinical binary decision (negative or positive) of ER status for each patient, it was seen that the measured expression is consistent with the qualitative category of clinical measurements (mixture of two methods) for the majority of tumors. For example, two patients who were clinically diagnosed as ER(+) actually exhibited low expression of ESR1 in the microarray measurements and were classified as ER(-) by the 550 marker genes. Additionally, 3 patients who were clinically diagnosed as ER(-) exhibited high expression of ESR1 in the microarray measurements and were classified as ER(+) by the same 550 marker genes. Statistically, however, microarray-measured gene expression of ESR1 correlates with the dominant patterns better than clinically determined ER status.

Example 3: Identification of Genetic Markers Distinguishing BRCA1 Tumors From Sporadic Tumors in Estrogen Receptor (-) Patients

The BRCA1 mutation is one of the major clinical categories in breast cancer tumors. It was determined that of the tumors of 38 patients in the ER(-) group, 17 exhibited the BRCA1 mutation, while 21 were sporadic tumors. A method was therefore developed that enabled the differentiation of the 17 BRCA1 mutation tumors from the 21 sporadic tumors in the ER(-) group.

1. Selection of candidate discriminating genes

In the first step, a set of candidate genes was identified based on the gene expression patterns of these 38 samples. We first calculated the correlation between the BRCA1-mutation category number and the expression ratio across all 38 samples for each individual gene by Equation (2). The distribution of the correlation coefficients is shown as a histogram defined by the solid line in FIG. 9A. We observed that, while the majority of genes do not correlate with BRCA1 mutation status, a small group of genes correlated at significant levels. It is likely that genes with larger correlation coefficients would serve as reporters for discriminating tumors of BRCA1 mutation carriers from sporadic tumors within the ER(-) group.

In order to evaluate the significance of each correlation coefficient with respect to a null hypothesis that such a correlation coefficient could be found by chance, a bootstrap technique was used to generate Monte-Carlo data that randomize the association between the gene expression data of the samples and their categories. 10,000 such Monte-Carlo runs were generated as a control in order to estimate the significance of the marker genes as a group. A threshold of 0.35 in the absolute amplitude of correlation coefficients (either correlation or anti-correlation) was applied both to the real data and the Monte-Carlo data. Following this method, 430 genes were found to satisfy this criterion for the experimental data. The p-value of the significance, as measured against the 10,000 Monte-Carlo trials, is approximately 0.0048 (FIG. 9B). That is, the probability that this set of 430 genes contained useful information about BRCA1-like tumors vs. sporadic tumors exceeds 99%.

2. Rank-ordering of candidate discriminating genes

In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. Here, we used the absolute amplitude of correlation coefficients to rank-order the marker genes.

3. Optimization of discriminating genes

In the third step, a subset of genes from the top of this rank-ordered list was used for classification. We defined a BRCA1 group template (called $\vec{z}_1$) by using the error-weighted log ratio average of the selected group of genes. Similarly, we defined a non-BRCA1 group template (called $\vec{z}_2$) by using the error-weighted log ratio average of the selected group of genes. Two classifier parameters (P1 and P2) were defined based on either correlation or distance. P1 measures the similarity between one sample $\vec{y}$ and the BRCA1 template $\vec{z}_1$ over this selected group of genes. P2 measures the similarity between one sample $\vec{y}$ and the non-BRCA1 template $\vec{z}_2$ over this selected group of genes. For correlation, P1 and P2 were defined in the same way as in Equation (1). The leave-one-out method was used for cross validation in order to optimize the discriminating genes as described in Example 2. For a set of marker genes from the rank-ordered candidate list, the classifier was trained with 37 samples and the remaining one was predicted. The procedure was repeated for all the samples in the pool, and the number of cases where the prediction for the one left out was wrong or correct was counted.

To determine the number of markers constituting a viable subset, the above performance evaluation from leave-one-out cross validation was repeated by cumulatively adding more marker genes from the candidate list. The performance as a function of the number of marker genes is shown in FIG. 10. The error rates for type 1 (false negative) and type 2 (false positive) errors (Bendat & Piersol, RANDOM DATA ANALYSIS AND MEASUREMENT PROCEDURES, 2D ED., Wiley Interscience, p. 89) reached optimal ranges when the number of marker genes was approximately 100. Therefore, a set of about 100 genes is considered to be the optimal set of marker genes that can be used to classify tumors in the ER(-) group as either BRCA1-related tumors or sporadic tumors.

The classification results using the optimal 100 genes are shown in FIGS. 11A and 11B. As shown in FIG. 11A, the co-regulation patterns of the sporadic patients differ from those of the BRCA1 patients primarily in the amplitude of regulation. Only one sporadic tumor was classified into the BRCA1 group. Patients in the sporadic group are not necessarily BRCA1 mutation negative; however, it is estimated that only approximately 5% of sporadic tumors are indeed BRCA1-mutation carriers.

Example 4: Identification of Genetic Markers Distinguishing Sporadic Tumor Patients with >5 Year Versus <5 Year Survival Times

78 tumors from sporadic breast cancer patients were used to explore prognostic predictors from gene expression data. Of the 78 samples in this sporadic breast cancer group, 44 samples were known clinically to have had no distant metastases within 5 years since the initial diagnosis ("no distant metastases group") and 34 samples had distant metastases within 5 years since the initial diagnosis ("distant metastases group"). A group of 231 markers, and optimally a group of 70 markers, was identified that allowed differentiation between these two groups.

1. Selection of candidate discriminating genes

In the first step, a set of candidate discriminating genes was identified based on gene expression data of these 78 samples. The correlation between the prognostic category number (distant metastases vs. no distant metastases) and the logarithmic expression ratio across all samples for each individual gene was calculated using Equation (2). The distribution of the correlation coefficients is shown as a solid line in FIG. 12A. FIG. 12A also shows the result of one Monte-Carlo run as a dashed line. We observe that even though the majority of genes do not correlate with the prognostic categories, a small group of genes do correlate. It is likely that genes with larger correlation coefficients would be more useful as reporters for the prognosis of interest (distant metastases group and no distant metastases group). In order to evaluate the significance of each correlation coefficient with respect to a null hypothesis that such a correlation coefficient can be found by chance, we used a bootstrap technique to generate data from 10,000 Monte-Carlo runs as a control (FIG. 12B). We then selected genes that have a correlation coefficient either larger than 0.3 ("correlated genes") or less than -0.3 ("anti-correlated genes"). The same selection criterion was applied both to the real data and the Monte-Carlo data. Using this comparison, 231 markers from the experimental data were identified that satisfy this criterion. The probability that this gene set, which discriminates patients in the distant metastases group from those in the no distant metastases group, was chosen by random fluctuation is approximately 0.003.

2. Rank-ordering of candidate discriminating genes

In the second step, genes on the candidate list were rank-ordered based on the significance of each gene as a discriminating gene. Specifically, a metric similar to a "Fisher" statistic, defined in Equation (3), was used for the purpose of rank ordering. The confidence level of each gene in the candidate list was estimated with respect to a null hypothesis derived from the actual data set using the bootstrap technique. Genes in the candidate list can also be ranked by the amplitude of correlation coefficients.

3. Optimization of discriminating genes

In the third step, a subset of 5 genes from the top of this rank-ordered list was selected for use as discriminating genes to classify the 78 tumors into a "distant metastases group" or a "no distant metastases group". The leave-one-out method was used for cross validation. Specifically, 77 samples defined a classifier based on the set of selected discriminating genes, and these were used to predict the remaining sample. This procedure was repeated so that each of the 78 samples was predicted. The number of cases in which predictions were correct or incorrect was counted. The performance of the classifier was measured by the type 1 and type 2 error rates for this selected gene set.

We repeated the above performance evaluation procedure, adding 5 more marker genes each time from the top of the candidate list, until all 231 genes were used. As shown in FIG. 13, the numbers of mis-predictions of type 1 and type 2 change dramatically with the number of marker genes employed. The combined error rate reached a minimum when 70 marker genes from the top of our candidate list were used. Therefore, this set of 70 genes is the optimal, preferred set of marker genes useful for the classification of sporadic tumor patients into either the distant metastases or the no distant metastases group. Fewer or more markers also act as predictors, but are less efficient, either because of higher error rates or because of the introduction of statistical noise.

4. Reoccurrence probability curves

The prognostic classification of 78 patients with sporadic breast cancer tumors into two distinct subgroups was predicted based on their expression of the 70 optimal marker genes (FIGS. 14 and 15).

To evaluate the prognostic classification of sporadic patients, we predicted the outcome of each patient by a classifier trained on the remaining 77 patients based on the 70 optimal marker genes. FIG. 16 plots the distant metastases probability as a function of the time since initial diagnosis for the two predicted groups. The difference between these two reoccurrence curves is significant. Using the χ² test (S-PLUS 2000 Guide to Statistics, vol. 2, MathSoft, p. 44), the p-value is estimated to be ~10⁻⁹. The distant metastases probability as a function of the time since initial diagnosis was also compared between ER(+) and ER(-) individuals (FIG. 17), between PR(+) and PR(-) individuals (FIG. 18), and between individuals with different tumor grades (FIGS. 19A, 19B). In comparison, the p-values for the differences between two prognostic groups based on clinical data are much less significant than those based on gene expression data, ranging

To parameterize the reoccurrence probability as a function of time since initial diagnosis, the curve was fitted to one type of survival model - "normal":

$$P = \alpha \times \exp\left(-t^2/\tau^2\right) \qquad (4)$$

For fixed α = 1, we found that τ = 125 months for patients in the no distant metastases group and τ = 36 months for patients in the distant metastases group. Using tumor grades, we found τ = 100 months for patients with tumor grades 1 and 2 and τ = 60 months for patients with tumor grade 3. It is accepted clinical practice that tumor grades are the best available prognostic predictor. However, the difference between the two prognostic groups classified based on the 70 marker genes is much more significant than that between groups classified by the best available clinical information.
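The fitted model of Equation (4) can be evaluated directly for the reported time constants. Because the curve starts at α for t = 0 and falls with time, the sketch reads P as the fraction of patients still free of distant metastases at time t; that interpretation, and the function name, are assumptions.

```python
import math


def normal_survival(t_months, tau_months, alpha=1.0):
    """Equation (4): P = alpha * exp(-t^2 / tau^2)."""
    return alpha * math.exp(-(t_months ** 2) / (tau_months ** 2))


# Time constants reported in the text (months).
tau_by_group = {
    "no distant metastases group": 125.0,
    "distant metastases group": 36.0,
    "tumor grades 1 and 2": 100.0,
    "tumor grade 3": 60.0,
}

# Model value at five years (60 months) for each group.
p_at_five_years = {group: normal_survival(60.0, tau) for group, tau in tau_by_group.items()}
```

The gap between the two expression-defined groups at 60 months is wider than the gap between the two grade-defined groups, which is the comparison the paragraph above draws.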

5. Prognostic prediction for 19 independent sporadic tumors

To confirm the proposed prognostic classification method and to ensure the reproducibility, robustness, and predictive power of the 70 optimal prognostic marker genes, we applied the same classifier to 19 independent tumor samples from sporadic breast cancer patients, prepared separately at The Netherlands Cancer Institute (NKI). The same reference pool was used. The classification results for the 19 independent sporadic tumors are shown in FIG. 20. FIG. 20A shows the log ratio of expression regulation of the same 70 optimum marker genes. Based on our classifier model, we expected the misclassification of 19 × (6+7)/78 = 3.2 tumors. Consistently, (1+3) = 4 of 19 tumors were misclassified.
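The expected number of misclassifications scales the error count observed among the 78 training tumors to the 19 independent tumors. Reading 6 and 7 as the two types of misclassification counted in the 78-tumor cross validation is an assumption, since this section does not label them:

$$19 \times \frac{6 + 7}{78} = 19 \times \frac{13}{78} = \frac{19}{6} \approx 3.2$$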

6. Clinical parameters as a group vs. microarray data - Results of logistic regression

In the previous section, the predictive power of each individual clinical parameter was compared with that of the expression data. However, it is more meaningful to combine all the clinical parameters as a group, and then compare them to the expression data. This requires multivariate modeling; the method chosen was logistic regression. Such an approach also demonstrates how much improvement the microarray approach adds to the results of the clinical data.

The clinical parameters used for the multivariate modeling were: (1) tumor grade; (2) ER status; (3) presence or absence of the progesterone receptor (PR); (4) tumor size; (5) patient age; and (6) presence or absence of angioinvasion. For the microarray data, two correlation coefficients were used. One is the correlation to the mean of the good prognosis group (C1) and the other is the correlation to the mean of the bad prognosis group (C2). When calculating the correlation coefficients for a given patient, this patient is excluded from either of the two means.

The logistic regression optimizes the coefficient of each input parameter to best predict the outcome of each patient. One way to judge the predictive power of each input parameter is by how much deviance (similar to Chi-square in the linear regression; see, for example, Hosmer & Lemeshow, APPLIED LOGISTIC REGRESSION, John Wiley & Sons, (2000)) the parameter accounts for. The best predictor should account for most of the deviance. To fairly assess the predictive power, each parameter was modeled independently. The microarray parameters explain most of the deviance, and hence are powerful predictors.

The clinical parameters, and the two microarray parameters, were then monitored as a group. The total deviance explained by the six clinical parameters was 31.5, and the total deviance explained by the microarray parameters was 39.4. However, when the clinical data was modeled first, and the two microarray parameters were then added, the final deviance accounted for was 57.0.
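The notion of "deviance explained" can be sketched with the standard binomial deviance, -2 times the log-likelihood, comparing a fitted model against an intercept-only model that predicts the overall base rate for every patient. Whether the source used exactly this convention is an assumption, and the fitted probabilities below are hypothetical placeholders rather than outputs of the actual regression.

```python
import math


def deviance(outcomes, probabilities):
    """Binomial deviance: -2 * sum(y*ln(p) + (1-y)*ln(1-p)); requires 0 < p < 1."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total


def deviance_explained(outcomes, probabilities):
    """Null deviance (base-rate model) minus the deviance of the fitted model."""
    base_rate = sum(outcomes) / len(outcomes)
    null = deviance(outcomes, [base_rate] * len(outcomes))
    return null - deviance(outcomes, probabilities)


# 44 good-prognosis (1) and 34 poor-prognosis (0) patients, as in this Example.
outcomes = [1] * 44 + [0] * 34
null_deviance = deviance(outcomes, [44 / 78] * 78)

# Hypothetical fitted probabilities from some model, for illustration only.
fitted = [0.9] * 44 + [0.1] * 34
explained = deviance_explained(outcomes, fitted)
```

Under this convention the null deviance for a 44/34 split is about 106.8, which gives a scale against which explained-deviance values can be read.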

The logistic regression computes the likelihood that a patient belongs to the good or poor prognostic group. FIGS. 21A and 21B show the sensitivity vs. (1-specificity). The plots were generated by varying the threshold on the model-predicted likelihood. The curve which goes through the top left corner is the best (high sensitivity with high specificity). The microarray outperformed the clinical data by a large margin. For example, at a fixed sensitivity of around 80%, the specificity was ~80% from the microarray data, and ~65% from the clinical data, for the good prognosis group. For the poor prognosis group, the corresponding specificities were ~80% and ~70%, again at a fixed sensitivity of 80%. Combining the microarray data with the clinical data further improved the results. The result can also be displayed as the total error rate as a function of the threshold, as in FIG. 21C. At all possible thresholds, the error rate from the microarray was always smaller than that from the clinical data. By adding the microarray data to the clinical data, the error rate is further reduced, as one can see in FIG. 21C.
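The sensitivity vs. (1-specificity) curves can be produced by sweeping a threshold over the model-predicted likelihoods. The sketch below is a minimal illustration with hypothetical outcomes and scores; it assumes a patient is called positive when the score is at or above the threshold.

```python
def sensitivity_specificity(outcomes, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(outcomes, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(outcomes, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(outcomes, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(outcomes, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(outcomes, scores):
    """(1 - specificity, sensitivity, threshold) for each distinct score, high to low."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(outcomes, scores, threshold)
        points.append((1.0 - spec, sens, threshold))
    return points


# Hypothetical outcomes (1 = in the group of interest) and predicted likelihoods.
outcomes = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
curve = roc_points(outcomes, scores)
sens_at_half, spec_at_half = sensitivity_specificity(outcomes, scores, 0.5)
```

The total error rate at a given threshold, as in the third panel described above, follows from the same four counts as (fp + fn) divided by the number of patients.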

Odds ratio tables can be created from the prediction of the logistic regression. The probability of a patient being in the good prognosis group is calculated by the logistic regression based on different combinations of input parameters (clinical and/or microarray). Patients are divided into the following four groups according to the prediction and the true outcome: (1) predicted good and truly good, (2) predicted good but truly poor, (3) predicted poor but truly good, (4) predicted poor and truly poor. Groups (1) & (4) represent correct predictions, while groups (2) & (3) represent mispredictions. The division for the prediction is set at a probability of 50%, although other thresholds can be used. The results are listed in Table 8. It is clear from Table 8 that microarray profiling (Table 8.3 & 8.10) outperforms any single clinical parameter (Table 8.4-8.9) and the combination of the clinical data (Table 8.2). Adding the microarray profiling to the clinical data gives the best results (Table 8.1).

For microarray profiling, one can also make a similar table (Table 8.11) without using logistic regression. In this case, the prediction was simply based on C1-C2 (greater than 0 means good prognosis, less than 0 means poor prognosis).
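The C1-C2 rule and the four-way division used for the odds ratio tables can be sketched as follows. The correlation values and outcomes are hypothetical; the handling of an exact tie (C1 equal to C2), which the text leaves open, is assigned to "poor" here purely as an assumption.

```python
def predict_by_difference(c1, c2):
    """Good prognosis if C1 - C2 > 0; otherwise poor (a tie is treated as poor)."""
    return "good" if c1 - c2 > 0 else "poor"


def confusion_counts(predicted, actual):
    """Counts keyed by (predicted, true) outcome, covering the four groups."""
    table = {("good", "good"): 0, ("good", "poor"): 0,
             ("poor", "good"): 0, ("poor", "poor"): 0}
    for p, a in zip(predicted, actual):
        table[(p, a)] += 1
    return table


def odds_ratio(table):
    """(correct good * correct poor) / (product of the two kinds of misprediction)."""
    mispredicted = table[("good", "poor")] * table[("poor", "good")]
    if mispredicted == 0:
        raise ValueError("odds ratio is undefined when a misprediction cell is zero")
    return table[("good", "good")] * table[("poor", "poor")] / mispredicted


# Hypothetical (C1, C2) pairs and true outcomes for six patients.
pairs = [(0.6, 0.1), (0.5, 0.2), (0.2, 0.5), (0.1, 0.4), (0.4, 0.3), (0.2, 0.3)]
actual = ["good", "good", "poor", "poor", "poor", "good"]
predicted = [predict_by_difference(c1, c2) for c1, c2 in pairs]
table = confusion_counts(predicted, actual)
ratio = odds_ratio(table)
```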

Example 5. Concept of mini-array for diagnosis purposes.

All genes on the marker gene list for the purpose of diagnosis and prognosis can be synthesized on a small-scale microarray using ink-jet technology. Microarrays with genes for diagnosis and for prognosis can be made separately or as a single combined array. Each gene on the list is represented by single or multiple oligonucleotide probes, depending on its sequence uniqueness across the genome. This custom-designed mini-array, in combination with a sample preparation protocol, can be used as a diagnostic/prognostic kit in clinics.

Example 6. Biological Significance of diagnostic marker genes

The public domain was searched for the available functional annotations for the 430 marker genes for BRCA1 diagnosis in Table 3. The 430 diagnostic genes in Table 3 can be divided into two groups: (1) 196 genes that are highly expressed in the BRCA1-like group; and (2) 234 genes that are highly expressed in the sporadic group. Of the 196 BRCA1 group genes, 94 are annotated. Of the 234 sporadic group genes, 100 are annotated. The terms "T-cell", "B-cell" or "immunoglobulin" are involved in 13 of the 94 annotated genes, and in 1 of the 100 annotated genes, respectively. Of 24,479 genes represented on the microarrays, there are 7,586 genes with annotations to date. "T-cell", "B-cell" and "immunoglobulin" are found in 207 of these 7,586 genes. Given this, the p-value of the 13 "T-cell", "B-cell" or "immunoglobulin" genes in the BRCA1 group is very significant (p-value = 1.1×10⁻⁶). In comparison, the observation of 1 gene relating to "T-cell", "B-cell", or "immunoglobulin" in the sporadic group is not significant (p-value = 0.18).
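An enrichment calculation of this kind can be sketched with a hypergeometric model: drawing the annotated marker genes from the 7,586 annotated genes on the array, of which 207 carry one of the three terms. The source does not state which test was used, so the choice of model is an assumption, as is the question of whether the reported values correspond to the point probability or to a tail probability; the sketch computes both.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k tagged genes among `draws` genes drawn without replacement."""
    return comb(successes, k) * comb(population - successes, draws - k) / comb(population, draws)


def hypergeom_upper_tail(k, population, successes, draws):
    """Probability of k or more tagged genes among the draws."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws) for i in range(k, upper + 1))


ANNOTATED_ON_ARRAY = 7586   # annotated genes on the microarray
TAGGED = 207                # of those, genes with a T-cell/B-cell/immunoglobulin term

# BRCA1 group: 13 tagged genes among 94 annotated markers.
brca1_point = hypergeom_pmf(13, ANNOTATED_ON_ARRAY, TAGGED, 94)
brca1_tail = hypergeom_upper_tail(13, ANNOTATED_ON_ARRAY, TAGGED, 94)

# Sporadic group: 1 tagged gene among 100 annotated markers.
sporadic_point = hypergeom_pmf(1, ANNOTATED_ON_ARRAY, TAGGED, 100)
```

The same two functions apply to any other annotation term once the corresponding counts on the array and in the marker list are known.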

The observation that BRCA1 patients have highly expressed lymphocyte (T-cell and B-cell) genes agrees with the pathological finding that BRCA1 breast tumors are more frequently associated with high lymphocytic infiltration than sporadic cases (Chappuis et al., 2000, Semin Surg Oncol 18:287-295).

Example 7. Biological significance of prognosis marker genes

A search was performed for available functional annotations for the 231 prognosis marker genes (Table 5). The markers fall into two groups: (1) 156 markers that are highly expressed in the poor prognosis group; and (2) 75 genes that are highly expressed in the good prognosis group. Of the 156 markers, 72 genes are annotated; of the 75 genes, 28 genes are annotated.

Twelve of the 72 markers, but none of the 28 markers, are, or are associated with, kinases. In contrast, of the 7,586 genes on the microarray having annotations to date, only 471 involve kinases. On this basis, the presence of twelve kinase-related markers in the poor prognosis group is significant (p-value = 0.001).

Kinases are important regulators of intracellular signal transduction pathways mediating cell proliferation, differentiation and apoptosis. Their activity is normally tightly controlled and regulated. Overexpression of certain kinases is well known to be involved in oncogenesis. One example is vascular endothelial growth factor receptor 1 (VEGFR1 or FLT1), a tyrosine kinase in the poor prognosis group, which plays a very important role in tumor angiogenesis. Interestingly, vascular endothelial growth factor (VEGF), VEGFR's ligand, is also found in the poor prognosis group, which means both ligand and receptor are upregulated in poor prognosis individuals by an unknown mechanism.

Likewise, 16 of the 72 markers, and only two of the 28 markers, are, or are associated with, ATP-binding or GTP-binding proteins. In contrast, of the 7,586 genes on the microarray having annotations to date, only 714 and 153 involve ATP-binding and GTP-binding, respectively. On this basis, the presence of 16 GTP- or ATP-binding-related markers in the poor prognosis group is significant (p-values 0.001 and 0.0038). Thus, the kinase- and ATP- or GTP-binding-related markers within the 72 markers can be used as prognostic indicators.

Cancer is characterized by deregulated cell proliferation. On the simplest level, this requires division of the cell, or mitosis. By keyword searching, we found "cell division" or "mitosis" included in the annotations of 7 of the 72 annotated markers from the 156 poor prognosis markers, but in none of the 28 annotated genes from the 75 good prognosis markers. Of the 7,586 microarray markers with annotations, "cell division" is found in 62 annotations and "mitosis" is found in 37 annotations. Based on these findings, the presence of seven cell division- or mitosis-related markers in the poor prognosis group is estimated to be highly significant (p-value = 3.5×10⁻⁵). In comparison, the absence of cell division- or mitosis-related markers in the good prognosis group is not significant (p-value = 0.69). Thus, the seven cell division- or mitosis-related markers may be used as markers for poor prognosis.

Example 8: Construction of an artificial reference pool.

The reference pool for expression profiling in the above Examples was made by using equal amounts of cRNA from each individual patient in the sporadic group. In order to have a reliable, easy-to-make, and abundant reference pool, a reference pool for breast cancer diagnosis and prognosis can be constructed using synthetic nucleic acid representing, or derived from, each marker gene. Expression of marker genes for an individual patient sample is monitored only against the reference pool, not a pool derived from other patients.

To make the reference pool, 60-mer oligonucleotides are synthesized according to the 60-mer ink-jet array probe sequence for each diagnostic/prognostic reporter gene, then double-stranded and cloned into pBluescript SK- vector (Stratagene, La Jolla, CA), adjacent to the T7 promoter sequence. Individual clones are isolated, and the sequences of their inserts are verified by DNA sequencing. To generate synthetic RNAs, clones are linearized with EcoRI and a T7 in vitro transcription (IVT) reaction is performed according to the MegaScript kit (Ambion, Austin, TX). IVT is followed by DNase treatment of the product. Synthetic RNAs are purified on RNeasy columns (Qiagen, Valencia, CA). These synthetic RNAs are transcribed, amplified, labeled, and mixed together to make the reference pool. The abundance of those synthetic RNAs is adjusted to approximate the abundance of the corresponding marker-derived transcripts in the real tumor pool.

Example 9: Use of single-channel data and a sample pool represented by stored values.

1. Creation of a reference pool of stored values ("mathematical sample pool")

The use of ratio-based data in Examples 1-7, above, requires a physical reference sample. In the above Examples, a pool of sporadic tumor samples was used as the reference. Use of such a reference, while enabling robust prognostic and diagnostic predictions, can be problematic because the pool is typically a limited resource. A classifier method was therefore developed that does not require a physical sample pool, making application of this predictive and diagnostic technique much simpler in clinical applications.

To test whether single-channel data could be used, the following procedure was developed. First, the single-channel intensity data for the 70 optimal genes, described in Example 4, from the 78 sporadic training samples, described in the Materials and Methods, were selected from the sporadic sample vs. tumor pool hybridization data. The 78 samples consisted of 44 samples from patients having a good prognosis and 34 samples from patients having a poor prognosis. Next, the hybridization intensities for these samples were normalized by dividing by the median intensity of all the biological spots on the same microarray. Where multiple microarrays per sample were used, the average was taken across all of the microarrays. A log transform was performed on the intensity data for each of the 70 genes, or on the average intensity for each of the 70 genes where more than one microarray was hybridized, and a mean log intensity for each gene across the 78 sporadic samples was calculated. For each sample, the mean log intensities thus calculated were subtracted from the individual sample log intensity. This figure, the mean-subtracted log(intensity), was then treated as the two-color log(ratio) for the classifier by substitution into Equation (5). For new samples, the mean log intensity is subtracted in the same manner as noted above, and a mean-subtracted log(intensity) is calculated.
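The normalization steps described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based data layout, the toy intensities and the three-gene marker list are all hypothetical, a base-10 logarithm is assumed (the text says only "log transform"), and averaging across replicate microarrays is omitted for brevity.

```python
import math
from statistics import mean, median

def normalized_log_intensity(spot_intensities, marker_ids):
    """Divide each marker intensity by the median of all spots on the
    same array, then take log10 (base 10 is an assumption)."""
    array_median = median(spot_intensities.values())
    return {g: math.log10(spot_intensities[g] / array_median) for g in marker_ids}

def build_mathematical_pool(training_logs, marker_ids):
    """Mean log intensity per gene across the training samples."""
    return {g: mean(sample[g] for sample in training_logs) for g in marker_ids}

def mean_subtracted(sample_log, pool):
    """Mean-subtracted log(intensity), used in place of the log(ratio)."""
    return {g: sample_log[g] - pool[g] for g in pool}

# Hypothetical toy data: three arrays, three marker genes, two other spots.
markers = ["g1", "g2", "g3"]
arrays = [
    {"g1": 200.0, "g2": 50.0, "g3": 100.0, "s1": 100.0, "s2": 100.0},
    {"g1": 400.0, "g2": 100.0, "g3": 200.0, "s1": 200.0, "s2": 200.0},
    {"g1": 100.0, "g2": 400.0, "g3": 100.0, "s1": 100.0, "s2": 100.0},
]
training_logs = [normalized_log_intensity(a, markers) for a in arrays]
pool = build_mathematical_pool(training_logs, markers)
centered = [mean_subtracted(s, pool) for s in training_logs]
print(centered[0])
```

In this toy data the second array is simply twice as bright as the first, and median normalization removes that difference, so both arrays yield the same mean-subtracted values. A new sample would be processed with `normalized_log_intensity` and then `mean_subtracted` against the stored `pool`.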

The creation of a set of mean log intensities for each gene hybridized creates a "mathematical sample pool" that replaces the quantity-limited "material sample pool." This mathematical sample pool can then be applied to any sample, including samples in hand and ones to be collected in the future. This "mathematical sample pool" can be updated as more samples become available.

2. Results

To demonstrate that the mathematical sample pool performs a function equivalent to the sample reference pool, the mean-subtracted-log(intensity) (single channel data, relative to the mathematical pool) vs. the log(ratio) (hybridizations, relative to the sample pool) was plotted for the 70 optimal reporter genes across the 78 sporadic samples, as shown in FIG. 22. The ratio and single-channel quantities are highly correlated, indicating both have the capability to report relative changes in gene expression. A classifier was then constructed using the mean-subtracted-log(intensity) following exactly the same procedure as was followed using the ratio data, as in Example 4.

As shown in FIGS. 23A and 23B, single-channel data was successful at classifying samples based on gene expression patterns. FIG. 23A shows samples grouped according to prognosis using single-channel hybridization data. The white line separates samples from patients classified as having poor prognoses (below) and good prognoses (above). FIG. 23B plots each sample as its expression data correlates with the good (open circles) or poor (filled squares) prognosis classifier parameter. Using the "leave-one-out" cross validation method, the classifier predicted 10 false positives out of 44 samples from patients having a good prognosis, and 6 false negatives out of 34 samples from patients having a poor prognosis, where a poor prognosis is considered a "positive." This outcome is comparable to that of the ratio-based classifier, which predicted 7 out of 44, and 6 out of 34, respectively.

In clinical applications, it is greatly preferable to have few false negatives, which results in fewer under-treated patients. To conform the results to this preference, a classifier was constructed by ranking the patient samples according to their coefficients of correlation to the "good prognosis" template, and choosing a threshold for this correlation coefficient to allow approximately 10% false negatives, i.e., classification of a sample from a patient with poor prognosis as one from a patient with a good prognosis. Out of the 34 poor prognosis samples used herein, this represents a tolerance of 3 out of 34 poor prognosis patients classified incorrectly. This tolerance limit corresponds to a threshold of 0.2727 for the coefficient of correlation to the "good prognosis" template. Results using this threshold are shown in FIGS. 24A and 24B. FIG. 24A shows single-channel hybridization data for samples ranked according to the coefficients of correlation with the good prognosis classifier; samples classified as "good prognosis" lie above the white line, and those classified as "poor prognosis" lie below. FIG. 24B shows a scatterplot of sample correlation coefficients, with three incorrectly classified samples lying to the right of the threshold correlation coefficient value. Using this threshold, the classifier had a false positive rate of 15 out of the 44 good prognosis samples. This result is not very different from the error rate of 12 out of 44 for the ratio-based classifier. In summary, the 70 reporter genes carry robust information about prognosis; the single-channel data can predict the tumor outcome almost as well as the ratio-based data, while being more convenient in a clinical setting.
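The threshold selection described above (ranking samples by correlation to the "good prognosis" template and tolerating a fixed number of false negatives) can be sketched as follows. The function names and the correlation values are hypothetical illustrations, not data from the study, and the handling of ties (a sample exactly at the threshold is called "poor") is an assumption.

```python
def threshold_for_false_negatives(poor_correlations, allowed_false_negatives):
    """Return the lowest threshold such that at most `allowed_false_negatives`
    poor-prognosis samples have a correlation strictly above it."""
    ranked = sorted(poor_correlations, reverse=True)
    if allowed_false_negatives >= len(ranked):
        return float("-inf")  # every sample may be called "good"
    return ranked[allowed_false_negatives]

def classify(correlation, threshold):
    """Samples strictly above the threshold are called 'good' (assumed tie rule)."""
    return "good" if correlation > threshold else "poor"

# Hypothetical correlations to the "good prognosis" template.
poor = [0.71, 0.52, 0.33, 0.27, 0.10, -0.05, -0.20, -0.41, -0.48, -0.60]
good = [0.85, 0.74, 0.66, 0.58, 0.45, 0.31, 0.22, 0.05]

threshold = threshold_for_false_negatives(poor, allowed_false_negatives=1)
false_negatives = sum(classify(c, threshold) == "good" for c in poor)
false_positives = sum(classify(c, threshold) == "poor" for c in good)
print(threshold, false_negatives, false_positives)
```

Lowering the tolerated number of false negatives raises the threshold, which in turn increases the number of good prognosis samples called "poor". This is the same trade-off reported above (3 of 34 false negatives against 15 of 44 false positives).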

Example 10: Comparison of predictive power of 70 optimal genes to clinical predictors and development of three prognosis categories

Using inkjet-synthesized oligonucleotide microarrays, we have defined a gene expression profile associated with prognosis in breast cancer. To identify this gene expression profile, tumors of less than 5 cm from lymph node negative patients younger than 55 years were used. Surprisingly, a 70 gene-based classifier outperformed all clinical parameters in predicting distant metastases within 5 years. The odds ratio for metastases of the "poor prognosis" versus "good prognosis" signature group based on the gene expression pattern was estimated to be approximately 15 by a cross-validation procedure. Even though these results were highly encouraging, a limitation of this first study was that the results were derived from and tested on two groups of patients who were selected for outcome: one group of patients who developed distant metastases within 5 years and one group of patients who remained disease-free for at least 5 years. To provide a more accurate estimate of the risk of metastases associated with the prognosis signature and to further substantiate that the gene expression profile is a clinically meaningful tool, a cohort of 295 young breast cancer patients including both lymph node negative and positive patients was studied. The findings confirm that the prognosis profile is a more powerful predictor of disease outcome than currently used criteria.

1. Breast tumor selection criteria

A consecutive series of 295 tumors was selected from The Netherlands Cancer Institute (NKI) fresh-frozen tissue bank according to the following patient selection criteria: primary invasive breast carcinoma less than 5 cm at pathologic examination (pT1 or pT2); tumor-negative apical axillary lymph node as determined by a negative infraclavicular lymph node biopsy; age at diagnosis 52 years or younger; calendar year of diagnosis 1984-1995; and no prior malignancies. All patients had been treated by modified radical mastectomy or breast conserving surgery, including axillary lymph node dissection, followed by radiotherapy if indicated. The 295 tumor samples included 151 taken from lymph node negative (pathologic examination pN0) patients and 144 from lymph node positive (pN+) patients. Ten of the 151 lymph node negative patients and 120 of the 144 lymph node positive patients had received adjuvant systemic therapy consisting of chemotherapy (n=90), hormonal therapy (n=20), or both (n=20). All patients were followed at least annually for a period of at least 5 years. Patient follow-up information was extracted from the NKI Medical Registry. Median follow-up of the 207 patients without metastases as first event was 7.8 years (range: 0.05-18.3) versus 2.7 years (0.3-14.0) for the 88 patients with metastasis as first event during follow-up. For all 295 patients, median follow-up was 6.7 years (0.05-18.3). There were no missing data. This study was approved by the Medical Ethical Committee of the Netherlands Cancer Institute.

Clinicopathological parameters were determined as described in Materials and Methods, above. Estrogen receptor (ER) expression was estimated by hybridization intensity obtained from microarray experiments. Using this assay, it was determined that the cohort of 295 tumor samples includes 69 ER negative (ER log₁₀ intensity ratio below -0.65 units, corresponding to less than 10% nuclei with positive staining by immunohistochemistry) and 226 ER positive tumors. Histological grade was assessed using the method described by Elston and Ellis, Histopathol. 19(5):403-410 (1991). Vascular invasion was assessed as none (-); minor (1-3 vessels; +/-); major (>3 vessels).

2. RNA isolation and microarray expression profiling

RNA isolation, cRNA labeling, the 25K oligonucleotide microarrays, and hybridization experiments were as described in Materials and Methods. The statistical error model that assigns p values to expression ratios was as described in Example 4. After hybridization, slides were washed and scanned using a confocal laser scanner (Agilent Technologies) (see Hughes et al., Nat. Biotechnol. 19(4):342-347 (2001)).

3. Correlation of the microarray data with the previously determined prognosis profile

The prognostic value of the gene expression profile in a consecutive series of breast cancer patients was determined using the 70 marker genes identified in the experiments described in Example 4. To acquire this consecutive series, 61 of the pN0 patients that were also part of the training series used for the construction of the 70-gene prognosis profile were included. Leaving out these patients would have resulted in selection bias, because the first series contained a disproportionately large number of patients who developed distant metastases within 5 years. For each of the 234 new tumors in this 295 tumor sample cohort we calculated the correlation coefficient of the expression of the 70 genes with the previously determined average profile of these genes in tumors of good prognosis patients (C1) (see Example 4). A tumor with a correlation coefficient > 0.4 (a threshold previously determined in the training set of 78 tumors that allowed 10% false negatives) was then assigned to the "good prognosis" signature group and all other tumors were assigned to the "poor prognosis" signature group. To avoid overfitting by the 61 previously used pN0 patients, their cross-validated correlation coefficients were used for the prognosis classification, with a threshold correlation coefficient value of 0.55 (corresponding to the threshold for 10% false negatives of this cross-validated classifier).

4. Statistical analysis

In the analysis of distant metastasis-free probabilities, patients whose first event was distant metastases were counted as failures; all other patients were censored at the date of their last follow-up, non-breast cancer death, local-regional recurrence or second primary malignancy, including contralateral breast cancer. Time was measured from the date of surgery. Metastasis-free curves were drawn using the method of Kaplan and Meier and compared using the log-rank test. Standard errors (SEs) of the metastasis-free percentages were calculated using the method of Tsiatis (Klein, Scand. J. of Statistics 18:333-340 (1991)).
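As an illustration of the Kaplan-Meier method referred to above, the following minimal sketch computes the product-limit estimate of the metastasis-free probability from follow-up times and event indicators. The function name and the toy data are hypothetical, and the study itself reports using S-Plus rather than this code. Standard errors and the log-rank test are not covered here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the event-free probability.

    times  : follow-up time for each patient (e.g. years since surgery)
    events : True if distant metastasis was the first event, False if censored
    Returns a list of (time, probability) pairs at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # failures and censored patients leave the risk set
    return curve

# Hypothetical toy cohort of six patients.
times = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
events = [True, False, True, False, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:.1f} years: metastasis-free probability {s:.3f}")
```

Censored patients contribute to the number at risk up to their censoring time but never lower the curve, which matches the censoring rule stated above for patients without distant metastasis as first event.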

Proportional hazard regression analysis (Cox, J. R. Statist. Soc. B 34:187-220 (1972)) was used to adjust the association between the correlation coefficient C1 and metastases for other variables. SEs were calculated using the sandwich estimator (Lin and Wei, J. Amer. Stat. Assoc. 84:1074-1079 (1989)). Histological grade, vascular invasion and the number of axillary lymph node metastases (0 vs. 1-3 vs. 1) were used as variables. Linearity of the relation between ln(relative hazard) and tumor diameter, age and expression level of ER was tested using the Wald test for non-linear components of restricted cubic splines (Therneau et al., Biometrika 77:147-160 (1990)). No evidence for non-linearity was found (age: p=0.83, tumor diameter: p=0.75, number of positive nodes: p=0.65 and ER expression level: p=0.27). Non-proportionality of the hazard was tested using the Grambsch and Therneau method (Grambsch and Therneau, Biometrika 81:515-526 (1994)). In addition, for C1 the difference between the ln(hazard ratio) before and after 5 years of follow-up was tested using the Wald test. All calculations were done using the Splus2000 or Splus6 statistical package.

5. Prognosis signature of 295 breast cancers

From each of the 295 tumors, total RNA was isolated and used to generate cRNA, which was labeled and hybridized to microarrays containing ~25,000 human genes (see Materials and Methods). Fluorescence intensities of scanned images were quantified and normalized to yield the transcript abundance of a gene as an intensity ratio as compared to a reference pool of cRNA made up of equal amounts of cRNA of all tumors combined. The gene expression ratios of the previously determined 70 prognosis marker genes for all 295 tumors in this study are shown in Figure 25A. Tumors above (i.e., having a correlation coefficient greater than) the previously determined threshold (dotted line) were assigned to the "good prognosis" category (n=115); those below the line were assigned to the "poor prognosis" category (n=180). Figure 25B displays the time to distant metastases as a first event (black dots) or the time of follow-up for all other patients (gray dots, see methods). Figure 25C shows the lymph node status, distant metastases and survival for all 295 patients. By comparing Figures 25A, 25B, and 25C, it can be seen that there is a strong correlation between having the good prognosis signature and absence of (early) distant metastases or death. Lymph node negative and positive patients are evenly distributed, indicating that the prognosis profile is independent of lymph node status.

Table 9 summarizes the association between the prognosis profile and clinical parameters, which reveals that the prognosis profile is associated with histological grade, ER status and age, but not significantly with tumor diameter, vascular invasion, number of positive lymph nodes, or with treatment.

Table 9. Association of clinical parameters with the prognosis signature groups based on the expression of the preferred 70 prognostic marker genes.

* Poor versus good profile.

6. Prognostic value of gene expression signature

Distant metastasis-free probability and overall survival were calculated for all patients having tumors with either a "good" or "poor prognosis" signature (FIGS. 26A and 26B, Table 10). The resulting Kaplan-Meier curves showed a large difference in metastasis rate and overall survival between the "good prognosis" and "poor prognosis" signature patients. For metastasis as a first event, the hazard ratio (HR) for "poor" versus "good" signature over the whole follow-up period is estimated to be 5.1 (95% CI: 2.9-9.0; p<0.0001). The prognosis profile was even more significant for the first 5 years (HR 8.8; 95% CI: 3.8-20; p<0.0001) as compared to an HR of 1.8 (95% CI: 0.69-4.5; p=0.24) after 5 years. The HR for overall survival is 8.6 (95% CI: 4-19; p<0.0001).

The prognosis profile was first identified within a selected group of lymph node negative patients. Here, we wished to determine the performance of the prognostic signatures in both lymph node negative and positive patients. In the series of 151 lymph node negative patients (of the 295 patient cohort), the prognosis profile performed extremely well in predicting outcome of disease (FIGS. 26C, 26D; Table 10). For this group of patients, the HR for developing distant metastases is 5.5 (95% CI 2.5-12.2; p<0.0001). To validate the odds ratio for metastasis development within five years estimated in our previous study (cross-validated odds ratio 15 (95% CI 4-56; p<0.0001)), we calculated the odds ratio for 67 new pN0 patients, who were selected the same way as before (patients with either distant metastases within five years (n=12), or who remained disease-free with a follow-up of at least 5 years (n=55)). The odds ratio of the prognosis classifier for metastases within five years in this validation set is 15.3 (95% CI 1.9-125, p=0.011) (2x2 table, data not shown), in good agreement with our previous findings. These consistent performance results on two sets of tumors highlight the value of the prognosis profile and the robustness of the profiling technology. Significantly, in the remaining group of 144 lymph node positive patients the prognosis profile was also strongly associated with outcome (FIGS. 26E, 26F, Table 10). Here, the hazard ratio for developing distant metastases is 4.5 (95% CI 2.0-10.2; p=0.0003).
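For readers unfamiliar with how an odds ratio is obtained from a 2x2 table of signature group versus outcome, the following sketch shows the arithmetic. The underlying 2x2 table is not shown in the text, so the counts below are purely hypothetical and do not reproduce the reported value of 15.3. The confidence interval uses the common log (Woolf) approximation, which is an assumption; the text does not state how its intervals were derived.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: "poor" signature, metastasis within 5 years
    b: "poor" signature, disease-free
    c: "good" signature, metastasis within 5 years
    d: "good" signature, disease-free
    All four counts must be non-zero for this approximation.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(odds ratio)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high

# Hypothetical counts, for illustration only.
estimate, low, high = odds_ratio_with_ci(20, 10, 5, 25)
print(f"odds ratio {estimate:.1f} (95% CI {low:.1f}-{high:.1f})")
```

The width of the interval is driven by the smallest cell counts, which is consistent with the wide interval reported above for a validation set containing only 12 patients with metastases.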

Table 10. Percentages metastasis-free and overall survival for the prognosis signature groups

§ No distant metastasis as first failure.

7. Multivariable analysis

Results from the multivariable analysis of distant metastases as first event including age, diameter, number of positive nodes, grade, vascular invasion, ER expression, treatment and the gene expression profile are shown in Table 11. The only independent predictive factors were the 70 gene expression profile, tumor diameter and adjuvant chemotherapy. During the period in which these patients were treated, the majority of premenopausal lymph node positive patients received adjuvant chemotherapy; lymph node negative patients usually did not receive adjuvant treatment. There was improved survival for patients who received adjuvant chemotherapy in this series of tumors. The 70 gene expression profile is by far the strongest predictor for distant metastases with an overall hazard ratio of 4.6 (95% CI: 2.3-9.2; p<0.0001). This is not unexpected, since the prognosis profile was established based on tumors from patients that all developed distant metastases within five years.

Table 11. Multivariable proportional hazard analysis for metastasis as first event of the prognosis profile in combination with clinicopathological variables.

The prognosis profile is also a strong predictor of developing distant metastases within the group of lymph node positive patients (see FIGS. 26E, 26F). This is remarkable, since the presence of lymph node metastases by itself is a strong predictor of poor survival. Because most patients with lymph node positive breast cancer in our study received adjuvant chemotherapy or adjuvant hormonal therapy (120 out of 144 patients), it is not possible to give the prognostic value of the profile in untreated lymph node positive patients. There is, however, no indication that there is a difference in the prognostic value of the prognosis profile between patients who received adjuvant chemotherapy compared to those who did not (data not shown).

A key question is whether the prognosis profile is a more useful clinical tool to determine eligibility for adjuvant systemic treatment than the presently used "St. Gallen" and "NIH-consensus" criteria, which are based on histological and clinical characteristics (see Goldhirsch et al., Meeting Highlights: International Consensus Panel on the Treatment of Primary Breast Cancer, Seventh International Conference on Adjuvant Therapy of Primary Breast Cancer, J. Clin. Oncol. 19(18):3817-3827 (2001); Eifel et al., National Institutes of Health Consensus Development Conference Statement: Adjuvant Therapy for Breast Cancer, November 1-3, 2000, J. Natl. Cancer Inst. 93(13):979-989 (2001)). FIG. 27 shows the Kaplan-Meier metastasis-free curves for the 151 lymph node negative patients, where the patients were classified as "good prognosis/low-risk" or "poor prognosis/high-risk" using the prognosis profile (FIG. 27A), the "St. Gallen" criteria (FIG. 27B) or the "NIH-consensus" criteria (FIG. 27C).

Two major conclusions can be drawn from this comparison. First, the prognosis profile assigns many more pN0 patients to the low-risk group than the traditional methods (38% for "profile", versus 15% for "St. Gallen" and 7% for "NIH consensus"). Second, low-risk patients identified by expression profiling have better metastasis-free survival than those classified by "St. Gallen" or "NIH consensus" criteria. Conversely, patients classified as high-risk according to their expression profile tend to develop distant metastases more often than the high-risk "St. Gallen" or "NIH consensus" patients. This indicates that both "St. Gallen" and "NIH" criteria misclassify a significant number of patients. Indeed, the high-risk group as defined by "NIH consensus" criteria contains a significant number of patients having a "good prognosis" signature and corresponding outcome (FIG. 27E). Conversely, the low-risk NIH group includes patients with a "poor prognosis" signature and outcome (FIG. 27G). Similar subgroups can be identified within the "St. Gallen" low- and high-risk patients (FIGS. 27D, 27F). Since both "St. Gallen" and "NIH" subgroups contain misclassified patients (who can be better identified through the prognosis signature), these patients are either over- or undertreated in present clinical practice. Tumor size is a major parameter used in the "NIH-consensus" criteria for adjuvant therapy selection. However, the data above (see Table 9) show that the ability to develop distant metastases is only partially dependent on tumor size and suggest that metastatic capacity in many tumors is an early and inherent genetic property.

The "good prognosis" group can be subdivided into two groups whose treatment regimens differ. The subgroups were determined by using another threshold in the correlation with the average profile of the good prognosis tumors. In the initial study that identified markers correlated with a good prognosis (see Example 4), we found that tumors having a correlation coefficient of greater than 0.636 (i.e., whose expression profiles correlated most strongly with the average expression profile of the "good prognosis" group) did not give rise to distant metastases. This was determined empirically for the 78 patient tumor samples by determining the correlation coefficient, in the ranked list, above which patients developed no distant metastases (data not shown). Thus, among the tumors previously identified as having a "good prognosis" signature, those that had a correlation coefficient exceeding 0.636 were classified as having a "very good prognosis" signature. Patients with such a "very good prognosis" signature in their tumor (FIGS. 28A-28F, upper line) have an even better outcome of disease than those having an "intermediate prognosis" signature (remaining "good prognosis" signature patients, correlation coefficient between 0.4 and 0.636, FIGS. 28A-28F, middle line). This is true for the entire cohort (FIGS. 28A, 28B) as well as for the lymph node negative (FIGS. 28C, 28D) and positive patients separately (FIGS. 28E, 28F).
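The three-category assignment described above can be sketched as follows, using the thresholds 0.4 and 0.636 stated in the text. The sketch assumes that the similarity measure is a Pearson correlation between a tumor's marker expression values and the average "good prognosis" profile. The function names, the five-value template and the sample profiles are hypothetical stand-ins for the 70-gene data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def prognosis_category(profile, good_template, lower=0.4, upper=0.636):
    """Assign a signature group from the correlation to the good-prognosis template."""
    c1 = pearson(profile, good_template)
    if c1 > upper:
        return "very good"
    if c1 > lower:
        return "intermediate"
    return "poor"

# Hypothetical 5-gene template and tumor profiles (the real profile uses 70 genes).
template = [0.5, -0.3, 0.2, -0.6, 0.1]
tumor_a = [1.0, -0.6, 0.4, -1.2, 0.2]    # same pattern as the template
tumor_b = [0.4, 0.3, -0.2, -0.5, 0.3]    # partially similar pattern
tumor_c = [-0.5, 0.3, -0.2, 0.6, -0.1]   # opposite pattern

for name, tumor in [("A", tumor_a), ("B", tumor_b), ("C", tumor_c)]:
    print(name, prognosis_category(tumor, template))
```

Because the correlation coefficient is insensitive to overall scale, tumor A falls in the "very good prognosis" group even though its values are twice those of the template; only the pattern across genes matters.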

Together, our data indicate that the prognosis profile is a more accurate tool to select lymph node negative premenopausal patients for adjuvant systemic therapy than the presently used consensus criteria and may even be useful to guide adjuvant therapy in lymph node positive patients. We propose the following treatment regimens based upon the particular marker expression signature:

(1) Lymph node negative patients having a tumor with a "very good prognosis" signature can be treated without adjuvant systemic therapy.

(2) Lymph node negative patients having a tumor with an "intermediate prognosis" signature can be treated with adjuvant hormonal therapy only. As 97% of tumors having the "intermediate prognosis" signature are ER positive, this group of patients should benefit from adjuvant hormonal treatment. Adding chemotherapy to the treatment regimen of this patient group would result in only marginal survival benefit.

(3) All other patients should receive adjuvant chemotherapy. Where the tumor is ER+, hormonal therapy is also recommended.

Implementation of the use of the prognostic profile in breast cancer diagnostics should result in improved and patient-tailored adjuvant systemic treatment, reducing both over- and undertreatment.

7. REFERENCES CITED

All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.

Many modifications and variations of the present invention can be made without departing from its spirit and scope, as will be apparent to those skilled in the art. The specific embodiments described herein are offered by way of example only, and the invention is to be limited only by the terms of the appended claims along with the full scope of equivalents to which such claims are entitled.

Claims

What is claimed is:

1. A method for classifying a breast cancer patient according to prognosis, comprising: a. determining the similarity between the level of expression of each of at least five genes for which markers are listed in Table 5 in a cell sample taken from said breast cancer patient, to control levels of expression for each respective said at least five genes to obtain a patient similarity value; b. providing selected first and second threshold values of similarity of said level of expression of each of said at least five genes to said control levels of expression to obtain first and second similarity threshold values, respectively, wherein said second similarity threshold indicates greater similarity to said control than does said first similarity threshold; and c. classifying said breast cancer patient as having a first prognosis if said patient similarity value exceeds said first and said second similarity threshold values, a second prognosis if said level of expression of said genes exceeds said first similarity threshold value but does not exceed said second similarity threshold value, and a third prognosis if said level of expression of said genes does not exceed said first similarity threshold value or said second similarity threshold value.

2. The method of claim 1, further comprising determining prior to step (a) said level of expression of said at least five genes.

3. The method of claim 1, wherein said determining in step (a) is carried out by a method comprising determining the degree of similarity between the level of expression of each of said at least five genes in a sample taken from said breast cancer patient to the level of expression of each of said at least five genes in a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis.

4. The method of claim 1, wherein said determining in step (a) is carried out by a method comprising determining the difference between the absolute expression level of each of said at least five genes and the average expression level of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis.

5. The method of claim 1, wherein said first threshold value and said second threshold value are coefficients of correlation to the mean expression level of each of said at least five genes in a pool of tumor samples obtained from a plurality of breast cancer patients who have had no relapse of breast cancer within five years of initial diagnosis.

6. The method of claim 5, wherein said first threshold similarity value and said second threshold similarity values are selected by a method comprising: a. rank ordering in descending order said tumor samples that compose said pool of tumor samples by the degree of similarity between the level of expression of each said at least five genes in each of said tumor samples to the mean level of expression of said at least five genes of the remaining tumor samples that compose said pool to obtain a rank-ordered list, said degree of similarity being expressed as a similarity value; b. determining an acceptable number of false negatives in said classifying step, wherein a false negative is a breast cancer patient for whom the expression levels of said at least five genes in said cell sample predicts that said breast cancer patient will have no distant metastases within the first five years after initial diagnosis, but who has had a distant metastasis within the first five years after initial diagnosis; c. determining a similarity value above which in said rank ordered list fewer than said acceptable number of tumor samples are false negatives; d. selecting said similarity value determined in step (c) as said first threshold similarity value; and e. selecting a second similarity value, greater than said first similarity value, as said second threshold similarity value.

7. The method of claim 6, wherein said second threshold similarity value is selected in step (e) by a method comprising determining which of said tumor samples, taken from said breast cancer patients having a distant metastasis within the first five years after initial diagnosis, in said rank ordered list has the greatest similarity value, and selecting said greatest similarity value as said second threshold similarity value.
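By way of illustration only, the threshold selection recited in claims 6 and 7 can be sketched in Python as follows. The function name `select_thresholds`, its parameters, and the example values are hypothetical and do not appear in the specification. The sketch assumes that a similarity value has already been computed for each tumor sample in the pool, and it reads "above which" in step (c) of claim 6 as "strictly greater than".

```python
from typing import List, Tuple


def select_thresholds(
    similarities: List[float],
    had_metastasis: List[bool],
    acceptable_false_negatives: int,
) -> Tuple[float, float]:
    """Return (first_threshold, second_threshold) for a pool of tumor samples.

    similarities[i] is the similarity value of sample i (assumed precomputed);
    had_metastasis[i] is True if that patient had a distant metastasis within
    five years of initial diagnosis.
    """
    if len(similarities) != len(had_metastasis):
        raise ValueError("one metastasis flag is required per similarity value")
    if acceptable_false_negatives < 1:
        raise ValueError("acceptable_false_negatives must be at least 1")

    # Claim 6, step (a): rank order the samples in descending order of similarity.
    ranked = sorted(zip(similarities, had_metastasis),
                    key=lambda pair: pair[0], reverse=True)

    # Similarity values of the metastatic samples, still in descending order.
    # Any of these ranked above a threshold would be a false negative.
    metastatic = [sim for sim, metastasis in ranked if metastasis]
    if not metastatic:
        raise ValueError("the pool contains no metastatic samples")

    # Claim 6, steps (c) and (d): strictly above the similarity value of the
    # N-th metastatic sample there are only N - 1 false negatives, which is
    # fewer than the acceptable number N. If the pool holds fewer than N
    # metastatic samples, fall back to the lowest-ranked one (an assumption).
    index = min(acceptable_false_negatives, len(metastatic)) - 1
    first_threshold = metastatic[index]

    # Claim 7: the greatest similarity value among the metastatic samples.
    second_threshold = metastatic[0]
    return first_threshold, second_threshold


# Hypothetical pool of seven samples, four of them metastatic.
example_similarities = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.2]
example_metastasis = [False, False, True, False, True, True, True]
first, second = select_thresholds(example_similarities, example_metastasis, 3)
```

Under this reading the two thresholds coincide when the acceptable number of false negatives is one, so step (e) of claim 6, which requires the second threshold to be greater than the first, is satisfied only for larger acceptable numbers.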

8. The method of claim 6, wherein said first and second threshold similarity values are correlation coefficients, and said first threshold similarity value is 0.4 and said second threshold similarity value is greater than 0.4.

9. The method of claim 6, wherein said first and second threshold similarity values are correlation coefficients, and said second threshold similarity value is 0.636.
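The three-way classification of claim 1, with the threshold values recited in claims 8 and 9, might be sketched in Python as follows. The function name `classify_prognosis` is hypothetical. The prognosis labels follow the mapping given in claim 21, and the inclusive comparisons ("equals or exceeds") follow claim 13 rather than the bare "exceeds" of claim 1; both choices are assumptions of this sketch.

```python
def classify_prognosis(
    patient_similarity: float,
    first_threshold: float = 0.4,     # value recited in claim 8
    second_threshold: float = 0.636,  # value recited in claim 9
) -> str:
    """Classify a patient from a similarity value and two thresholds."""
    # The second threshold must indicate greater similarity to the control
    # than the first threshold does.
    if second_threshold <= first_threshold:
        raise ValueError("second_threshold must be greater than first_threshold")

    # Meets both thresholds: first prognosis.
    if patient_similarity >= second_threshold:
        return "very good prognosis"
    # Meets the first threshold only: second prognosis.
    if patient_similarity >= first_threshold:
        return "intermediate prognosis"
    # Meets neither threshold: third prognosis.
    return "poor prognosis"


# Hypothetical patient similarity values.
labels = [classify_prognosis(value) for value in (0.70, 0.50, 0.10)]
```

Because the second threshold is required to be greater than the first, a single pass from the higher threshold to the lower one is sufficient to place a patient in exactly one of the three classes.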

10. A method of assigning a therapeutic regimen to a breast cancer patient, comprising: a. classifying said patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on the basis of the levels of expression of at least five genes for which markers are listed in Table 5; and b. assigning said patient a therapeutic regimen, said therapeutic regimen (i) comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or (ii) comprising chemotherapy if said patient has any other combination of lymph node status and expression profile.

11. A method of assigning a therapeutic regimen to a breast cancer patient, comprising: a. determining the lymph node status for said patient; b. determining the level of expression of at least five genes for which markers are listed in Table 5 in a cell sample from said patient, thereby generating an expression profile; c. classifying said patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on the basis of said expression profile; and d. assigning said patient a therapeutic regimen, said therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and classification.

12. The method of claim 11 in which said therapeutic regimen assigned to lymph node negative patients classified as having an "intermediate prognosis" additionally comprises adjuvant hormonal therapy.

13. The method of claim 11, wherein said classifying step (c) is carried out by a method comprising: a. rank ordering in descending order a plurality of breast cancer tumor samples that compose a pool of breast cancer tumor samples by the degree of similarity between the level of expression of said at least five genes in each of said tumor samples and the level of expression of said at least five genes across all remaining tumor samples that compose said pool, said degree of similarity being expressed as a similarity value; b. determining an acceptable number of false negatives in said classifying step, wherein a false negative is a breast cancer patient for whom the expression levels of said at least five genes in said cell sample predicts that said breast cancer patient will have no distant metastases within the first five years after initial diagnosis, but who has had a distant metastasis within the first five years after initial diagnosis; c. determining a similarity value above which in said rank ordered list said acceptable number of tumor samples or fewer are false negatives; d. selecting said similarity value determined in step (c) as a first threshold similarity value; e. selecting a second similarity value, greater than said first similarity value, as a second threshold similarity value; and f. determining the similarity between the level of expression of each of said at least five genes in a breast cancer tumor sample from the breast cancer patient and the level of expression of each of said respective at least five genes in said pool, to obtain a patient similarity value, wherein if said patient similarity value equals or exceeds said second threshold similarity value, said patient is classified as having a "very good prognosis"; if said patient similarity value equals or exceeds said first threshold similarity value, but is less than said second threshold similarity value, said patient is classified as having an "intermediate prognosis"; and if said patient similarity value is less than said first threshold similarity value, said patient is classified as having a "poor prognosis."

14. The method of claim 11 which further comprises determining the estrogen receptor (ER) status of said patient, wherein if said patient is ER positive and lymph node negative, said therapeutic regimen assigned to said patient additionally comprises adjuvant hormonal therapy.

15. The method of claim 11, wherein said patient is 52 years of age or younger.

16. The method of claim 11 or 15, wherein said patient has stage I or stage II breast cancer.

17. The method of claim 11, wherein said patient is premenopausal.
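The regimen assignment of claims 10 and 11, together with the hormonal therapy additions of claims 12 and 14, might be sketched in Python as follows. The function name `assign_regimen`, the dictionary keys, and the combination of claims 12 and 14 in a single function are hypothetical choices made for illustration. The sketch also assumes that the "good prognosis" recited in step (d) of claim 11 refers to the "very good prognosis" class of step (c).

```python
from typing import Dict

PROGNOSIS_CLASSES = ("poor prognosis", "intermediate prognosis", "very good prognosis")


def assign_regimen(
    prognosis: str,
    lymph_node_negative: bool,
    er_positive: bool = False,
) -> Dict[str, bool]:
    """Return which therapies the assigned regimen comprises."""
    if prognosis not in PROGNOSIS_CLASSES:
        raise ValueError(f"unknown prognosis class: {prognosis!r}")

    # Claims 10 and 11: no adjuvant chemotherapy only for lymph node negative
    # patients classified as very good or intermediate prognosis; any other
    # combination of lymph node status and classification receives chemotherapy.
    no_chemotherapy = lymph_node_negative and prognosis in (
        "very good prognosis",
        "intermediate prognosis",
    )

    # Claim 12: lymph node negative, intermediate prognosis patients
    # additionally receive adjuvant hormonal therapy.
    # Claim 14: ER positive, lymph node negative patients additionally
    # receive adjuvant hormonal therapy.
    hormonal = lymph_node_negative and (
        prognosis == "intermediate prognosis" or er_positive
    )

    return {
        "chemotherapy": not no_chemotherapy,
        "adjuvant_hormonal_therapy": hormonal,
    }


# Hypothetical patient: lymph node negative, intermediate prognosis.
example_regimen = assign_regimen("intermediate prognosis", lymph_node_negative=True)
```

Claims 12 and 14 are separate dependent claims; applying both conditions at once, as the sketch does, is only one possible reading.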

18. A computer program product for classifying a breast cancer patient according to prognosis, the computer program product for use in conjunction with a computer having a memory and a processor, the computer program product comprising a computer readable storage medium having a computer program encoded thereon, wherein said computer program product can be loaded into the one or more memory units of a computer and causes the one or more processor units of the computer to execute the steps of: a. receiving a first data structure comprising the respective levels of expression of each of at least five genes for which markers are listed in Table 5 in a cell sample taken from said patient; b. determining the similarity of the level of expression of each of said at least five genes to respective control levels of expression of said at least five genes to obtain a patient similarity value; c. comparing said patient similarity value to selected first and second threshold values of similarity of said respective levels of expression of each of said at least five genes to said respective control levels of expression of said at least five genes, wherein said second threshold value of similarity indicates greater similarity to said respective control levels of expression of said at least five genes than does said first threshold value of similarity; and d. classifying said patient as having a first prognosis if said patient similarity value exceeds said first and said second threshold similarity values; a second prognosis if said patient similarity value exceeds said first threshold similarity value but does not exceed said second threshold similarity value; and a third prognosis if said patient similarity value does not exceed said first threshold similarity value or said second threshold similarity value.

19. The computer program product of claim 18, wherein said first threshold value of similarity and said second threshold value of similarity are values stored in said computer.

20. The computer program product of claim 18, wherein said respective control levels of expression of said at least five genes are stored in said computer.

21. The computer program product of claim 18 wherein said first prognosis is a "very good prognosis"; said second prognosis is an "intermediate prognosis"; and said third prognosis is a "poor prognosis"; wherein said computer program may be loaded into the memory and further cause said one or more processor units of said computer to execute the step of assigning said breast cancer patient a therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node negative and is classified as having a good prognosis or an intermediate prognosis, or comprising chemotherapy if said patient has any other combination of lymph node status and expression profile.


22. The computer program product of claim 18 wherein said computer program may be loaded into the memory and further causes said one or more processor units of the computer to execute the steps of receiving a data structure comprising clinical data specific to said breast cancer patient.

23. The computer program product of claim 18 wherein said respective control levels of expression of said at least five genes comprises a set of single-channel mean hybridization intensity values for each of said at least five genes, stored on said computer readable storage medium.

24. The computer program product of claim 23 wherein said single-channel mean hybridization intensity values are log transformed.

25. The computer program product of claim 18 wherein said computer program product causes said processing unit to perform said comparing step (c) by calculating the difference between the level of expression of each of said at least five genes in said cell sample taken from said breast cancer patient and said respective control levels of expression of said at least five genes.

26. The computer program product of claim 18 wherein said computer program product causes said processing unit to perform said comparing step (c) by calculating the mean log level of expression of each of said at least five genes in said control to obtain a control mean log expression level for each gene, calculating the log expression level for each of said at least five genes in a breast cancer sample from said patient to obtain a patient log expression level, and calculating the difference between the patient log expression level and the control mean log expression for each of said at least five genes.

27. The computer program product of claim 18 wherein said computer program product causes said processing unit to perform said comparing step (c) by calculating similarity between the level of expression of each of said at least five genes in said cell sample taken from said patient and said respective control levels of expression of said at least five genes, wherein said similarity is expressed as a similarity value.

28. The computer program product of claim 27 wherein said similarity value is a correlation coefficient.
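The computations recited in claims 23 through 28 might be sketched in Python as follows. All function names and example intensities are hypothetical. The sketch assumes a base-10 logarithm (the claims recite only that values are "log transformed") and a Pearson correlation coefficient (the claims recite only "a correlation coefficient").

```python
import math
from typing import List, Sequence


def control_mean_log(control_samples: Sequence[Sequence[float]]) -> List[float]:
    """Mean log10 intensity per gene across the control samples (claims 23, 24, 26)."""
    if not control_samples:
        raise ValueError("at least one control sample is required")
    n_genes = len(control_samples[0])
    if any(len(sample) != n_genes for sample in control_samples):
        raise ValueError("every control sample needs one intensity per gene")
    return [
        sum(math.log10(sample[g]) for sample in control_samples) / len(control_samples)
        for g in range(n_genes)
    ]


def log_differences(patient: Sequence[float], control_means: Sequence[float]) -> List[float]:
    """Patient log10 expression minus control mean log expression, per gene (claim 26)."""
    if len(patient) != len(control_means):
        raise ValueError("patient and control must cover the same genes")
    return [math.log10(x) - mean for x, mean in zip(patient, control_means)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient, one possible similarity value (claims 27, 28)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least two values")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical single-channel intensities for five genes.
control = [[10.0, 100.0, 1000.0, 10.0, 100.0],
           [10.0, 100.0, 1000.0, 10.0, 100.0]]
patient = [100.0, 1000.0, 10000.0, 100.0, 1000.0]

means = control_mean_log(control)
differences = log_differences(patient, means)
similarity = pearson([math.log10(x) for x in patient], means)
```

In this hypothetical example every patient intensity is ten times the control intensity, so the per-gene log differences are all approximately 1 while the correlation coefficient is approximately 1. This illustrates that the difference of claims 25 and 26 is sensitive to overall intensity, whereas the correlation coefficient of claim 28 reflects only the pattern across genes.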
