2. Technical description of the creation of the list and the QSAR models used, Danish Environmental Protection Agency

Report on the Advisory list for selfclassification of dangerous substances

2. Technical description of the creation of the list and the QSAR models used

2.1	Introduction
2.1.1	SAR / QSAR
2.1.2	The domain of the models
2.1.3	Accuracy of the model predictions
2.1.4	Software
2.2	Methodology in making the list
2.2.1	The selected dangerous properties
2.2.2	The evaluated chemical substances
2.2.3	Test data
2.2.4	Use of QSAR models
2.2.5	The result
2.3	Acute oral toxicity
2.4	Sensitization by skin contact
2.5	Mutagenicity
2.6	Carcinogenicity
2.7	Danger to the aquatic environment

2.1 Introduction

In a field developing as rapidly as QSAR is today, there will always be better models, better validations and new endpoints becoming available - and consequently never a "right" time to release advisory classifications based on them. It is felt, however, that considerable information has been accumulated which can now be of help in the otherwise difficult task of assessing the toxicology of many thousands of untested chemicals. This knowledge may also help to direct future testing plans to the areas where testing is most urgently needed.

2.1.1 SAR / QSAR

The concept that similar structures will have similar properties is not new. As early as the 1890s it was discovered, for example, that the anaesthetic potency of substances to aquatic organisms was related to their oil/water solubility ratios, a relationship which led to the use of LogP (octanol/water) as a predictor of this effect. Today it is known that all chemicals will exhibit a minimum or "basal" narcotic effect, which is related to their absorption to cell membranes, and which is well predicted by their lipophilic profile.

SARs and QSARs ((Quantitative) Structure Activity Relationships) are based on a comparison of the structure and physico-chemical properties (descriptors) with measured parameters or endpoints for a range of chemicals called a training set.

The endpoint may for instance be another physical-chemical property or it may be a biological effect. The descriptors may include LogP, molecular indices, quantum mechanical properties, shape, size, charge, distributions, etc. The comparison is often made with statistical tools. The goal is to determine which descriptor(s) are in an essential way connected with the endpoint in question, and to set up a relationship between these descriptors and the endpoint.

When the result is expressed qualitatively the relationship is a SAR, and when the result is expressed quantitatively the relationship is a QSAR. A QSAR is a relation between the quantitative descriptors for chemical substances and a more or less graduated scale of property or effect.

Once a correlation between structure / properties and the endpoint is established, it can be used to predict the endpoint for other chemicals for which the descriptors are known or can be estimated. In general, the correlations are developed and applied by computer.

2.1.2 The domain of the models

The domain limits the use of a QSAR to the endpoint being modelled and to the group of substances for which it is valid. The domain of the QSAR is defined by the selection of the training set; the coverage of the descriptors of the training set defines the "area" of "the chemical universe" for which the model is valid.
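One very simple way to picture a descriptor-based domain is a range check: a query substance counts as inside the domain only if each of its descriptors lies within the span seen in the training set. The sketch below illustrates that idea only. It is not the method used by TOPKAT or M-CASE, and the function names and descriptor names are hypothetical.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]


def descriptor_ranges(training_set: List[Descriptors]) -> Dict[str, Tuple[float, float]]:
    """Return the (min, max) of every descriptor found in the training set."""
    names = training_set[0].keys()
    return {
        name: (
            min(chem[name] for chem in training_set),
            max(chem[name] for chem in training_set),
        )
        for name in names
    }


def in_domain(query: Descriptors, ranges: Dict[str, Tuple[float, float]]) -> bool:
    """True if every descriptor of the query lies inside the training-set range."""
    return all(low <= query[name] <= high for name, (low, high) in ranges.items())
```

A range check of this kind is coarse: it treats each descriptor independently and says nothing about how densely the training set covers the space between the limits.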

2.1.3 Accuracy of the model predictions

In order to check a model’s predictive ability it should be validated. Validation is a trial of the model’s performance on a set of substances independent of the training set, but within the domain of the model. The model predictions for these substances are compared with measured endpoints for the same substances in order to establish the accuracy of the model.

Ideally all models should be assessed by seeing how well they predict the activity of chemicals that were not used to make them. This is not, however, always simple. Partly, valuable information may be lost by setting aside chemicals for such an evaluation; partly, it can be extremely difficult to assess how "external" chemicals relate to the model’s domain, that is, whether they are randomly distributed within it and thereby give a fair picture of the performance of the model.

This problem is often addressed by using one or another form of cross-validation. Statistical evaluation is an extremely important method of determining the performance of these models, and in some cases (where there is little or no test data to be found which was not used to develop the model) it is the only method available.

The validation techniques most commonly mentioned in this report include the "drop one" "Q²" procedure, where one substance at a time is removed, and then predicted by a model made on the remainder of the training set. This is done once for every substance. While widely used, this form of cross-validation can have a tendency to over-predict goodness of fit.

A more robust technique for these data sets is for example the "3x10% out", which consists of removing a random sample of 10% three times, and each time making a new model which is then used to predict the excluded chemicals. Instead of running this process three times it can be run until all of the chemicals have been estimated. However, three runs will generally be sufficient to establish the correlation /50/.
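The two cross-validation schemes described above can be sketched for a toy model with a single descriptor. This is an illustration under stated assumptions, not the procedure implemented in any of the programs discussed here: the model is an ordinary least-squares line, Q² is taken as 1 - PRESS / (sum of squares around the mean of the observed values), and all function names are hypothetical.

```python
import random
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2(observed: List[float], predicted: List[float]) -> float:
    """Q2 = 1 - PRESS / total sum of squares of the observed values."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / total


def leave_one_out_q2(xs: List[float], ys: List[float]) -> float:
    """'Drop one': each substance is predicted by a model built on all the others."""
    predicted = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predicted.append(slope * xs[i] + intercept)
    return q2(ys, predicted)


def ten_percent_out_q2(xs: List[float], ys: List[float],
                       runs: int = 3, fraction: float = 0.10, seed: int = 0) -> float:
    """'3x10% out': remove a random 10% three times, refit, predict the removed ones."""
    rng = random.Random(seed)
    n = len(xs)
    k = max(1, round(n * fraction))
    observed, predicted = [], []
    for _ in range(runs):
        held_out = set(rng.sample(range(n), k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        for i in sorted(held_out):
            observed.append(ys[i])
            predicted.append(slope * xs[i] + intercept)
    return q2(observed, predicted)
```

In the "drop one" scheme every refit sees almost the whole training set, which is one reason it can look more optimistic than a scheme that withholds 10% at a time.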

For the validation of a parametric model the result can be expressed as the sensitivity, the specificity and the concordance of the model. The sensitivity is a measure of how well the model "catches" the substances with the effect being modelled. A sensitivity of 80% means that 80% of the "true positives" in the validation set were correctly predicted as positives, and that the remaining 20% were falsely predicted as negatives (false negatives). The specificity is a measure of how well the model avoids false positives. A specificity of 80% means that 80% of the "true negatives" in the validation set were correctly predicted as negatives, and that the remaining 20% of the negatives were falsely predicted as positives (false positives). The concordance is an overall measure of the correctness of the predictions. A concordance of 80% means that 80% of the substances in the validation set were correctly predicted as positives or negatives, and that the remaining 20% are the false predictions (false negatives and false positives).
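The three measures follow directly from counting the four possible outcomes in a validation set. A minimal sketch, with a hypothetical function name, assuming the validation set contains at least one measured positive and one measured negative:

```python
from typing import Dict, Sequence


def validation_statistics(observed: Sequence[bool],
                          predicted: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity and concordance for positive/negative calls."""
    pairs = list(zip(observed, predicted))
    true_pos = sum(1 for o, p in pairs if o and p)
    false_neg = sum(1 for o, p in pairs if o and not p)
    true_neg = sum(1 for o, p in pairs if not o and not p)
    false_pos = sum(1 for o, p in pairs if not o and p)
    return {
        # share of measured positives that were predicted positive
        "sensitivity": true_pos / (true_pos + false_neg),
        # share of measured negatives that were predicted negative
        "specificity": true_neg / (true_neg + false_pos),
        # share of all substances that were predicted correctly
        "concordance": (true_pos + true_neg) / len(pairs),
    }
```

For example, with 10 measured positives of which 8 are predicted positive, and 10 measured negatives of which 9 are predicted negative, the function returns a sensitivity of 0.8, a specificity of 0.9 and a concordance of 0.85.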

Predictive ability will vary depending on both the method used and the endpoint in question. In general, contemporary QSAR systems can often correctly predict the activity of about 70 – 85% of the chemicals examined, provided that the query structures are within the domains of the models /53,54/. This also applies to the models described in this paper. Of course, a model can never be more accurate than the test data on which it was based. Therefore it is extremely important to be aware of the accuracy and reproducibility of the test data used for making a model. If a biological test gives the wrong result 17% of the time, the "perfect" model based on these tests would also be wrong 17% of the time.
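The ceiling that noisy test data puts on measured accuracy can be illustrated with a small simulation. The sketch assumes a hypothetical model that always predicts the true activity, and a test that reports the wrong result at random with a fixed error rate; the 17% figure is the one used in the text, everything else is invented for illustration.

```python
import random


def concordance_of_perfect_model(test_error_rate: float,
                                 n: int = 100_000, seed: int = 0) -> float:
    """Agreement between a model that knows the truth and an imperfect test."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(n):
        truth = rng.random() < 0.5                  # true activity of the substance
        test_is_wrong = rng.random() < test_error_rate
        measured = (not truth) if test_is_wrong else truth
        prediction = truth                          # the "perfect" model
        agreements += prediction == measured
    return agreements / n
```

Calling `concordance_of_perfect_model(0.17)` gives a value close to 0.83: judged against the test results, even a perfect model appears to be wrong about 17% of the time.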

In addition to assessing the predictive ability of a model, it is also necessary to consider the context in which it will be used. In some cases a large number of "false positives" or "false negatives" may be acceptable, while in others they will not be. In this exercise there was no deliberate attempt to adjust the weight of these factors in either direction. The specific "context" in which these models have been used is simply that where there are no tests or other information available, the alternative is that the substance is not assessed at all for the endpoints covered.

2.1.4 Software

Today numerous computerized systems exist for predicting a large range of effects, from biodegradability to cancer. These include fragment based^* statistical systems such as TOPKAT and M-CASE, as well as three-dimensional modelling of ligand docking^** such as Comparative Molecular Field Analysis (CoMFA). Mention should also be made of OASIS /46,47/, a sophisticated program package able to estimate a wide variety of effects using 3-D and quantum mechanical parameters, and which is currently being used to estimate binding of chemicals to estrogen receptors /48/.

In essence, these programs don’t really do anything "new." They are simply grouping substances with similar structures and similar effects, including use of global or local parameters such as LogP and electrophilicity in much the same way as an expert might do. However, they do this at very high speed and take account of a large number of factors simultaneously (such as critical inter-atomic distances) which can assist an expert in finding hitherto unobserved relationships. In addition, the programs TOPKAT and M-CASE described below, emulate another human characteristic, and reject estimates for chemicals where there is simply not enough information to provide a sound prediction. They accomplish this by iterative statistical methods rather than by human intelligence or intuition.

M-CASE

M-CASE is a knowledge-based artificial intelligence system capable of learning directly from data. Models made in this program can predict various toxic endpoints on the basis of discrete structural fragments found to be statistically relevant to a specific biological activity, either increasing or decreasing it. The program can thus provide a "chemical" explanation for observed biological properties. It assumes that the presence of fragments previously found in a number of active compounds is indicative of potential activity. This fragment-based method is assumed to be a reasonable basis for assessing the activity of new molecules. On the basis of the fragments present in a query molecule the program will estimate a value for its potency by using "local QSARs" for the various fragments. Where found, "global QSARs" such as the relation between LogP and toxicity to aquatic organisms may also be included in the model. The program gives a warning if there are fragments in the query molecule that are not found in the training set of the model, indicating that the query molecule is outside the domain of the model /38,43/. Estimates for substances found to be within the domain of the model and for which sound predictions could be made are referred to as AOKs ("All OK chemicals") in this paper.

TOPKAT

TOPKAT assesses the toxicity of chemicals from their molecular structure, utilizing QSTR (Quantitative Structure Toxicity Relationship) models for assessing specific adverse health effects /56/. When the program is queried by entering a code for a chemical structure, it first determines the compound class of the structure for those models which have class-specific sub-models. Next, the system computes the descriptors needed for the specific toxicity model. These consist of, for example, electrotopological state, kappa indices, molecular weight and symmetry indices. Then the program checks whether all the fragments present in the query molecule were present with adequate frequency in the training set for the specific equation. If there are no missing fragments, the program next checks whether the query is within the optimum prediction space of the equation. If this is the case, the training set of the model is searched for the compounds most similar to the query molecule, and the concordance between the actual and predicted values for those compounds is determined /45/. If there is reasonable agreement between observed / predicted values for the four most similar substances, the estimate is accepted. Such estimates are referred to as AOKs in this paper.

Epiwin

This suite of programs developed by Syracuse Research Corporation was used to estimate three ecotoxicological parameters – Biodegradation, LogP and Bioconcentration. Unlike TOPKAT and M-CASE, Epiwin does not attempt to define a predictive space, and all estimates were used "as is".

Chem-X

This program has features for making estimates for a large number of physical-chemical properties of chemicals, making 2D- and 3D-QSARs and storing large amounts of data and chemical structures in databases.

The Danish EPA has built up a database in Chem-X which contains QSAR predictions for about 166,000 substances /55/, including almost all of the discrete organic chemicals in Einecs, a total of approximately 47,000 substances. Estimates are available for a number of endpoints covering both health and environmental concerns. The QSAR estimates for these chemicals form the background for the recommended selfclassifications. Detailed facilities for searching, displaying and manipulating chemical structures are also available in this data package. This tool was used extensively to compare test data, predictions and selected substructures while performing "expert" assessment of the QSARs.

Possibilities for dissemination of this database and the detailed QSAR predictions are currently unclear due to issues of copyright.

2.2 Methodology in making the list

2.2.1 The selected dangerous properties

The following endpoints were addressed:

Acute oral toxicity

Sensitization by skin contact

Mutagenicity

Carcinogenicity

Danger to the aquatic environment

2.2.2 The evaluated chemical substances

The overall purpose of the project was to evaluate as many as possible of the substances in Einecs (European Inventory of Existing Commercial Chemical Substances) /2/. The list consists of 100,116 entries, covering organic and inorganic substances in both single substance entries and mixtures.

The screening was limited to cover "discrete organics," meaning that UVCBs (Unknown, Variable Composition and Biologicals) and other ill-defined structures or mixtures were excluded for practical reasons – if you don’t know what it is, you can’t really make a model. Exceptions were made where this seemed logical (C12 – C16 n-alcohols have been entered as C14 n-alcohol, hydrochloride salts have been entered as the parent compound, etc.).

Inorganic substances have likewise not been evaluated. These are usually better approached by simpler methods of evaluating the availability of the respective an- and cations with well known hazard profiles. "Organo-metallics" have also been excluded as being poor candidates for modeling. Finally, as a matter of resources, only such chemicals as were available with 3-D structural information were used /7/.

In so far as this was possible using a CAS number comparison, all substances already classified on Annex I of the formal EU list (List of dangerous substances) were also removed as they should never be the subject of provisional classification.

This resulted in a total of 46,707 or about half of all Einecs chemicals, which could be subjected to screening.

2.2.3 Test data

For the vast majority of the chemicals no measured data was available. However, if measured data were available as part of the model, this was generally used in preference to the estimates.

It is important to stress that no attempt was made to search the world’s published or unpublished databases for toxicological information to determine whether a QSAR was even necessary for each endpoint. This task is the responsibility of the manufacturer / importer of the individual chemicals.

2.2.4 Use of QSAR models

The technical specifications for the models and a description of the criteria for assignment of advisory classifications for each effect are given in the technical sections for the individual endpoints.

It should also be stressed that the models available do not predict a "classification" – they predict biological activity that may lead to a classification. Further criteria have therefore been applied to each endpoint to try and link the biological prediction with a risk phrase. Because of the large number of chemicals involved, "rules" were used to achieve this purpose. Such rules are also imperfect, but in essence the process is no different than that imposed upon a human expert forced to use common sense to provide a provisional classification for any given substance for which the desired test data does not exist.

Only model predictions that satisfied formal criteria were used:

For TOPKAT the predictions had to fall within the optimum prediction space of the model, and the four most statistically relevant observed/predicted chemicals referenced by the model had to be in acceptable agreement. The predictions fulfilling these criteria are referred to as AOKs.

For M-CASE the predictions had to fall within the optimum prediction space of the model, meaning that there were no unknown fragments, and that there was sufficient knowledge about the known fragments to give an unequivocal prediction.

As described in the technical sections, expert inspection has been undertaken where time allowed to confirm the probable activities given by the QSARs. This has included evaluation of the QSAR estimates in comparison with known biological activities and chemical properties. No in depth toxicological assessment of the individual chemical substances has been undertaken. Questionable QSAR predictions for each endpoint were excluded.

The effort used on expert inspection varied with the endpoint in question. In general most time was used in assessing the predictions for Mutagenicity and Carcinogenicity, and least was used on Allergy and Aquatic Effects.

2.2.5 The result

It is important to understand that the results as given in the Advisory list only represent POSITIVE predictions. No distinction has been made between a negative prediction for an endpoint, and an unreliable prediction (a non-AOK prediction) which was simply discarded.

Evaluated substances not on the list, or substances which are on the list but without advisory classifications for one or more of the selected dangerous properties, may have been predicted as not having this / these dangerous properties, or the models may not have been valid for this substance.

Therefore the advisory list cannot be used to conclude that these substances do not possess dangerous properties. Depending on the endpoint in question, unreliable predictions were obtained for between 5 and 65% of the chemicals examined.

2.3 Acute oral toxicity

EU criteria for classification

The formalized criteria for classification for acute oral toxicity include a number of test options, including the fixed-dose procedure, and the interpretation of various sources of information about acute oral toxicity. Classification is, however, often based on acute LD₅₀ tests in the rat, for which the following classification criteria are used:

Table 3
Classification criteria

Classification criteria	Classification
LD₅₀ oral, rat ≤ 25 mg/kg	T+;R28 (very toxic; very toxic if swallowed)
25 mg/kg < LD₅₀ oral, rat ≤ 200 mg/kg	T;R25 (toxic; toxic if swallowed)
200 mg/kg < LD₅₀ oral, rat ≤ 2,000 mg/kg	Xn;R22 (harmful; harmful if swallowed)
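The thresholds in Table 3 translate into a simple lookup. The sketch below reads the upper limits as inclusive ("less than or equal to"); the function names are hypothetical. The second function reflects the simplification used for the advisory list, where only Xn;R22 is assigned.

```python
from typing import Optional


def classify_acute_oral(ld50_mg_per_kg: float) -> Optional[str]:
    """EU classification from a rat oral LD50 in mg/kg, following Table 3."""
    if ld50_mg_per_kg <= 25:
        return "T+;R28"   # very toxic if swallowed
    if ld50_mg_per_kg <= 200:
        return "T;R25"    # toxic if swallowed
    if ld50_mg_per_kg <= 2000:
        return "Xn;R22"   # harmful if swallowed
    return None           # above 2,000 mg/kg: no classification for this endpoint


def advisory_acute_oral(ld50_mg_per_kg: float) -> Optional[str]:
    """Advisory classification: Xn;R22 for any LD50 of 2,000 mg/kg or less."""
    return "Xn;R22" if ld50_mg_per_kg <= 2000 else None
```

A substance with an LD₅₀ of 150 mg/kg would thus be T;R25 under the formal criteria but would appear on the advisory list as Xn;R22.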

Evaluation based on QSAR models

An advisory classification of Xn;R22 is recommended in all cases where a rat oral LD₅₀ of ≤ 2,000 mg/kg is predicted or based on measured data. For reasons indicated below, no attempt was made to differentiate between the different levels of acute toxicity, and it is important to recognize that this classification will often be less stringent than classification based on measured data.

If test results measured in the rat were readily available (had been used to make the model) these took precedence over any predictions.

As acute toxicity data from the mouse following a variety of different routes of administration were also available in some cases, these were used to predict rat oral LD₅₀ values using the following QSARs in order of preference /8,9/:

Table 4

1.	Log LD₅₀ oral, rat = 0.731 + 0.841 * (Log LD₅₀ oral, mouse) RTECS data 1989, n = 3919, R² = 0.750, Q² = 0.749
2.	Log LD₅₀ oral, mouse = 0.682 + 0.373 * (Log LD₅₀ iv, mouse) + 0.518 * (Log LD₅₀ ip, mouse) RTECS data 1994, n = 286, R² = 0.766, Q² = 0.764
3.	Log LD₅₀ oral, mouse = 0.731 * (Log LD₅₀ iv, mouse) RTECS data 1994, n = 286, R² = 0.724, Q² = 0.724
4.	Log LD₅₀ oral, mouse = 0.945 + 0.802 * (Log LD₅₀ iv, mouse) RTECS data 1994, n = 286, R² = 0.689, Q² = 0.688

iv: Intravenous
ip: Intraperitoneal
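The way these relationships chain together can be sketched as follows. Several points are assumptions made for the illustration and are not stated in the text: logarithms are taken as base 10, LD₅₀ values are taken to be in mg/kg, and mouse data for other routes are first converted to a mouse oral value and then to a rat oral value with equation 1. Equation 3 is left out of the sketch because, as printed, it uses the same predictor as equation 4. All function names are hypothetical.

```python
import math
from typing import Optional


def rat_oral_from_mouse_oral(mouse_oral: float) -> float:
    """Equation 1: rat oral LD50 from mouse oral LD50."""
    return 10 ** (0.731 + 0.841 * math.log10(mouse_oral))


def mouse_oral_from_iv_and_ip(mouse_iv: float, mouse_ip: float) -> float:
    """Equation 2: mouse oral LD50 from mouse iv and ip LD50."""
    return 10 ** (0.682 + 0.373 * math.log10(mouse_iv) + 0.518 * math.log10(mouse_ip))


def mouse_oral_from_iv(mouse_iv: float) -> float:
    """Equation 4: mouse oral LD50 from mouse iv LD50 alone."""
    return 10 ** (0.945 + 0.802 * math.log10(mouse_iv))


def estimate_rat_oral(mouse_oral: Optional[float] = None,
                      mouse_iv: Optional[float] = None,
                      mouse_ip: Optional[float] = None) -> Optional[float]:
    """Use the best available mouse data, in the assumed order of preference."""
    if mouse_oral is None:
        if mouse_iv is not None and mouse_ip is not None:
            mouse_oral = mouse_oral_from_iv_and_ip(mouse_iv, mouse_ip)
        elif mouse_iv is not None:
            mouse_oral = mouse_oral_from_iv(mouse_iv)
        else:
            return None  # no usable mouse data in this sketch
    return rat_oral_from_mouse_oral(mouse_oral)
```

Because each step is a regression with its own scatter, an estimate that passes through two equations carries more uncertainty than one derived from mouse oral data directly.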

Biological data consisting of LD₅₀ values in mice or rats were available for just over 10% of the chemicals processed. If no biological data were available, rat oral LD₅₀ was estimated according to the QSTR model TOPKAT (v 5.01). According to TOPKAT, the model contains about 4,000 substances, and the developer’s own cross-validation for this endpoint indicates that 86-100% of estimations fall within a factor of five of test results /10/.

Danish EPA’s external evaluation of this model using 1,840 chemicals not contained in the TOPKAT data set gave somewhat poorer results; R² = 0.31. According to this evaluation 86% of estimations fall within a factor of ten from test results /11/. The distribution can be seen in table 5.

Table 5

Result predicted within a factor of:	%	N (cumulative)
2	42	671
4	67	1,069
6	78	1,235
8	83	1,323
10	86	1,368
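A table of this kind can be produced from paired predicted and measured values by counting how many pairs agree within each factor. A minimal sketch with hypothetical function names; "within a factor of k" is taken to mean that the larger of the two values divided by the smaller is at most k, and all values are assumed to be positive.

```python
from typing import Dict, List, Sequence, Tuple


def within_factor(predicted: float, measured: float, factor: float) -> bool:
    """True if the two values differ by no more than the given factor."""
    return max(predicted, measured) / min(predicted, measured) <= factor


def cumulative_within(pairs: List[Tuple[float, float]],
                      factors: Sequence[float] = (2, 4, 6, 8, 10)) -> Dict[float, float]:
    """Fraction of (predicted, measured) pairs falling within each factor."""
    n = len(pairs)
    return {
        f: sum(1 for predicted, measured in pairs if within_factor(predicted, measured, f)) / n
        for f in factors
    }
```

The comparison is symmetric, so an over-prediction by a factor of three and an under-prediction by a factor of three are counted alike.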

In modern LD₅₀ tests using small numbers of animals, statistical variation is often within a factor of 2-4, and inter-laboratory variations of up to a factor of 10 are not uncommon /12/. While the TOPKAT model is clearly not perfect, it is still considered sufficient to give an approximation for the suggested least strict classification for acute toxicity, Xn;R22. However, the accuracy of the model is not considered to be sufficient to differentiate between the three different levels of acute toxicity ("harmful", "toxic" and "very toxic"). It is therefore important to recognize that there will be substances which, though given the advisory classification Xn;R22 on the list, should be classified as T;R25 or T+;R28 on the basis of, for instance, animal tests.

Where TOPKAT was able to make a robust prediction (AOK) it found 57% of all chemicals to have an acute oral LD₅₀ in rat of ≤ 2,000 mg/kg. Among the 12,632 chemicals tested for acute toxicity in rat and listed in the Registry of Toxic Effects of Chemical Substances (RTECS 1998) /52/, the percentage with an acute LD₅₀ of ≤ 2,000 mg/kg was 61%. That these two percentages are so similar is not surprising, since RTECS data was also the chief source of biological information used to construct the TOPKAT model.

A schematic diagram of the systematic evaluation is given in figure 2.

Figure 2
The systematic evaluation

Approximately 10,200 compounds were estimated as having an acute LD₅₀ in rat of 2,000 mg/kg or less^***. About 700 were removed by expert judgement in an attempt to exclude amino-acid and protein-type compounds which were considered likely to break down due to the effects of gastric acidity, or substances for which gastric absorption was expected to be poor. This resulted in 9,538 substances with an advisory classification of Xn;R22.

2.4 Sensitization by skin contact

EU criteria for classification

Classification as sensitizing by skin contact, R43 ("May cause sensitization by skin contact"), is based either on animal studies or practical experience or combinations thereof. The animal criterion is based on either an adjuvant or non-adjuvant test.

Different adjuvant tests exist, but the Magnusson-Kligman method (GPMT: Guinea Pig Maximization Test) is preferred. A response in at least 30% of the animals results in classification. For a non-adjuvant test (for example the Buehler test) a response in at least 15% of the animals is regarded as positive. The human data can be results from patch testing, case studies or epidemiological studies.

Evaluation based on QSAR models

Two approaches were used to estimate contact sensitisation /14,15/.

The first approach uses two TOPKAT QSTR models. The first model was used to predict "Allergy versus non-allergy", and, in cases where this was positive, the second model was used to predict "Strong versus weak/moderate allergy". The models used were primarily related to the GPMT. Only predictions of "Strong allergy" were considered as being likely to fulfill the EU criteria for R43.

In a second approach, predictions were also made using M-CASE. The data set used to produce the M-CASE models differed somewhat from the TOPKAT set, in that both data from the GPMT and human data were represented. Only positive predictions with M-CASE scores of > 40 (corresponding to "very active") were selected.

Table 6
The models used

Model	Technical specifications
TOPKAT (v. 5.01 1998) No Sensitization vs Any	n=389 GPMT Cross validation result (Q²) /14/: Sensitivity 84-94% Specificity 87-96%
TOPKAT (v. 5.01 1998) Strong vs Weak/Moderate	n=266 GPMT Cross validation result (Q²) /14/: Sensitivity 88-96% Specificity 88-98%
M-CASE (v. 3.320 1999) Model A33: Allergic Contact dermatitis	n=1,034 GPMT or data from human experience Cross validation result (3*10% out) /15/: Sensitivity 69 – 89% Specificity 89 – 94% Chi² > 50, p < 0.0001

External validation of both TOPKAT and M-CASE models was also attempted using confidential results from the EU New Chemicals program. Using the two-stage TOPKAT model (n= 64 AOK predictions) 67% of positives were correctly identified, and 77% of negatives. For M-CASE, (n= 75 AOK predictions) 45% of positives were correctly predicted, and 81% of negatives /16/.

It is difficult to know how representative New Chemicals are with regard to the universe of Existing Chemicals. Generally New Chemicals are more complex structures with higher molecular weights. Perhaps the most surprising aspect of this exercise was to find that for over three thousand chemicals that should have been assessed for this endpoint, such a tiny percentage of useful test data could be found.

Compounds predicted as positive by either TOPKAT or M-CASE according to the above criteria were selected, provided that they were either AOK in the first, or contained no unknown fragments or equivocal results in the latter.
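The selection rule described above can be written as a short decision function. The record types and field names below are hypothetical and stand in for whatever the two programs actually report; the sketch encodes only the criteria stated in the text (two-stage TOPKAT prediction of "Strong allergy" that is AOK, or an M-CASE score above 40 with no unknown fragments and no equivocal result).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TopkatSensitization:
    aok: bool                 # prediction accepted by TOPKAT's own checks
    any_sensitization: bool   # first model: allergy versus non-allergy
    strong: bool              # second model: strong versus weak/moderate


@dataclass
class McaseSensitization:
    score: float              # M-CASE score; > 40 corresponds to "very active"
    unknown_fragments: int    # fragments not found in the training set
    equivocal: bool           # True if the prediction was not unequivocal


def suggest_r43(topkat: Optional[TopkatSensitization],
                mcase: Optional[McaseSensitization]) -> bool:
    """True if either model gives an acceptable positive prediction."""
    topkat_hit = (topkat is not None and topkat.aok
                  and topkat.any_sensitization and topkat.strong)
    mcase_hit = (mcase is not None and mcase.unknown_fragments == 0
                 and not mcase.equivocal and mcase.score > 40)
    return topkat_hit or mcase_hit
```

Because the rule is an "either/or", a substance outside the domain of one model can still be selected on the strength of the other.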

While it was considered to use "positive" in both models as a criterion, in the end this seemed inefficient, not so much due to lack of concordance between model predictions, but because the acceptance domains (AOK or all fragments known) of the two methods differed considerably.

No attempt was made to further reduce the list by systematically applying expert judgement.

A schematic diagram of the systematic evaluation is given in figure 3.

Figure 3
The systematic evaluation

9,668 chemicals met the above criteria, for which an advisory classification of R43 is suggested. This strikes many experts as being a rather large number of chemicals, and while these models represent the current "state-of-the-art" it may indicate that they are over-sensitive. However, it was very difficult to obtain any reliable indication of how many Existing Chemicals would cause contact allergy if actually tested in animals or humans. Estimates of percentages of allergens on Einecs ranged from 5-25%, with some preference being expressed for 10%, which is the percentage of Annex I substances currently classified for this effect. It is not possible, however, to estimate the influence of confounders on the distribution represented in Annex I. Positive bias may have been introduced because chemicals testing positive are over-represented. Negative bias may have been caused by the fact that most of the chemicals have never been tested at all. The question of numbers remains open.

2.5 Mutagenicity

EU criteria for classification

The criteria for classification for mutagenicity are divided into 3 different categories:

Classification as mutagen, category 1 (mut1;R46, may cause heritable genetic damage) is based on evidence of a causal association between human exposure to the substance and heritable genetic damage.

Classification as mutagen, category 2 (mut2;R46, may cause heritable genetic damage) is based on animal studies showing mutagenicity to germ cells either in assays on germ cells or by demonstrating mutagenic effects in somatic cells in vivo or in vitro as well as metabolic proof that the substance reaches the germ cells.

Classification as mutagen, category 3 (mut3;R40, possible risks of irreversible effects) is based either on in vivo mutagenicity tests or on cellular interactions, with in vitro tests acting as supportive evidence. For this classification, it is not necessary to demonstrate germ cell mutations.

Evaluation based on QSAR models

A number of models were applied for this endpoint. The different models predict a number of genotoxicity endpoints. A positive prediction for induction of micronuclei in vivo was required, as this demonstrates chromosomal damage in somatic cells in vivo. The remaining endpoints reflect in vitro genotoxicity, where positive results would not normally lead to classification for this effect. However, positive results for these endpoints provide supporting evidence for data from in vivo estimates.

Table 7
The models used

Model	Technical specifications
M-CASE (v. 3.320 1999) Model A2E: Structural Alerts for DNA Reactivity	n=784 Cross validation result (3*10% out) /24/: Sensitivity 85-98% Specificity 60-69% Chi² >22, p< 0.0001
M-CASE (v. 3.320 1999) Model A62: Induction of Micronuclei	n=238 GeneTox chemicals Cross validation result (3*10% out) /30/: Sensitivity 80 –100% Specificity 50 – 70% Chi² >4, p <0.05
TOPKAT (v. 3.01, 1998) Salmonella (Ames) Mutagenicity	n=1,866 Cross validation result (Q²) /25/: 10 sub-modules with sensitivities and specificities of 75-100%. External evaluation (Danish EPA, 1998, n=118) /21/: 82% correct negative predictions, 76% correct positive predictions
M-CASE (v. 3.320 1999) Model A2H: Salmonella (Ames) Mutagenicity	n=2,034 NTP or GeneTox tests Cross validation result (3*10% out) /27/: Sensitivity 75-78.5% Specificity 78.2 – 90% Chi² >150, p <0.0001
M-CASE (v. 3.320 1999) Model A61: Chromosomal Aberrations	n=233 NTP tests in cultured CHO cells Cross validation result (3*10% out) /28/: Sensitivity 44-80% Specificity 50-80% Chi² < 2, p>0.15 (further validation being undertaken)
M-CASE (v. 3.320 1999) Model A2F: Mutations in Mouse Lymphoma	n=210 NTP thymidine kinase in L5178Y cells Cross validation result (3*10% out) /29/: Sensitivity 64-100% Specificity (not determined) Chi² ≈ 2, p=0.15 (further validation ongoing)

If classification had been proposed on measured data, a positive result in the in vivo micronucleus test would have been sufficient evidence on which to base the classification. Since the data is predicted and not measured data, additional support for the prediction was obtained by including a number of other indicators of genotoxicity.

It is not suggested that positive in vitro evidence should also be necessary when classifying substances with positive in vivo test data. However, it was not felt that the QSAR model for the mouse micronucleus test alone was sufficient, and estimates from additional QSARs relevant to the endpoint were therefore used to increase the likelihood of a correct positive prediction.

Chemicals for which model estimates were positive for mouse micronucleus and structural alerts for DNA reactivity (here an exception was made in that predictions with one unknown fragment were also accepted) and which also had two positive genotoxicity endpoints, passed the criteria for the systematic evaluation.
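The criteria for the systematic evaluation can be sketched as a small decision function. Two readings are assumed here and are not stated explicitly in the text: the "two positive genotoxicity endpoints" are counted among the endpoints other than micronucleus induction and structural alerts, and only robust predictions are counted as positive. The record type and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Prediction:
    positive: bool
    robust: bool                # AOK / within the model's domain
    unknown_fragments: int = 0  # fragments not found in the training set


def passes_mutagenicity_screen(micronucleus: Prediction,
                               structural_alerts: Prediction,
                               other_endpoints: Dict[str, Prediction]) -> bool:
    """Apply the screening criteria described above to one substance."""
    micronucleus_ok = micronucleus.positive and micronucleus.robust
    # Exception from the text: one unknown fragment is tolerated for structural alerts.
    alerts_ok = structural_alerts.positive and (
        structural_alerts.robust or structural_alerts.unknown_fragments == 1)
    # Assumed reading: at least two further positive, robust endpoints.
    supporting = sum(1 for p in other_endpoints.values() if p.positive and p.robust)
    return micronucleus_ok and alerts_ok and supporting >= 2
```

Requiring several independent positive predictions trades sensitivity for confidence: fewer substances pass, but those that do are supported by more than one model.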

Two models for Salmonella (Ames) mutagenicity were used, a TOPKAT and an M-CASE module respectively. This was primarily because the models differed with regard to domain, and often a robust prediction was only available from one model. If robust predictions were available from both models and disagreed, this was taken into account on a case-by-case basis during the final evaluation.

A schematic diagram of the systematic evaluation is given in figure 4.

Figure 4
The systematic evaluation

2,272 Einecs chemicals met the criteria in the systematic evaluation.

As none of these models identifies germ cell mutagenicity, the current QSARs do not allow discrimination between the three EU categories for mutagenic effects, and the lowest classification is therefore assigned as the advisory classification in all cases.

Expert judgment was undertaken to confirm the robustness of the predictions of these 2,272 chemicals. This process included examination of the 2- or 3-d chemical structure, and visual comparison with test data within structural groups. If this procedure raised any doubt, substances were removed from the list for more detailed consideration in the future. This resulted in a final selection of 1,678 substances with an advisory classification mut3;R40.

2.6 Carcinogenicity

EU criteria for classification

This end-point can result in classification in 3 different categories:

Classification as carcinogen in category 1 (carc1;R45, Toxic; may cause cancer or carc1;R49, Toxic; may cause cancer by inhalation) is based on a strong causal relationship in humans.

Classification as carcinogen in category 2 (carc2;R45, Toxic; may cause cancer or carc2;R49, Toxic; may cause cancer by inhalation) is based on conclusive animal data from 2 species or 1 species with supportive evidence such as genotoxic effects in vitro or in vivo.

Classification as carcinogen in category 3 (carc3;R40, Harmful; possible risks of irreversible effects) is subdivided into two:

a.	Well-investigated substances with restricted tumorigenic effects. It is normally based on clear data of tumour formation in one species. Mutagenicity data in vitro and in vivo can be used as supportive evidence.

b.	Substances that are insufficiently investigated, but raising concern for man.

Evaluation based on QSAR models

While there are many non-genotoxic carcinogens acting by a wide variety of often-unknown mechanisms, it was chosen to focus here on chemicals likely to cause cancer through a genotoxic mechanism. Therefore a pre-selection criterion for genotoxicity was set up.

The criterion for the pre-selection for carcinogenicity was a positive estimate for structural alerts for DNA reactivity (AOK or one unknown fragment) and two positive AOK genotoxicity predictions out of five models for genotoxicity. The technical specifications for the models used to predict genotoxicity are given in the chapter "Mutagenicity".

As opposed to the selection criteria for mutagenicity, a positive mouse micronucleus test was not demanded, as not all genotoxic carcinogens are necessarily clastogenic (cause loss, addition or rearrangement of parts of chromosomes). This gave a pre-selection of 3,362 Einecs chemicals.

A total of ten cancer models were available, plus four sub-models.

Table 8
The models used

Model	Technical specifications
TOPKAT (v. 3.01 1998) NTP Carcinogenicity: Male Rat	366 NTP rodent studies Cross validation result (Q²) /32/: Sensitivities of 82-87% Specificities of 82-88%
TOPKAT (v. 3.01 1998) NTP Carcinogenicity: Female Rat
TOPKAT (v. 3.01 1998) NTP^* Carcinogenicity: Male Mouse
TOPKAT (v. 3.01 1998) NTP Carcinogenicity: Female Mouse
TOPKAT (v. 5.01n 1998) FDA Carcinogenicity: Male Rat	n=384 Cross validation result (Q²) /33/: Sensitivity 91% Specificity 90%
Sub-model: Single vs Multiple Organ Tumors	n= 131 Cross validation result (Q²) /33/: Sensitivity 91% Specificity 96%
TOPKAT (v. 5.0 Feb. 1998) FDA^** Carcinogenicity: Female Rat	n=383 Cross validation result (Q²) /33/: Sensitivity 84% Specificity 89%
Sub-model: Single vs Multiple Organ Tumors	n= 125 Cross validation result (Q²) /33/: Sensitivity 92% Specificity 96%
TOPKAT (v. 5.0 Feb. 1998) FDA Carcinogenicity: Male Mouse	n=316 Cross validation result (Q²) /33/: Sensitivity 82% Specificity 90%
Sub-model: Single vs Multiple Organ Tumors	n=93 Cross validation result (Q²) /33/: Sensitivity 93% Specificity 94%
TOPKAT (v. 5.0 Feb. 1998) FDA Carcinogenicity: Female Mouse	n=312 Cross validation result (Q²) /33/: Sensitivity 86% Specificity 89%
Sub-model: Single vs Multiple Organ Tumors	n=100 Cross validation result (Q²) /33/: Sensitivity 95% Specificity 95%
M-CASE (v. 3.320 1999) Carcinogenic Potency Database model: Rat (Danish EPA version of A0D, Feb. 2000)	n=870 chemicals from the CPDB Cross validation result (3*10%) /34/: Sensitivity 52-67% Specificity 63-68% Chi² ≈ 6, p<0.014 (further validation ongoing)
M-CASE (v. 3.320 1999) Carcinogenic Potency Database model: Mouse (Danish EPA version of A0E, Jan. 2000)	n=720 chemicals from the CPDB Cross validation result (3*10%) /35/: Sensitivity 45-50% Specificity 64-72% Chi² ≈ 2, p = 0.15 (further validation ongoing)

* NTP: National Toxicology Program, USA
** FDA: Food and Drug Administration, USA

The accuracy of these models can be difficult to determine, as few independent test results exist that have not already been used in the construction of the models themselves and that could therefore serve for an independent assessment. This is particularly the case for TOPKAT’s models, where the only real estimates consist of the producer’s own "1 out" Q² cross-validations. For M-CASE, other statistical methods are available.
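For orientation, sensitivity and specificity as quoted in Table 8 are, by their standard definitions, the fractions of experimentally positive and experimentally negative chemicals that a model predicts correctly. The sketch below computes both from paired observed and predicted labels; the function name and the example labels are invented for illustration and are not taken from any of the validations cited.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(observed: Sequence[bool],
                            predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for paired observed/predicted labels."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    tp = sum(1 for o, p in zip(observed, predicted) if o and p)          # true positives
    fn = sum(1 for o, p in zip(observed, predicted) if o and not p)      # false negatives
    tn = sum(1 for o, p in zip(observed, predicted) if not o and not p)  # true negatives
    fp = sum(1 for o, p in zip(observed, predicted) if not o and p)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one observed positive and one observed negative")
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: 4 observed positives, 6 observed negatives.
example_observed = [True, True, True, True, False, False, False, False, False, False]
example_predicted = [True, True, True, False, False, False, False, False, True, False]
example_sens, example_spec = sensitivity_specificity(example_observed, example_predicted)
```

In a "3*10% out" cross-validation these figures would be computed only on the chemicals held out of each model build, which is why they are reported as ranges.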

In a long-running project, where several cancer models predicted the outcome for NTP chemicals which had not yet been tested, upon completion of these tests (for 45 substances) the general conclusion was that accuracy of around 70% was achieved for clearly carcinogenic or non-carcinogenic substances /31/. Due to the small number of chemicals in this analysis it is difficult to know how much weight can be assigned to the conclusion.

3,362 substances met the pre-selection criteria for genotoxicity. For a substance to be selected as a probable carcinogen it was necessary for the following criteria to be fulfilled:

At least two positive predictions (sub-models excluded) for carcinogenicity. An exception was made for the M-CASE CPDB models. Because the data are less homogeneous, both rat and mouse predictions had to be positive to count as one prediction, and in addition the carcinogenic potency had to include TD₅₀ values for tumor induction of less than 1,000 mg/kg/day. These two CPDB models were developed by the Danish EPA using M-CASE methodology, which is described for this data set in the following references /34,35,40/.

If one or more positive tests could be seen (part of the training set for the model) for any cancer endpoint, this took precedence over model results and resulted in an overall positive classification recommendation. While in most cases this resulted in little change (the models are heavily biased towards making a correct prediction for substances used to make them), it was felt that there was no reason to artificially reduce the quality of the advisory classification by neglecting to use data that happen to be present. A schematic diagram of the systematic evaluation is given in figure 5.
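The counting rule described above can be sketched as follows for a chemical that has already passed the genotoxicity pre-selection. This is an illustration under stated assumptions rather than the procedure actually used: the function and parameter names are hypothetical, and the TD₅₀ condition is read as "a predicted TD₅₀ below 1,000 mg/kg/day".

```python
from typing import Optional, Sequence


def count_positive_cancer_predictions(main_model_positive: Sequence[bool],
                                      cpdb_rat_positive: bool,
                                      cpdb_mouse_positive: bool,
                                      cpdb_td50_mg_kg_day: Optional[float]) -> int:
    """Count positive predictions; sub-models are not passed in at all."""
    count = sum(1 for positive in main_model_positive if positive)
    # The two M-CASE CPDB models together count as a single prediction,
    # and only when both are positive and the potency criterion is met.
    cpdb_counts = (cpdb_rat_positive and cpdb_mouse_positive
                   and cpdb_td50_mg_kg_day is not None
                   and cpdb_td50_mg_kg_day < 1000.0)
    return count + (1 if cpdb_counts else 0)


def selected_as_probable_carcinogen(main_model_positive: Sequence[bool],
                                    cpdb_rat_positive: bool,
                                    cpdb_mouse_positive: bool,
                                    cpdb_td50_mg_kg_day: Optional[float],
                                    positive_test_in_training_set: bool = False) -> bool:
    # Measured positive results take precedence over model output.
    if positive_test_in_training_set:
        return True
    return count_positive_cancer_predictions(
        main_model_positive, cpdb_rat_positive,
        cpdb_mouse_positive, cpdb_td50_mg_kg_day) >= 2
```

Here `main_model_positive` would hold one entry per main TOPKAT model (the NTP and FDA rat and mouse models), so that a single TOPKAT positive is sufficient only when the combined CPDB prediction also counts.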

Figure 5
The systematic evaluation

According to these criteria, 1,272 substances were selected for advisory classification for carcinogenicity. Expert judgment was then applied to the QSAR predictions. In this process, all data were used, including predictions of the TOPKAT FDA Carcinogenicity sub-models, the probability of rapid metabolism or excretion, and, where appropriate, predictions of aryl hydroxylase activity /37/. Where any doubts were raised, substances were removed from this version of the list to be considered in more detail in the future.

This resulted in 652 substances selected for advisory classifications for carcinogenicity. It is not felt that the models employed allow discrimination between classification in the three categories, so the lower classification Carc3;R40 was applied in all cases.

2.7 Danger to the aquatic environment

EU criteria for classification

The classification criteria are composed of three main elements: Biodegradability, Bioconcentration potential, and Toxicity to aquatic organisms. Classifications are assigned according to the following scheme:

Table 9
Classification criteria

Classification	Criteria for acute toxicity to aquatic organisms^*
N;R50 (Dangerous for the environment; very toxic to aquatic organisms)	Acute toxicity ≤ 1.0 mg/l
N;R50/53 (Dangerous for the environment; very toxic to aquatic organisms; may cause long-term adverse effects in the aquatic environment)	Acute toxicity ≤ 1.0 mg/l and not readily degradable or BCF^** ≥ 100
N;R51/53 (Dangerous for the environment; toxic to aquatic organisms; may cause long-term adverse effects in the aquatic environment)	Acute toxicity ≤ 10 mg/l and not readily degradable or BCF ≥ 100
R52/53 (Harmful to aquatic organisms; may cause long-term adverse effects in the aquatic environment)	Acute toxicity ≤ 100 mg/l and not readily degradable
R53 (May cause long-term adverse effects in the aquatic environment)	Solubility in water < 1 mg/l and not readily degradable and BCF ≥ 100

* The lowest effect concentration for fish, daphnia or algae is used
** BCF: Bioconcentration factor

Evaluation based on QSAR models

Advisory classifications were assigned on the basis of combinations of estimates for biodegradation, bioconcentration and acute toxicity according to the criteria in table 9. Classification with risk phrase R53 alone was not done in this exercise, as the strong co-linearity between water solubility and bioconcentration factor made it redundant.
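The way the three estimates combine can be sketched as a small decision function. It assumes the comparison operators of table 9 read "acute toxicity at or below the threshold" and "BCF of 100 or more", checks the most severe category first, and omits R53 alone as stated above. The function name and signature are illustrative only.

```python
from typing import Optional


def advisory_aquatic_classification(lc50_mg_per_l: float,
                                    readily_degradable: bool,
                                    bcf: float) -> Optional[str]:
    """Return an advisory classification string, or None if no category applies."""
    # Long-term concern: not readily degradable, or bioconcentrating.
    long_term_concern = (not readily_degradable) or bcf >= 100.0
    if lc50_mg_per_l <= 1.0:
        return "N;R50/53" if long_term_concern else "N;R50"
    if lc50_mg_per_l <= 10.0 and long_term_concern:
        return "N;R51/53"
    # For R52/53 only degradability is considered, not BCF.
    if lc50_mg_per_l <= 100.0 and not readily_degradable:
        return "R52/53"
    return None
```

For example, a readily degradable substance with a predicted LC₅₀ of 5 mg/l and a BCF of 150 would receive N;R51/53 under these assumptions, whereas the same substance with a BCF of 10 would receive no classification.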

Biodegradation

Biodegradability was estimated using the Syracuse BIOWIN program (v. 3.02) /17,41/. Only the linear equation for rapid/non-rapid biodegradation was applied. Previous validation of this parameter against MITI "ready/not-ready" results showed that while a number of "not-ready" chemicals were missed, 93% of "not-ready" predictions were correct /18/. In other words, while this model may fail to identify all "not-ready" substances, the number of false predictions for lack of degradability will be acceptably low. A total of about 14,000 Einecs chemicals were found to be "not readily degradable" according to this criterion /51/^****.

Bioconcentration

The classification and labeling guidelines prefer measured data for bioconcentration, but as these are seldom available, a LogP (octanol/water) of greater than three is recommended as an indication that BCF will be 100 or greater, in accordance with the linear equation of Veith and Kosian /41/. While a good rule of thumb, this relation both over- and underestimates BCF for many classes of chemicals, and takes no account of the fact that bioconcentration is a bilinear function of LogP, decreasing when LogP is sufficiently high.

Bioconcentration was therefore predicted using Syracuse BCFWIN (v. 2.13), a method based on a combination of LogP (octanol/water) relations and structural fragment categories. This method was evaluated by its authors as having a statistical accuracy of R² = 0.74 (n = 694, S.D. 0.65, mean error = 0.47), which is a significant improvement over the standard equation of Veith and Kosian (log BCF = 0.85 * log Kow – 0.70), where predictions for the same 694 compounds had a statistical accuracy of R² = 0.32 (S.D. 1.62 and mean error = 1.12) /20/. About 11,000 Einecs chemicals were found with BCF estimates equal to or greater than 100.
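As a worked illustration of the standard linear equation quoted above (plain arithmetic, not a result from the report), the BCF it gives at log Kow = 3 and the log Kow at which it reaches a BCF of 100 are:

```latex
\begin{align*}
\log \mathrm{BCF} &= 0.85 \cdot \log K_{ow} - 0.70 \\
\log K_{ow} = 3: \quad \log \mathrm{BCF} &= 0.85 \cdot 3 - 0.70 = 1.85
  \;\Rightarrow\; \mathrm{BCF} = 10^{1.85} \approx 71 \\
\mathrm{BCF} = 100: \quad \log K_{ow} &= \frac{2 + 0.70}{0.85} \approx 3.18
\end{align*}
```

The LogP cut-off of three is thus a rounded threshold close to, but slightly below, the point at which the linear equation itself predicts a BCF of 100.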

No attempt was made to further assess bioaccumulation potential caused by possible presence in the diets of aquatic organisms, as it was not felt that an appropriate general model was available.

Acute toxicity

For aquatic toxicity classifications, values (L(E)C₅₀) for fish, daphnia and algae are recommended (although seldom available for most existing chemicals). In the current exercise it was decided to only use predictions for fish, due to their robustness and the availability of high quality test data for model construction.

For acute aquatic toxicity to fish, an M-CASE model developed by the Danish EPA using 96h LC₅₀ data on 569 chemicals from the Duluth Fathead minnow Database /22/ was applied. The model had an R² of 0.85. Cross-validation of this model gave a Q² of 0.735 (3*10% out). A description of the M-CASE methodology used for the Fathead minnow data can be found in the following references /21,42/. Only predictions within the optimum prediction space of the model (no fragment or other warnings) were used.

As there was insufficient test data on the Fathead minnow for very lipophilic substances, the M-CASE model was only applied for chemical substances with a LogP of six or less. Another relationship was used for chemicals with a LogP of greater than six. Here, all substances were assumed to act by non-polar narcosis, and toxicity at equilibrium was estimated according to a relation to the predicted bioconcentration factor:

LC₅₀ (equilibrium) = 8.15 mmol /BCF

The choice of 8.15 mmol corresponds to the theoretical level inducing aquatic effects represented by the non-polar narcosis fish QSAR recommended in the EU TGD /41/. Non-polar narcosis lethal body burdens for fish are generally assumed to be within the range of about 2–8 mmol /23,58/.
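To compare an equilibrium estimate of this kind with the mg/l thresholds used for classification, the result has to be converted from molar to mass units. The sketch below assumes the 8.15 mmol refers to a body burden per kg of fish and that BCF is expressed in l/kg, so that the quotient is in mmol/l; multiplying by the molecular weight in g/mol then gives mg/l. The function name and the example values are illustrative only.

```python
NARCOSIS_BODY_BURDEN_MMOL_PER_KG = 8.15  # value from the relation above; per-kg basis assumed


def lc50_equilibrium_mg_per_l(bcf_l_per_kg: float,
                              mol_weight_g_per_mol: float) -> float:
    """Estimate the equilibrium LC50 in mg/l for a non-polar narcotic."""
    if bcf_l_per_kg <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("BCF and molecular weight must be positive")
    lc50_mmol_per_l = NARCOSIS_BODY_BURDEN_MMOL_PER_KG / bcf_l_per_kg
    # mmol/l multiplied by g/mol equals mg/l.
    return lc50_mmol_per_l * mol_weight_g_per_mol


# Invented example: predicted BCF of 10,000 l/kg, molecular weight 300 g/mol.
example_lc50 = lc50_equilibrium_mg_per_l(10_000.0, 300.0)
```

With these invented inputs the estimate is about 0.24 mg/l, which would fall at or below the 1.0 mg/l threshold of table 9.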

While simple LogP (octanol/water) relationships exist for predicting the non-polar narcotic toxicity for fish, daphnia and algae /41/, these do not distinguish specific toxicities unique to any of the three taxa, and were not felt to offer any advantage over using the fish models alone, which also adequately predict non-polar narcosis. For all practical purposes, non-polar narcosis induces effects at the same concentration levels in all three taxa /18/.

Using both estimates, about 10,000 Einecs chemicals were found with toxicities to fish of LC₅₀ ≤ 100 mg/l.

A schematic diagram of the systematic evaluation is given in figure 6.

Figure 6
The systematic evaluation

A total of 8,731 substances were selected according to one of the four classification categories as indicated above. Considering that the number of robust (AOK) predictions for fish toxicity was just fewer than one-half of the chemicals screened, this number seems in reasonable concordance with what would be expected for Existing Chemicals.

The advantages of being able to predict toxic effects specific to fish, daphnia and algae are obvious, and this can hopefully be accomplished in the future. An M-CASE model for acute toxicity to daphnia has recently been completed by the Danish EPA (n = 574, R² = 0.826, 3*10% out Q² = 0.692). It is still being refined, and predictions for all chemicals will soon be available. An M-CASE model for toxicity to algae is under development.

^*	In fragment-based programs the prediction is based on the occurrence of molecular substructures.

^**	Ligand docking models give predictions of how well a chemical substance fits in a certain 3D structure of a macromolecule with a biological function, such as a receptor (for example a hormone receptor).

^***	TOPKAT calculations were also performed for Rat Chronic Lowest Observed Adverse Effect Level. Cross-validated accuracy of this model was similar to that used for acute toxicity, with 95% of predictions being within a factor of 3-5 of the measured values /13,44/. However, as there are no EU classification criteria related specifically to this endpoint (but rather to "serious morphological or toxicological effects after repeated dosing, R48"), no classification suggestions were applied for this endpoint.

^****	While too late for this phase of the project, the Danish EPA has now developed an M-CASE model for ready biodegradation based on new MITI data, which appears to offer significantly better predictions: 81% correct "not ready" and 76% correct "ready" predictions (3*10% out). An external validation using 72 "not ready" chemicals that had not been used to produce the model gave 89% correct predictions. Analysis and fine-tuning of this model is continuing /19/.