A recent XKCD cartoon compared letter frequencies in car names with those in general English text. A similar analysis of drug names, inspired by this cartoon, has been performed.
The FDA database of approved drugs was downloaded on September 1, 2015, and the Product.txt file was parsed using R version 3.2.2.
Chemical names and suffixes were removed (e.g. cypionate, methylbromide, monosulfate, HCL, trisilicate, iodide, chlorpheniramine, estradiol, methylsulfate, calcium, malate, bicarbonate, ethanolate, ethnolate, tartrate, dimesylate, hydrobromide, disodium, mesylate, fumarate, succinate, peroxide, potassium, hydrochloride, sulfate, sodium, trihydrate, stearate, acetate, phosphate, maleate) as well as other non-drug-name modifiers (e.g. “preservative free,” “allergy relief,” plus, injection, preserved, tablets, kit, etc.).
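The cleaning step could look something like the sketch below. It is written in Python for illustration, not the R used for the original analysis. The `clean_drug_name` function, the abbreviated `MODIFIERS` set, and the input strings are all hypothetical, and the exact removal rules applied to Product.txt are not given in this post.

```python
import re

# Abbreviated, illustrative subset of the modifiers listed above (assumption:
# the original analysis used the full list and possibly different matching).
MODIFIERS = {
    "hydrochloride", "hcl", "sodium", "potassium", "sulfate", "acetate",
    "phosphate", "maleate", "injection", "tablets", "kit", "plus",
}

def clean_drug_name(raw_name, modifiers=MODIFIERS):
    """Return the name with modifier words removed, upper-cased."""
    # Split on anything that is not a letter, so punctuation and digits vanish.
    words = re.findall(r"[A-Za-z]+", raw_name)
    kept = [w for w in words if w.lower() not in modifiers]
    return " ".join(kept).upper()

# Made-up input strings, for illustration only.
print(clean_drug_name("Naproxen Sodium Tablets"))     # NAPROXEN
print(clean_drug_name("Morphine Sulfate Injection"))  # MORPHINE
```

Matching whole words rather than substrings avoids trimming a drug name that merely contains one of the modifier strings.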
English letter frequencies were taken from the Department of Math at Cornell.
The natural log of the ratio of a letter’s frequency in drug names to that letter’s frequency in English text is plotted below.
The letters most over-represented in drug names are Z, X, Q, P, C and L.
Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
---|---|---|---|
Z | 0.07 | 1.10 | 2.75 |
X | 0.17 | 1.51 | 2.19 |
Q | 0.11 | 0.262 | 0.867 |
P | 1.82 | 3.65 | 0.697 |
C | 2.71 | 4.23 | 0.445 |
L | 3.98 | 6.19 | 0.443 |
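As a rough sketch of how the last column is obtained, the Python below counts letter percentages in a list of cleaned names and takes the natural log of the ratio to the English percentage. The function names are hypothetical. The check at the end uses the Z row of the table above.

```python
import math
from collections import Counter

def letter_frequencies(names):
    """Percentage of each letter A-Z among all letters in the given names."""
    counts = Counter(
        ch for name in names for ch in name.upper() if "A" <= ch <= "Z"
    )
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.items()}

def log_relative_frequency(drug_pct, english_pct):
    """Natural log of (drug-name frequency / English frequency)."""
    return math.log(drug_pct / english_pct)

# Z row of the table: 1.10% in drug names vs 0.07% in English text.
print(round(log_relative_frequency(1.10, 0.07), 2))  # 2.75
```

A positive value means the letter is more common in drug names than in English text, and a negative value means it is less common.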
The letters most under-represented in drug names are W, H, G, S, F and K. This may be partly because the letters H, J, K and W “do not exist in some of the 130 countries that use U.S. generic names, or have different sounds in various languages.”[1]
Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
---|---|---|---|
W | 2.09 | 0.189 | -2.40 |
H | 5.92 | 2.04 | -1.06 |
G | 2.03 | 1.19 | -0.534 |
S | 6.28 | 4.02 | -0.446 |
F | 2.30 | 1.52 | -0.416 |
K | 0.69 | 0.457 | -0.413 |
The frequencies of vowels in drug names vs general English text are fairly close. The most “drug-like” vowel is I (0.273) and the least “drug-like” vowel is U (-0.213).
Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
---|---|---|---|
I | 7.31 | 9.61 | 0.273 |
A | 8.12 | 9.18 | 0.123 |
O | 7.68 | 8.07 | 0.050 |
E | 12.0 | 10.7 | -0.119 |
U | 2.88 | 2.33 | -0.213 |
If we take ten times the natural log of the ratio of a letter’s drug-name frequency to its English-text frequency to be that letter’s “score,” we can score any given word for how “drug-like” it is. A randomly selected sample of 10 cancer drugs that were approved or reached phase 3 testing between 2005 and 2009 was scored.
Drugs are ordered by score, from highest (most “drug-like”) to lowest (least “drug-like”).
Drug name | Score |
---|---|
Arzoxifene | 5.027914 |
Ixabepilone | 3.508626 |
Lapatinib | 1.843902 |
Nilotinib | 1.262482 |
Imatinib | 1.261169 |
Romidepsin | 1.175312 |
Prinomastat | 0.731445 |
Marimastat | 0.4234099 |
Sunitinib | -0.01687164 |
Sorafenib | -0.3061702 |
A set of 10 randomly selected English words was prepared and scored using the same method.
English words are ordered by score, from highest (most “drug-like”) to lowest (least “drug-like”).
English word | Score |
---|---|
Material | 1.322378 |
Happen | 0.7109685 |
Locket | 0.1319142 |
Motionless | -0.08807015 |
Helpful | -0.3304302 |
Damage | -0.4330361 |
Change | -1.758826 |
Trust | -2.213186 |
Sandwich | -3.959021 |
Though | -5.253664 |
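The post does not spell out how letter scores are combined into a word score. The sketch below assumes the word score is the mean of its letters' scores, which appears consistent with the published values. It is a Python illustration with a hypothetical `word_score` function, and it only includes letters whose log ratios appear in the tables above. "Helpful" happens to use only such letters.

```python
# Log relative frequencies copied from the tables above (rounded values).
LETTER_LOG_RATIOS = {
    "H": -1.06, "E": -0.119, "L": 0.443, "P": 0.697, "F": -0.416, "U": -0.213,
}

def word_score(word, log_ratios):
    """Mean letter score, where a letter's score is 10 * its log ratio.

    Averaging over letters is an assumption, not stated in the post.
    """
    letters = [ch for ch in word.upper() if ch.isalpha()]
    return sum(10 * log_ratios[ch] for ch in letters) / len(letters)

print(round(word_score("Helpful", LETTER_LOG_RATIOS), 2))  # -0.32
```

The result is close to the -0.33 listed for Helpful. The small gap is expected because the table values used here are rounded to three significant figures.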
Scores for a sample of 10 FDA-approved drugs, ordered from highest to lowest:
Drug name | Score |
---|---|
Ixabepilone | 3.508626 |
Lapatinib | 1.843902 |
Nilotinib | 1.262482 |
Romidepsin | 1.175312 |
Nelarabine | 1.077041 |
Decitabine | 0.4133057 |
Temsirolimus | 0.3078753 |
Sunitinib | -0.01687164 |
Dasatinib | -0.1284099 |
Sorafenib | -0.3061702 |
Scores for a sample of 10 drugs that were not FDA-approved, ordered from highest to lowest:
Drug name | Score |
---|---|
Arzoxifene | 5.027914 |
Semaxanib | 2.812812 |
Nolatrexed | 2.34507 |
Irofulven | 0.7413801 |
Prinomastat | 0.731445 |
Rubitecan | 0.4906218 |
Marimastat | 0.4234099 |
Rebimistat | -0.04639386 |
Atamestane | -0.5360683 |
Affinitak | -0.759761 |
The mean score for FDA-approved drugs is 0.914 (95% interval, computed as mean ± 1.96 SD: -1.337 to 3.165) and the mean for non-FDA-approved drugs is 1.123 (95% interval, computed as mean ± 1.96 SD: -2.362 to 4.608). The difference between these two samples is not statistically significant (p = 0.8).
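These summary figures can be reproduced from the two tables with a short Python sketch. The reported intervals appear to match the mean plus or minus 1.96 sample standard deviations, so that is what `summarize` computes. The post does not say which test produced the p-value, so the Welch t statistic here is an assumption and both function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

approved = [3.508626, 1.843902, 1.262482, 1.175312, 1.077041,
            0.4133057, 0.3078753, -0.01687164, -0.1284099, -0.3061702]
not_approved = [5.027914, 2.812812, 2.34507, 0.7413801, 0.731445,
                0.4906218, 0.4234099, -0.04639386, -0.5360683, -0.759761]

def summarize(scores):
    """Return (mean, lower, upper) with bounds at mean +/- 1.96 sample SD."""
    m, s = mean(scores), stdev(scores)
    return m, m - 1.96 * s, m + 1.96 * s

def welch_t(a, b):
    """Welch's t statistic for the difference in means (b minus a)."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(b) - mean(a)) / se

for label, scores in [("approved", approved), ("not approved", not_approved)]:
    m, lo, hi = summarize(scores)
    print(f"{label}: mean {m:.3f}, interval {lo:.3f} to {hi:.3f}")

print(f"Welch t = {welch_t(approved, not_approved):.2f}")
```

A t statistic this far below 2 is in line with the conclusion that the two samples do not differ significantly.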
As in the XKCD cartoon, we have prepared a table of potential drug names that score as very drug-like. Climaxalone is the lowest scored, and also the saddest.
Drug name | Score |
---|---|
Pizzazzanib | 11.40958 |
Marxism | 4.308394 |
Climaxanib | 4.257218 |
Climaxalone | 3.997681 |
[1] Ipaktchian, Susan. The name game. Stanford Medicine Magazine. June 7, 2005. Retrieved from: http://www.igorinternational.com/press/stanford-trade-names-generic-drug-names.php