Introduction

A recent XKCD cartoon compared letter frequencies in car names with those in general English text. Inspired by that cartoon, this analysis does the same for drug names.

Methods

The FDA database of approved drugs was downloaded on September 1, 2015, and the Product.txt file was parsed using R version 3.2.2.

Chemical names and suffixes were removed (e.g. cypionate, methylbromide, monosulfate, HCL, trisilicate, iodide, chlorpheniramine, estradiol, methylsulfate, calcium, malate, bicarbonate, ethanolate, ethnolate, tartrate, dimesylate, hydrobromide, disodium, mesylate, fumarate, succinate, peroxide, potassium, hydrochloride, sulfate, sodium, trihydrate, stearate, acetate, phosphate, maleate) as well as other non-drug-name modifiers (e.g. “preservative free,” “allergy relief,” plus, injection, preserved, tablets, kit, etc.).
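The cleaning step might look roughly like the sketch below. The original analysis was done in R; this Python version is only illustrative. The function name `clean_name`, the shortened term list, whole-word case-insensitive matching, and the sample product name are all assumptions rather than the author's code.

```python
import re

# Abbreviated list of terms to strip, taken from the examples in the text.
# Assumption: terms are matched case-insensitively and as whole words only.
STRIP_TERMS = [
    "preservative free", "allergy relief",
    "hydrochloride", "hydrobromide", "sulfate", "sodium", "potassium",
    "acetate", "phosphate", "maleate", "mesylate", "hcl",
    "plus", "injection", "preserved", "tablets", "kit",
]

# Longest terms first so multi-word phrases are tried before shorter ones.
_ALTERNATION = "|".join(
    re.escape(term) for term in sorted(STRIP_TERMS, key=len, reverse=True)
)
_PATTERN = re.compile(r"\b(?:" + _ALTERNATION + r")\b", flags=re.IGNORECASE)


def clean_name(raw: str) -> str:
    """Remove listed modifiers, collapse whitespace, and upper-case the rest."""
    stripped = _PATTERN.sub(" ", raw)
    return " ".join(stripped.split()).upper()


# "Examplamab" is a made-up product name used only for illustration.
print(clean_name("Examplamab Sodium Injection"))  # EXAMPLAMAB
```

Whole-word matching keeps a term such as "sulfate" from clipping the end of a longer word like "methylsulfate", which would need its own entry in the list.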

English letter frequencies were taken from the Department of Math at Cornell.
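Once the names are cleaned, the drug-name letter frequencies can be tallied as in the sketch below. It is written in Python for illustration (the original used R), the function name `letter_frequencies` is hypothetical, and it assumes every occurrence of a letter counts once, with non-letters ignored. The source does not say whether duplicate product names were collapsed first.

```python
from collections import Counter


def letter_frequencies(names):
    """Return {letter: percent of all A-Z letters} across the given names."""
    counts = Counter(
        ch for name in names for ch in name.upper() if "A" <= ch <= "Z"
    )
    total = sum(counts.values())
    return {letter: 100.0 * n / total for letter, n in counts.items()}


# Two names that appear later in this post, used as a toy input.
sample = ["IMATINIB", "NILOTINIB"]
freqs = letter_frequencies(sample)
print(round(freqs["I"], 1))  # 6 of the 17 letters are I
```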

Results

The natural log of the ratio of each letter’s frequency in drug names to its frequency in English text is plotted below.

The letters most over-represented in drug names, relative to English text, are Z, X, Q, P, C and L.

| Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
|---|---|---|---|
| Z | 0.07 | 1.10 | 2.75 |
| X | 0.17 | 1.51 | 2.19 |
| Q | 0.11 | 0.262 | 0.867 |
| P | 1.82 | 3.65 | 0.697 |
| C | 2.71 | 4.23 | 0.445 |
| L | 3.98 | 6.19 | 0.443 |
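The last column can be reproduced from the two frequency columns. The tabulated values are consistent with a natural logarithm: for Z, ln(1.10 / 0.07) is about 2.75. The short Python sketch below recomputes the column from the rounded percentages in the table, so the last digit may differ slightly from the published figures. The function name is hypothetical.

```python
import math

# (English %, drug-name %) pairs copied from the table above.
table = {
    "Z": (0.07, 1.10),
    "X": (0.17, 1.51),
    "Q": (0.11, 0.262),
    "P": (1.82, 3.65),
    "C": (2.71, 4.23),
    "L": (3.98, 6.19),
}


def log_relative_frequency(english_pct, drug_pct):
    """Natural log of drug-name frequency divided by English frequency."""
    return math.log(drug_pct / english_pct)


for letter, (english, drug) in table.items():
    print(letter, round(log_relative_frequency(english, drug), 2))
```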

The letters most under-represented in drug names, relative to English text, are W, H, G, S, F and K. This may be because the letters H, J, K and W “do not exist in some of the 130 countries that use U.S. generic names, or have different sounds in various languages” [1].

| Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
|---|---|---|---|
| W | 2.09 | 0.189 | -2.40 |
| H | 5.92 | 2.04 | -1.06 |
| G | 2.03 | 1.19 | -0.534 |
| S | 6.28 | 4.02 | -0.446 |
| F | 2.30 | 1.52 | -0.416 |
| K | 0.69 | 0.457 | -0.413 |

Vowel frequencies in drug names are fairly close to those in general English text. The most “drug-like” vowel is I (log relative frequency 0.273) and the least “drug-like” vowel is U (-0.213).

| Letter | English frequency (%) | Drug name frequency (%) | Log of relative frequency |
|---|---|---|---|
| I | 7.31 | 9.61 | 0.273 |
| A | 8.12 | 9.18 | 0.123 |
| O | 7.68 | 8.07 | 0.050 |
| E | 12.0 | 10.7 | -0.119 |
| U | 2.88 | 2.33 | -0.213 |

Discussion

Scoring selected cancer drug names

If we take ten times the logarithm of the ratio of a letter’s drug-name frequency to its English frequency as that letter’s “score,” we can score any word for how “drug-like” it is. A random sample of 10 cancer drugs that were approved or reached phase 3 testing between 2005 and 2009 was scored.
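The text does not say how letter scores are combined into a word score. The Python sketch below assumes they are averaged over the word's letters; this is an inference, not the author's stated method. Under that assumption, “Helpful” comes out close to the -0.33 reported for it in this post, with the small gap attributable to the rounded table values. Only the 17 letters tabulated in the Results section are included, so words containing other letters are rejected rather than guessed at. The function name `drug_likeness` is hypothetical.

```python
# Log relative frequencies for the 17 letters tabulated in the Results section.
LOG_RATIO = {
    "Z": 2.75, "X": 2.19, "Q": 0.867, "P": 0.697, "C": 0.445, "L": 0.443,
    "W": -2.40, "H": -1.06, "G": -0.534, "S": -0.446, "F": -0.416, "K": -0.413,
    "I": 0.273, "A": 0.123, "O": 0.050, "E": -0.119, "U": -0.213,
}


def drug_likeness(word):
    """Assumed rule: 10 x the mean log ratio over the word's letters."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("word contains no letters")
    missing = sorted({ch for ch in letters if ch not in LOG_RATIO})
    if missing:
        # The post does not publish ratios for these letters.
        raise KeyError("no tabulated log ratio for: " + ", ".join(missing))
    return 10 * sum(LOG_RATIO[ch] for ch in letters) / len(letters)


# "Helpful" uses only tabulated letters, so it can be scored from the tables.
print(round(drug_likeness("Helpful"), 2))
```

Averaging rather than summing keeps long and short names on the same scale, which fits the scores reported for names of different lengths.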

Drugs are ordered by score, from highest (most “drug-like”) to lowest (least “drug-like”).

| Drug name | Score |
|---|---|
| Arzoxifene | 5.027914 |
| Ixabepilone | 3.508626 |
| Lapatinib | 1.843902 |
| Nilotinib | 1.262482 |
| Imatinib | 1.261169 |
| Romidepsin | 1.175312 |
| Prinomastat | 0.731445 |
| Marimastat | 0.4234099 |
| Sunitinib | -0.01687164 |
| Sorafenib | -0.3061702 |

A set of 10 randomly selected English words was prepared and scored using the same method.

English words are ordered by score, from highest (most “drug-like”) to lowest (least “drug-like”).

| English word | Score |
|---|---|
| Material | 1.322378 |
| Happen | 0.7109685 |
| Locket | 0.1319142 |
| Motionless | -0.08807015 |
| Helpful | -0.3304302 |
| Damage | -0.4330361 |
| Change | -1.758826 |
| Trust | -2.213186 |
| Sandwich | -3.959021 |
| Though | -5.253664 |

Approved vs non-approved drugs

Cancer drugs approved by the FDA between 2005 and 2009

| Drug name | Score |
|---|---|
| Ixabepilone | 3.508626 |
| Lapatinib | 1.843902 |
| Nilotinib | 1.262482 |
| Romidepsin | 1.175312 |
| Nelarabine | 1.077041 |
| Decitabine | 0.4133057 |
| Temsirolimus | 0.3078753 |
| Sunitinib | -0.01687164 |
| Dasatinib | -0.1284099 |
| Sorafenib | -0.3061702 |

Cancer drugs that reached phase 3 testing between 2005 and 2009 but were never approved by the FDA

| Drug name | Score |
|---|---|
| Arzoxifene | 5.027914 |
| Semaxanib | 2.812812 |
| Nolatrexed | 2.34507 |
| Irofulven | 0.7413801 |
| Prinomastat | 0.731445 |
| Rubitecan | 0.4906218 |
| Marimastat | 0.4234099 |
| Rebimistat | -0.04639386 |
| Atamestane | -0.5360683 |
| Affinitak | -0.759761 |

The mean score for FDA-approved drugs is 0.914 (95% CI -1.337 to 3.165), and the mean for drugs that were not approved is 1.123 (95% CI -2.362 to 4.608). The difference between the two samples is not statistically significant (p = 0.8).
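The summary figures can be rechecked from the two score tables with the Python sketch below. Two caveats apply. First, the reported intervals match the mean plus or minus 1.96 sample standard deviations, which describes the spread of individual scores rather than the uncertainty of the mean. Second, the source does not name the significance test, so Welch's t statistic is computed here purely as an assumption, and no p-value is derived.

```python
import math
import statistics

# Scores copied from the two tables above.
approved = [3.508626, 1.843902, 1.262482, 1.175312, 1.077041,
            0.4133057, 0.3078753, -0.01687164, -0.1284099, -0.3061702]
not_approved = [5.027914, 2.812812, 2.34507, 0.7413801, 0.731445,
                0.4906218, 0.4234099, -0.04639386, -0.5360683, -0.759761]


def summarize(scores):
    """Return (mean, lower, upper) with bounds at mean +/- 1.96 sample SDs."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean, mean - 1.96 * sd, mean + 1.96 * sd


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / se


print([round(x, 3) for x in summarize(approved)])
print([round(x, 3) for x in summarize(not_approved)])
print(round(welch_t(approved, not_approved), 2))
```

A t statistic this close to zero on samples of ten is in line with the reported lack of a significant difference.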

Just for fun

As in the XKCD cartoon, we have prepared a table of potential drug names that are very drug-like. Climaxalone has the lowest score, and is also the saddest.

Good potential cancer drug names based on letter frequencies

| Drug name | Score |
|---|---|
| Pizzazzanib | 11.40958 |
| Marxism | 4.308394 |
| Climaxanib | 4.257218 |
| Climaxalone | 3.997681 |

  1. Ipaktchian, Susan. The name game, Stanford Medicine Magazine. June 7, 2005. Retrieved from: http://www.igorinternational.com/press/stanford-trade-names-generic-drug-names.php