ClinicalTrials.gov helpfully provides a facility for downloading its data as machine-readable XML files. Here’s an example of a zipped file of 10 ClinicalTrials.gov XML files.
Unfortunately, a big zipped folder of XML files is not that helpful. Even after parsing a whole bunch of trials into a single data frame in R, a few fields are still written in the least useful format ever: several key-value pairs crammed into a single string. For example, the <study_design> field usually looks something like this:
Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment
So, I wrote a little R script to help us all out. Do a search on ClinicalTrials.gov, download the results, and save the unzipped XML files in a new directory called search_result/ in your ~/Downloads/ folder. The following script parses each XML file in that directory, adds each trial as a row of a new data frame called “trials”, and then explodes the <study_design> field into individual columns.
For the example field above, it would create new columns called “Allocation”, “Endpoint_Classification”, “Intervention_Model”, “Masking”, and “Primary_Purpose”, each populated with the corresponding value.
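To make the "explode" step concrete, here is a minimal sketch that splits a single <study_design> string into a named character vector. The helper name explode_design is hypothetical (it is not part of the script), and the parenthesised "Masking" value in the second call is an illustrative assumption chosen to show why the split has to ignore commas inside parentheses.

```r
# Example <study_design> value, copied from the text above
study_design <- paste0(
  "Allocation: Non-Randomized, Endpoint Classification: Safety Study, ",
  "Intervention Model: Single Group Assignment, Masking: Open Label, ",
  "Primary Purpose: Treatment"
)

# Hypothetical helper for illustration; not part of the original script
explode_design <- function(x) {
  # Split on commas that are NOT inside parentheses
  pairs <- strsplit(x, ", *(?![^()]*\\))", perl = TRUE)[[1]]
  # Position of the first colon in each "key: value" pair
  colon <- regexpr(":", pairs, fixed = TRUE)
  keys <- substr(pairs, 1, colon - 1)
  values <- trimws(substr(pairs, colon + 1, nchar(pairs)))
  # Replace spaces and punctuation so the keys work as column names
  names(values) <- gsub("[[:punct:]]|\\s+", "_", keys)
  values
}

design <- explode_design(study_design)
print(design)

# Illustrative value (an assumption, not from the text): the comma inside
# the parentheses must not start a new pair
masked <- explode_design(
  "Masking: Double Blind (Subject, Investigator), Primary Purpose: Treatment"
)
print(masked)
```

Each name in the resulting vector corresponds to one of the new columns, and each element to the value stored in that column for the trial.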
    library("XML")
    library("plyr")

    # Change path as necessary
    path <- "~/Downloads/search_result/"
    setwd(path)

    # pattern is a regular expression: escape the dot and anchor the extension
    xml_file_names <- dir(path, pattern = "\\.xml$")

    counter <- 1

    # Makes data frame by looping through every XML file in the specified directory
    for (xml_file_name in xml_file_names) {

      xmlfile <- xmlTreeParse(xml_file_name)
      xmltop <- xmlRoot(xmlfile)
      data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))

      if (counter == 1) {
        trials <- data.frame(t(data), row.names = NULL)
      } else {
        newrow <- data.frame(t(data), row.names = NULL)
        # rbind.fill (plyr) pads columns missing from either side with NA
        trials <- rbind.fill(trials, newrow)
      }

      # This will be good for very large sets of XML files
      print(paste0(
        xml_file_name, " processed (",
        format(100 * counter / length(xml_file_names), digits = 2),
        "% complete)"
      ))

      counter <- counter + 1
    }

    # Data frame has been constructed. Comment out the following two loops
    # (until the "un-cluttering" part) in the case that you are not interested
    # in exploding the <study_design> column.

    columns <- vector()

    # First pass: collect every key that appears in any <study_design> value
    for (stu_des in trials$study_design) {
      # splits by commas NOT in parentheses
      for (pair in strsplit(stu_des, ", *(?![^()]*\\))", perl = TRUE)) {
        newcol <- substr(pair, 1, regexpr(':', pair) - 1)
        columns <- c(columns, newcol)
      }
    }

    # Second pass: one new column per key, filled in row by row
    for (newcol in unique(columns)) {

      # get rid of spaces and special characters
      newcol <- gsub('([[:punct:]])|\\s+', '_', newcol)

      if (newcol != "") {

        # add the new column
        trials[, newcol] <- NA

        i <- 1

        for (stu_des2 in trials$study_design) {
          for (pairs in strsplit(stu_des2, ", *(?![^()]*\\))", perl = TRUE)) {
            for (pair in pairs) {
              key <- gsub('([[:punct:]])|\\s+', '_',
                          substr(pair, 1, regexpr(':', pair) - 1))
              if (key == newcol) {
                # index by name rather than assuming the new column is last
                trials[i, newcol] <- substr(pair, regexpr(':', pair) + 2, 100000)
              }
            }
          }
          i <- i + 1
        }
      }
    }

    # Un-clutter the working environment
    # (newrow only exists if more than one file was processed)
    remove(i, counter, data, key, newcol, newrow, columns, pair, pairs,
           stu_des, stu_des2, xml_file_name, xml_file_names, xmlfile, xmltop)

    # Get nice NCT IDs
    get_nct_id <- function(row_id_info) {
      return(unlist(row_id_info)["nct_id"])
    }

    trials$nct_id <- lapply(trials$id_info, function(x) get_nct_id(x))

    # Clean up enrolment field
    trials$enrollment[trials$enrollment == "NULL"] <- NA
    trials$enrollment <- as.numeric(trials$enrollment)
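Because lapply always returns a list, the nct_id column produced by the script is a list column. If a plain character vector is more convenient, one option is vapply. The sketch below is a suggestion, not part of the original script, and it uses made-up values (id_info, enrollment_raw, and the placeholder IDs are all assumptions) standing in for trials$id_info and trials$enrollment.

```r
# Made-up stand-in for trials$id_info (one list of identifiers per trial);
# the IDs below are placeholders, not real registrations
id_info <- list(
  list(org_study_id = "ABC-001", nct_id = "NCT00000001"),
  list(org_study_id = "XYZ-42", secondary_id = "S-9", nct_id = "NCT00000002")
)

# vapply checks that every trial yields exactly one string
nct_id <- vapply(
  id_info,
  function(x) unname(unlist(x)["nct_id"]),
  character(1)
)
print(nct_id)

# Made-up stand-in for trials$enrollment, with a missing value stored as NULL
enrollment_raw <- list("30", NULL)

# Convert each entry to a number, mapping NULL to NA
enrollment <- vapply(
  enrollment_raw,
  function(x) if (is.null(x)) NA_real_ else as.numeric(x),
  numeric(1)
)
print(enrollment)
```

The same two calls could be pointed at trials$id_info and trials$enrollment, assuming those columns have the list structure shown here.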
Useful references:
- https://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/
- http://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
- http://stackoverflow.com/questions/21105360/regex-find-comma-not-inside-quotes