How to get R to parse the <study_design> field from clinicaltrials.gov XML files

Clinicaltrials.gov helpfully provides a facility for downloading machine-readable XML files of its data. Here’s an example of a zipped file of 10 clinicaltrials.gov XML files.

Unfortunately, a big zipped folder of XML files is not that helpful. Even after parsing a whole bunch of trials into a single data frame in R, there are a few fields that are written in the least useful format ever. For example, the <study_design> field usually looks something like this:

Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment

So, I wrote a little R script to help us all out. Do a search on clinicaltrials.gov, then save the unzipped search result in a new directory called search_result/ in your ~/Downloads/ folder. The following script will parse each XML file in that directory, add each trial as a row of a new data frame called “trials”, and then explode the <study_design> field into individual columns.

For instance, given the example field above, it would create new columns called “Allocation”, “Endpoint_Classification”, “Intervention_Model”, “Masking”, and “Primary_Purpose”, populated with the corresponding data.
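To see what the "explode" step does to a single value, here is a minimal sketch in base R that splits one study design string into name/value pairs. The helper name parse_study_design is made up for illustration and is not part of the script below. The second input string, with a comma inside parentheses, is an assumed example of why the split has to skip commas inside parentheses.

```r
# Hypothetical helper (not part of the script below): turn one
# <study_design> string into a named character vector.
parse_study_design <- function(stu_des) {
  # Split on commas that are NOT inside parentheses
  pairs <- strsplit(stu_des, ", *(?![^()]*\\))", perl = TRUE)[[1]]
  # Keep only pieces that contain a "name: value" separator
  pairs <- pairs[grepl(":", pairs, fixed = TRUE)]
  colon <- regexpr(":", pairs, fixed = TRUE)
  keys <- substr(pairs, 1, colon - 1)
  values <- trimws(substr(pairs, colon + 1, nchar(pairs)))
  # Same clean-up as the script: punctuation and whitespace become "_"
  keys <- gsub("([[:punct:]])|\\s+", "_", keys)
  setNames(values, keys)
}

example <- paste0(
  "Allocation: Non-Randomized, Endpoint Classification: Safety Study, ",
  "Intervention Model: Single Group Assignment, Masking: Open Label, ",
  "Primary Purpose: Treatment"
)
parsed <- parse_study_design(example)
print(parsed)

# Assumed example: the comma inside the parentheses is left alone
masked <- parse_study_design(
  "Allocation: Randomized, Masking: Double Blind (Subject, Investigator)"
)
print(masked[["Masking"]])
```

The full script does the same thing across every row of the data frame, adding one column per distinct field name it finds.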

# library() stops with an error if a package is missing;
# require() would only warn and carry on
library(XML)
library(plyr)

# Change path as necessary
path = "~/Downloads/search_result/"

setwd(path)
# The pattern is a regular expression: escape the dot and anchor the end
xml_file_names <- dir(path, pattern = "\\.xml$")

counter <- 1

# Makes data frame by looping through every XML file in the specified directory
for ( xml_file_name in xml_file_names ) {
  
  xmlfile <- xmlTreeParse(xml_file_name)
  
  xmltop <- xmlRoot(xmlfile)
  
  data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  
  if ( counter == 1 ) {
    
    trials <- data.frame(t(data), row.names = NULL)
    
  } else {
    
    newrow <- data.frame(t(data), row.names = NULL)
    trials <- rbind.fill (trials, newrow)
    
  }
  
  # Print progress; handy for very large sets of XML files
  
  print (
    paste0(
      xml_file_name,
      " processed (",
      format(100 * counter / length(xml_file_names), digits = 2),
      "% complete)"
    )
  )
  
  counter <- counter + 1
  
}

# Data frame has been constructed. Comment out the following two loops
# (until the "un-cluttering" part) in the case that you are not interested
# in exploding the <study_design> column.

columns <- character()

for ( stu_des in trials$study_design ) {
  # splits by commas NOT in parentheses
  for (pair in strsplit( stu_des, ", *(?![^()]*\\))", perl=TRUE)) {
    newcol <- substr( pair, 0, regexpr(':', pair) - 1 )
    columns <- c(columns, newcol)
  }
}

for ( newcol in unique(columns) ) {
  
  # get rid of spaces and special characters
  newcol <- gsub('([[:punct:]])|\\s+','_', newcol)
  
  if (newcol != "") {
    
    # add the new column
    trials[,newcol] <- NA
    
    i <- 1
    
    for ( stu_des2 in trials$study_design ) {
      
      for (pairs in strsplit( stu_des2, ", *(?![^()]*\\))", perl=TRUE)) {
        
        for (pair in pairs) {
          
          if ( gsub('([[:punct:]])|\\s+','_', substr( pair, 0, regexpr(':', pair) - 1 )) == newcol ) {
            
            trials[i, newcol] <- substr( pair, regexpr(':', pair) + 2, nchar(pair) )
            
          }
          
        }
        
      }
      
      i <- i+1
      
    }
    
  }
  
}

# Un-clutter the working environment

rm(
  i, counter, data, newcol, newrow, columns, pair, pairs,
  stu_des, stu_des2, xml_file_name, xml_file_names, xmlfile, xmltop
)

# Get nice NCT IDs

get_nct_id <- function ( row_id_info ) {
  
  return (unlist (row_id_info) ["nct_id"])
  
}

# sapply() simplifies to a plain character vector when every trial has an NCT ID
trials$nct_id <- unname(sapply(trials$id_info, get_nct_id))

# Clean up enrolment field

trials$enrollment[trials$enrollment == "NULL"] <- NA

trials$enrollment <- as.numeric(trials$enrollment)
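The nested loops in the script rescan every study design value once per new column. If that gets slow on a large download, one possible alternative is to parse each value once and then fill the columns from the parsed results. The sketch below shows the idea on a toy data frame; it is not the method the script uses. The data frame toy, its NCT IDs and the helper split_design are all made up for illustration.

```r
# Toy stand-in for the "trials" data frame (values are made up)
toy <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002"),
  study_design = c(
    "Allocation: Randomized, Masking: Double Blind (Subject, Investigator)",
    "Allocation: Non-Randomized, Primary Purpose: Treatment"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one study_design string -> named character vector
split_design <- function(stu_des) {
  # Split on commas that are NOT inside parentheses
  pairs <- strsplit(stu_des, ", *(?![^()]*\\))", perl = TRUE)[[1]]
  pairs <- pairs[grepl(":", pairs, fixed = TRUE)]
  colon <- regexpr(":", pairs, fixed = TRUE)
  keys <- gsub("([[:punct:]])|\\s+", "_", substr(pairs, 1, colon - 1))
  setNames(trimws(substr(pairs, colon + 1, nchar(pairs))), keys)
}

# Parse every row once, then collect the distinct field names
parsed <- lapply(toy$study_design, split_design)
new_cols <- unique(unlist(lapply(parsed, names)))

# Fill each new column in one pass; trials lacking a field get NA
for (col in new_cols) {
  toy[[col]] <- vapply(
    parsed,
    function(p) if (col %in% names(p)) p[[col]] else NA_character_,
    character(1)
  )
}

print(toy[, c("nct_id", new_cols)])
```

With the real data, toy$study_design would be replaced by trials$study_design, which may first need converting to a character vector if it is stored as a list column.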

Useful references:

  • https://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/
  • http://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
  • http://stackoverflow.com/questions/21105360/regex-find-comma-not-inside-quotes

Published by

The Grey Literature

This is the personal blog of Benjamin Gregory Carlisle PhD. Queer; Academic; Queer academic. "I'm the research fairy, here to make your academic problems disappear!"
