How to get R to parse the <study_design> field from XML files

ClinicalTrials.gov helpfully provides a facility for downloading machine-readable XML files of its data. Here’s an example of a zipped file of 10 XML files.

Unfortunately, a big zipped folder of XML files is not that helpful. Even after parsing a whole bunch of trials into a single data frame in R, there are a few fields that are written in the least useful format ever. For example, the <study_design> field usually looks something like this:

Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment

So, I wrote a little R script to help us all out. Do a search on ClinicalTrials.gov, then save the unzipped search result in a new directory called search_result/ in your ~/Downloads/ folder. The following script will parse each XML file in that directory, adding each one as a row of a new data frame called “trials”, then it will explode the <study_design> field into individual columns.

So for example, based on the example field above, it would create new columns called “Allocation”, “Endpoint_Classification”, “Intervention_Model”, “Masking”, and “Primary_Purpose”, populated with the corresponding data.
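To make that concrete, here is a minimal sketch of the splitting step on its own, using only base R and the example field above. The variable names (study_design, pairs, keys, values, exploded) are mine for illustration; the regular expression is the same one the script uses.

```r
study_design <- paste0(
  "Allocation: Non-Randomized, Endpoint Classification: Safety Study, ",
  "Intervention Model: Single Group Assignment, Masking: Open Label, ",
  "Primary Purpose: Treatment"
)

# Split on commas that are NOT inside parentheses
pairs <- strsplit(study_design, ", *(?![^()]*\\))", perl = TRUE)[[1]]

# Text before the first colon is the key; text after it is the value
keys <- sub(":.*$", "", pairs)
values <- sub("^[^:]*: *", "", pairs)

# Replace spaces and punctuation so the keys are usable as column names
keys <- gsub("[[:punct:]]|\\s+", "_", keys)

exploded <- setNames(values, keys)
print(exploded)
```

Each element of exploded corresponds to one of the new columns, with the element's name as the column name and its value as that trial's entry.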

library(XML)   # xmlTreeParse, xmlRoot, xmlSApply, xmlValue
library(plyr)  # rbind.fill

# Change path as necessary
path <- "~/Downloads/search_result/"

# Match only file names that end in ".xml"
xml_file_names <- dir(path, pattern = "\\.xml$")

counter <- 1

# Makes data frame by looping through every XML file in the specified directory
for ( xml_file_name in xml_file_names ) {
  # Build the full path so the script works from any working directory
  xmlfile <- xmlTreeParse(file.path(path.expand(path), xml_file_name))
  xmltop <- xmlRoot(xmlfile)
  data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  if ( counter == 1 ) {
    trials <- data.frame(t(data), row.names = NULL)
  } else {
    newrow <- data.frame(t(data), row.names = NULL)
    trials <- rbind.fill (trials, newrow)
  }
  # This will be good for very large sets of XML files
  print (
    paste0(
      xml_file_name,
      " processed (",
      format(100 * counter / length(xml_file_names), digits = 2),
      "% complete)"
    )
  )
  counter <- counter + 1
}

# Data frame has been constructed. Comment out the following two loops
# (until the "un-cluttering" part) in the case that you are not interested
# in exploding the <study_design> column.

columns <- vector()

for ( stu_des in trials$study_design ) {
  # splits by commas NOT in parentheses
  for (pair in strsplit( stu_des, ", *(?![^()]*\\))", perl=TRUE)) {
    # keep the text before the first colon as the column name
    newcol <- substr( pair, 1, regexpr(':', pair) - 1 )
    columns <- c(columns, newcol)
  }
}

for ( newcol in unique(columns) ) {
  # get rid of spaces and special characters
  newcol <- gsub('([[:punct:]])|\\s+','_', newcol)
  if (newcol != "") {
    # add the new column
    trials[,newcol] <- NA
    i <- 1
    for ( stu_des2 in trials$study_design ) {
      for (pairs in strsplit( stu_des2, ", *(?![^()]*\\))", perl=TRUE)) {
        for (pair in pairs) {
          if ( gsub('([[:punct:]])|\\s+','_', substr( pair, 1, regexpr(':', pair) - 1 )) == newcol ) {
            # index by name, so the right column is filled even if it is not the last one
            trials[i, newcol] <- substr( pair, regexpr(':', pair) + 2, nchar(pair) )
          }
        }
      }
      i <- i+1
    }
  }
}

# Un-clutter the working environment

rm(i, counter, data, newcol, newrow, columns, pair, pairs,
   stu_des, stu_des2, xml_file_name, xml_file_names, xmlfile, xmltop)

# Get nice NCT id's

get_nct_id <- function ( row_id_info ) {
  # id_info holds several identifiers; keep only the one named "nct_id"
  return (unlist (row_id_info) ["nct_id"])
}

trials$nct_id <- lapply(trials$id_info, get_nct_id)

# Clean up enrolment field

trials$enrollment[trials$enrollment == "NULL"] <- NA

trials$enrollment <- as.numeric(trials$enrollment)
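If the nested loops above feel heavy, the same exploding step can be sketched as a small helper that parses one string at a time and then stacks the results. This is an alternative sketch in base R, not the script's own method: parse_study_design is a hypothetical name, and the two example strings are made up for illustration rather than taken from real records. The first one has a comma inside parentheses, to show why the split pattern skips those.

```r
# Hypothetical helper: turn one study_design string into a named character vector
parse_study_design <- function(stu_des) {
  if (length(stu_des) == 0 || is.na(stu_des)) return(character(0))
  # Split on commas that are NOT inside parentheses
  pairs <- strsplit(stu_des, ", *(?![^()]*\\))", perl = TRUE)[[1]]
  # Ignore fragments that have no "key: value" shape
  pairs <- pairs[grepl(":", pairs, fixed = TRUE)]
  keys <- gsub("[[:punct:]]|\\s+", "_", sub(":.*$", "", pairs))
  values <- sub("^[^:]*: *", "", pairs)
  setNames(values, keys)
}

# Made-up example values, for illustration only
example <- c(
  "Allocation: Randomized, Masking: Double Blind (Subject, Investigator)",
  "Allocation: Non-Randomized, Primary Purpose: Treatment"
)

parsed <- lapply(example, parse_study_design)
all_keys <- unique(unlist(lapply(parsed, names)))

# One row per trial, one column per key; keys a trial lacks become NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_keys]))),
  stringsAsFactors = FALSE
)
names(wide) <- all_keys
print(wide)
```

Under the assumption that trials$study_design holds one string per trial, something like cbind(trials, wide) would attach the new columns in one go, at the cost of departing from the loop-based approach in the script.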


    title = {How to get R to parse the <study_design> field from XML files},
    journaltitle = {The Grey Literature},
    author = {Benjamin Gregory Carlisle},
    address = {Montreal, Canada},
    date = 2016-10-6,
    url = {}


Carlisle, Benjamin Gregory. "How to get R to parse the <study_design> field from XML files" Web blog post. The Grey Literature. 06 Oct 2016. Web. 20 Feb 2017. <>


Carlisle, Benjamin Gregory. (2016, Oct 06). How to get R to parse the <study_design> field from XML files [Web log post]. Retrieved from

All content © Benjamin Carlisle