The risks and harms of 3rd party tech platforms in academia

CW: some strong language, description of abuse

Apologies for the rambly nature of this post. I wrote it in airports, partly out of frustration, and I may come back and make it more readable later.

In this post, I’m going to highlight some of the problems that come with using 3rd party tech companies’ platforms at an institutional level in academia. Tech companies have agendas that are not always compatible with academia, and we have mostly ignored that. Briefly, the core problem with using these technologies, and with entrenching them in academic life, is that it is an abdication of certain kinds of responsibility. We are giving up control over many of the structures that are necessary for participation in academic work and life, and the people we’re handing the keys to are often hostile to certain members of the academic community, in ways that are often difficult to see.

I have included a short “too long; didn’t read” at the end of each section, and some potential alternatives.

Using a tech company’s services is risky

There’s an old saying: “There’s no such thing as the cloud; it’s just someone else’s computer.” And it’s true, with all the risks that come associated with using someone else’s computer. The usual response to this is something along the lines of “I don’t care, I have nothing to hide.” But even if that’s true, that isn’t the only reason someone might have for avoiding the use of 3rd party tech companies’ services.

For starters, sometimes tech companies fail on a major scale that could endanger entire projects. Do you remember in 2017 when a bug in Google Docs locked thousands of people out of their own files because they were flagged as a violation of the terms of use?

https://twitter.com/widdowquinn/status/925360317743460352

Or more recently, here’s an example of a guy who got his entire company banned by Google by accident, proving that you can lose everything because of someone else’s actions:

TIFU by getting google to ban our entire company while on the toilet

And of course, this gets worse for members of certain kinds of minorities. Google and Facebook, for example, both have real-names policies, which are hostile to people who are trans and to Indigenous North Americans:

https://boingboing.net/2015/02/14/facebook-tells-native-american.html

There are other risks beyond just data loss—for example, if your research involves confidential data, then by putting it on a 3rd party server where others can access it, you may be overstepping the consent of your research subjects, and potentially violating the terms under which your institutional review board approved your study. This may also be the case for web apps that include Google Analytics.

tl;dr—If your academic work depends on a 3rd party tech company’s services, you risk losing your work at a critical time for reasons that have nothing to do with your own conduct, violating research subject consent, and excluding certain kinds of minorities.

Alternatives—In this section, I have mostly focused on data sharing risks. You can avoid Google Docs and Dropbox by syncing files between your own computers with Syncthing, or by installing an encrypted Nextcloud instance on a server you control.

Tech companies’ agendas are often designed to encourage abuse against certain minorities

I have touched on this already a bit, but it deserves its own section. Tech companies have agendas and biases that do not affect everyone equally. For emphasis: technology is not neutral. It is always a product of the people who built it.

For example, I have been on Twitter since 2011. I have even written Twitter bots. I have been actively tweeting for most of that time, both personally and about my research. And because I am a queer academic, I have been the target of homophobic trolls nearly constantly.

I have received direct messages and public replies to my tweets in which I was told to kill myself, called a “fag,” and in which a user told me he hopes I get AIDS. Twitter also closed my account for a short period of time because someone reported me for using a “slur”—you see, I used the word “queer.” To describe myself. And for this, I was locked out, and it took some negotiation with Twitter support, and the deletion of some of my tweets, to get back on.

I was off Twitter for a number of months because of this and out of a reluctance to continue to provide free content to a website that’s run by a guy who periodically retweets content that is sympathetic to white supremacists:

Twitter CEO slammed for retweeting man who is pro-racial profiling

And this isn’t something that’s incidental to Twitter / Facebook that could be fixed. It is a part of their core business model, which is about maximising engagement. And the main way they do that is by keeping people angry and yelling at each other. These platforms exist to encourage abuse, and they are run by people who will never have to endure it. That’s their meal-ticket, so to speak. And most of that abuse is directed at women, members of racial minorities and queer people.

I have been told that if I keep my Twitter account “professional” and avoid disclosing my sexuality that I wouldn’t have problems with abuse. I think the trolls would find me again if I did open a new account, but even if it were the case that I could go back into the closet, at least for professional purposes, there are four reasons why I wouldn’t want to:

  • My experience as a queer academic medical ethicist gives me a perspective that is relevant. I can see things that straight people miss, and I have standing to speak about those issues because of my personal experiences.
  • Younger queer people in academia shouldn’t have to wonder if they’re the only one in their discipline.
  • As a good friend of mine recently noted, it’s unfair to make me hide who I am, while the straight men all have “professor, father and husband” or the like in their Twitter bios.
  • I shouldn’t have to carefully avoid any mention of my boyfriend or my identity in order to participate in academic discussions, on pain of receiving a barrage of abuse from online trolls.

I’m not saying that everyone who uses Twitter or Facebook is bad. But I am extremely uncomfortable about the institutional use of platforms like Google/Facebook/Twitter for academic communications. When universities, journals, academic departments, etc. use them, they are telling us all that this kind of abuse is the price of entry into academic discussions.

tl;dr—Using 3rd-party tech company platforms for academic communications, etc. excludes certain people or puts them in the way of harm, and this disproportionately affects women, members of racial minorities and queer people.

Alternatives—In this section, I have mostly focused on academic communications. For micro-blogging, there is Mastodon, for example (there are even instances for science communication and for academics generally). If you are an institution like an academic journal, a working RSS feed (or several, depending on your volume of publications) is better than a lively Twitter account.

Tech companies are not transparent in their decisions, which often cannot be appealed

Some of the problems with using 3rd party tech company platforms go beyond just the inherent risks in using someone else’s computer, or abuse by other users—in many cases, the use of their services is subject to the whims of their support personnel, who may make poor decisions out of carelessness, a discriminatory policy, or for entirely inscrutable or undisclosed reasons. And because these are private companies, there may be nothing that compels them to explain themselves, and no way to appeal such a decision, leaving anyone caught in a situation like this unable to participate in some aspect of academic life.

For example, in the late 00’s, I tried to make a purchase with Paypal and received an error message. I hadn’t used my account for years, and I thought it was just that my credit card needed to be updated. On visiting the Paypal website, I found that my account had been closed permanently. I assumed this was a mistake that could be resolved, so I contacted Paypal support. They informed me that I had somehow violated their terms of use, and that this decision could not be appealed under any circumstance. The best explanation for this situation that I could ever get from them was, to paraphrase, “You know what you did.”

This was baffling to me, as I hadn’t used Paypal in years and I had no idea what I could have possibly done. I tried making a new account with a new email address. When I connected my financial details to this account, it was also automatically closed. I’ve tried to make a new account a few times since, but never with success. As far as I can tell, there is no way for me to ever have a Paypal account again.

And that wasn’t a problem for me until a few months ago when I tried to register for some optional sessions at an academic conference that my department nominated me to attend. In order to confirm my place, I needed to pay a deposit, and the organizers only provided Paypal (not cash or credit card) as a payment option.

And this sort of thing is not unique to my situation either. Paypal has a long, terrible and well-documented history of arbitrarily closing accounts (and appropriating any money involved). This is usually in connexion with Paypal’s bizarre and sometimes contradictory policies around charities, but this also affects people involved in sex work (reminder: being a sex worker is perfectly legal in Canada).

Everything worked out for me in my particular situation at this conference, but it took work. After several emails, I was eventually able to convince them to make an exception and allow me to pay by cash on arrival, but I still had to go through the process of explaining to them why I have no Paypal account, why making a new one wouldn’t work, and that I wasn’t just being a technophobe or difficult to work with on purpose. I was tempted to just opt out of the sessions because I didn’t want to go through the embarrassment of explaining my situation.

And my problem with Paypal was a “respectable” one—it’s just some weird mistake that I’ve never been able to resolve with Paypal. Now imagine trying to navigate a barrier to academic participation like that if you were a person whose Paypal account was closed because you got caught using it for sex work. Do you think you’d even try to explain that to a conference organizer? Or would you just sit those sessions out?

tl;dr—When you use services provided by tech companies, you may be putting up barriers to entry for others that you are unaware of.

Alternatives—This section was about money, and there aren’t that many good solutions. Accept cash. And when someone asks for special accommodation, don’t ask them to justify it.

Conclusion

Technology isn’t neutral. It’s built by people, who have their own biases, agendas and blind-spots. If we really value academic freedom, and we want to encourage diversity in academic thought, we need to be very critical about the technology that we adopt at the institutional level.

How to get R to parse the <study_design> field from clinicaltrials.gov XML files

Clinicaltrials.gov helpfully provides a facility for downloading machine-readable XML files of its data. Here’s an example of a zipped file of 10 clinicaltrials.gov XML files.

Unfortunately, a big zipped folder of XML files is not that helpful. Even after parsing a whole bunch of trials into a single data frame in R, there are a few fields that are written in the least useful format ever. For example, the <study_design> field usually looks something like this:

Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment

So, I wrote a little R script to help us all out. Do a search on clinicaltrials.gov, then save the unzipped search result in a new directory called search_result/ in your ~/Downloads/ folder. The following script will parse each XML file in that directory, combining them into a single data frame called “trials”, and then it will explode the <study_design> field into individual columns.

So for example, based on the example field above, it would create new columns called “Allocation”, “Endpoint_Classification”, “Intervention_Model”, “Masking”, and “Primary_Purpose”, populated with the corresponding data.
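For instance, here is roughly what the splitting step does to the example field above, using the same comma-not-inside-parentheses regular expression as the script below (a quick illustration only; the script additionally swaps spaces and punctuation for underscores when naming the columns):

design <- "Allocation: Non-Randomized, Endpoint Classification: Safety Study, Intervention Model: Single Group Assignment, Masking: Open Label, Primary Purpose: Treatment"

# Split on commas that are NOT inside parentheses
strsplit(design, ", *(?![^()]*\\))", perl = TRUE)[[1]]
# Result: "Allocation: Non-Randomized", "Endpoint Classification: Safety Study",
#         "Intervention Model: Single Group Assignment", "Masking: Open Label",
#         "Primary Purpose: Treatment"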

require ("XML")
require ("plyr")

# Change path as necessary
path = "~/Downloads/search_result/"

setwd(path)
xml_file_names <- dir(path, pattern = ".xml")

counter <- 1

# Makes data frame by looping through every XML file in the specified directory
for ( xml_file_name in xml_file_names ) {
  
  xmlfile <- xmlTreeParse(xml_file_name)
  
  xmltop <- xmlRoot(xmlfile)
  
  data <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  
  if ( counter == 1 ) {
    
    trials <- data.frame(t(data), row.names = NULL)
    
  } else {
    
    newrow <- data.frame(t(data), row.names = NULL)
    trials <- rbind.fill (trials, newrow)
    
  }
  
  # This will be good for very large sets of XML files
  
  print (
    paste0(
      xml_file_name,
      " processed (",
      format(100 * counter / length(xml_file_names), digits = 2),
      "% complete)"
    )
  )
  
  counter <- counter + 1
  
}

# Data frame has been constructed. Comment out the following two loops
# (until the "un-cluttering" part) in the case that you are not interested
# in exploding the <study_design> column.

columns <- vector()

for ( stu_des in trials$study_design ) {
  # splits by commas NOT in parentheses
  for (pair in strsplit( stu_des, ", *(?![^()]*\\))", perl=TRUE)) {
    newcol <- substr( pair, 0, regexpr(':', pair) - 1 )
    columns <- c(columns, newcol)
  }
}

for ( newcol in unique(columns) ) {
  
  # get rid of spaces and special characters
  newcol <- gsub('([[:punct:]])|\\s+','_', newcol)
  
  if (newcol != "") {
    
    # add the new column
    trials[,newcol] <- NA
    
    i <- 1
    
    for ( stu_des2 in trials$study_design ) {
      
      for (pairs in strsplit( stu_des2, ", *(?![^()]*\\))", perl=TRUE)) {
        
        for (pair in pairs) {
          
          if ( gsub('([[:punct:]])|\\s+','_', substr( pair, 0, regexpr(':', pair) - 1 )) == newcol ) {
            
            trials[i, ncol(trials)] <- substr( pair, regexpr(':', pair) + 2, 100000 )
            
          }
          
        }
        
      }
      
      i <- i+1
      
    }
    
  }
  
}

# Un-clutter the working environment

rm(i, counter, data, newcol, newrow, columns, pair, pairs,
   stu_des, stu_des2, xml_file_name, xml_file_names, xmlfile, xmltop)

# Get nice NCT id's

get_nct_id <- function ( row_id_info ) {
  
  return (unlist (row_id_info) ["nct_id"])
  
}

# sapply() gives a plain character vector rather than the list column lapply() would create
trials$nct_id <- unname(sapply(trials$id_info, get_nct_id))

# Clean up enrolment field

trials$enrollment[trials$enrollment == "NULL"] <- NA

trials$enrollment <- as.numeric(trials$enrollment)
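Once the script has run, the exploded columns can be used like any other variables in the data frame. For example (assuming your search results contain <study_design> fields like the example above, so that columns such as “Allocation” and “Masking” actually exist):

# Quick sanity checks on the exploded columns
head(trials[, c("nct_id", "Allocation", "Masking", "Primary_Purpose")])
table(trials$Allocation, useNA = "ifany")
summary(trials$enrollment)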

Useful references:

  • https://www.r-bloggers.com/r-and-the-web-for-beginners-part-ii-xml-in-r/
  • http://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns
  • http://stackoverflow.com/questions/21105360/regex-find-comma-not-inside-quotes

Gotcha! This is why piracy happens

Stata

This summer, I took a two-week long course on systematic reviews and meta-analytic techniques for which there was some required software, in this case, Stata. As a McGill student, I was encouraged to buy the student version, which was about $50 for “Stata Small.” Not bad. I’ve paid more for textbooks. So I got out my credit card, bought the license, installed it on my computer, and ran the very first example command of the course. I immediately got a string of red letter error text.

The error message was telling me that my license did not allow me enough variables to complete the command. I checked the license, and it said I was allowed 120 variables. I checked the “Variable manager” in Stata, and I had only assigned 11 variables. (I checked the variable limit beforehand in fact, and made sure that none of the data sets that we’d be working with had more than 120 variables. None of them came close to that limit.)

So I emailed Stata technical support. It turns out that the meta-analysis package for Stata creates “hidden variables.” Lots of them, apparently. So many that the software could not complete the most basic commands. Then they tried to up-sell me to “Stata SE.” For $100 more, they said, they would send me a license for Stata that would allow me to run the meta-analysis package—for realsies this time.

I asked for a refund and decided that if I really needed Stata, I would use the copy that’s installed on the lab computers. (Now I’m just using the meta package in R, which does everything Stata does, just with a bit more effort.)
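For anyone curious, a basic meta-analysis in R with the meta package looks something like the following. This is a minimal sketch with made-up toy numbers, just to show the shape of the workflow; metagen() and forest() are functions from the meta package.

library(meta)

# Toy effect estimates and standard errors, for illustration only
dat <- data.frame(
  study = c("Trial A", "Trial B", "Trial C"),
  TE    = c(0.12, -0.05, 0.20),   # treatment effects (e.g. mean differences)
  seTE  = c(0.05, 0.08, 0.07)     # their standard errors
)

m <- metagen(TE = TE, seTE = seTE, studlab = study, data = dat, sm = "MD")
summary(m)
forest(m)    # forest plot of the individual and pooled estimates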

For the record: I am perfectly fine with paying for good software. I am not okay with a one-time purchase turning me into a money-pump. I thought that the “small” student license would work. All their documentation suggested it would. If I had upgraded to “Stata SE,” would that have actually met my needs, or would they have forced me to upgrade again later, after I’d already made Stata a part of my workflow?

It probably would have been okay, but the “gotcha” after the fact soured me on the prospect of sending them more money, and provided all the incentive I need to find a way to not use Stata.

iTunes

A few years ago, I bought a number of pieces of classical music through the iTunes Store. I shopped around, compared different performances, and found recordings that I really liked. This was back when the iTunes store had DRM on their music.

I’ve recently switched to Linux, and now much of the music that I legally bought and paid for can’t be read by my computer. Apple does have a solution for me, of course! For about $25, I can subscribe to a service of theirs that will allow me to download a DRM-free version of the music that I already paid for.

This is why I won’t even consider buying television programmes through the iTunes Store: It’s not that I think that I will want to re-watch the shows over and over and I’m afraid of DRM screwing that up for me. It’s because I’ve had some nasty surprises from iTunes in the past, and I can borrow the DVDs from the Public Library for free.

For the record: I do not mind paying for digital content. But I won’t send you money if I think there’s a “gotcha” coming after the fact.

I’m really trying my best

People who produce good software or music should be compensated for their work. I don’t mind pulling out my wallet to help make that happen. But I don’t want to feel like I’m being tricked, especially if I’m actually making an effort in good faith to actually pay for something.

Since DRM is almost always fairly easily circumvented, it only punishes those who pay for digital content. And this is why I’m sympathetic to those who pirate software, music, TV shows, etc.

Proof of prespecified endpoints in medical research with the bitcoin blockchain

NOTICE (2022-05-24)

This blog post was written in 2014, when I still naively hoped that the myriad problems with cryptocurrency might still be solved. I am now somewhat embarrassed to have written this in the first place, but will leave the post up for historical reasons. (Quite a number of medical journal articles link here now, for better or for worse.)

While the following methods are valid as far as they go, I absolutely DO NOT recommend actually using them to timestamp research protocols. In fact, I recommend that you never use a blockchain for anything, ever.

Introduction

The gerrymandering of endpoints or analytic strategies in medical research is a serious ethical issue. “Fishing expeditions” for statistically significant relationships among trial data or meta-analytic samples can confound proper inference by statistical multiplicity. This may undermine the validity of research findings, and even threaten a favourable balance of patient risk and benefit in certain clinical trials. “Changing the goalposts” for a clinical trial or a meta-analysis when a desired endpoint is not reached is another troubling example of a potential scientific fraud that is possible when endpoints are not specified in advance.

Pre-specifying endpoints

Choosing endpoints to be measured and analyses to be performed in advance of conducting a study is a hallmark of good research practice. However, if a protocol is published on an author’s own web site, it is trivial for an author to retroactively alter her own “pre-specified” goals to align with the objectives pursued in the final publication. Even a researcher who is acting in good faith may find it less than compelling to tell her readers that endpoints were pre-specified, with only her word as a guarantee.

Advising a researcher to publish her protocol in an independent venue such as a journal or a clinical trial registry in advance of conducting research does not solve this problem, and even creates some new ones. Publishing a methods paper is a lengthy and costly process with no guarantee of success—it may not be possible to find a journal interested in publishing your protocol.

Pre-specifying endpoints in a clinical trial registry may be feasible for clinical trials, but these registries are not open to meta-analytic projects. Further, clinical trial registry entries may be changed, and it is much more difficult (although still possible) to download previous versions of trial registries than it is to retrieve the current one. For example, there is still no way to automate downloading of XML-formatted historical trial data from www.clinicaltrials.gov in the same way that the current version of trial data can be automatically downloaded and processed. Burying clinical trial data in the “history” of a registry is not a difficult task.

Publishing analyses to be performed prior to executing the research itself potentially sets up a researcher to have her project “scooped” by a faster or better-funded rival research group who finds her question interesting.

Using the bitcoin blockchain to prove a document’s existence at a certain time

Bitcoin uses a distributed, permanent, timestamped, public ledger of all transactions (called a “blockchain”) to establish which addresses have been credited with how many bitcoins. The blockchain indirectly provides a method for establishing the existence of a document at particular time that can be independently verified by any interested party, without relying on a medical researcher’s moral character or the authority (or longevity) of a central registry. Even in the case that the NIH’s servers were destroyed by a natural disaster, if there were any full bitcoin nodes left running in the world, the method described below could be used to confirm that a paper’s analytic method was established at the time the authors claim.

Method

  1. Prepare a document containing the protocol, including explicitly pre-specified endpoints and all prospectively planned analyses. I recommend using a non-proprietary document format (e.g. an unformatted text file or a LaTeX source file).
  2. Calculate the document’s SHA256 digest and convert it to a bitcoin private key (see the sketch after this list).
  3. Import this private key into a bitcoin wallet, and send an arbitrary amount of bitcoin to its corresponding public address. After the transaction is complete, I recommend emptying the bitcoin from that address to another address that only you control, as anyone given the document prepared in (1) will have the ability to generate the private key and spend the funds you just sent to it.
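Here is a minimal R sketch of step (2), assuming the digest package is installed and that the protocol is saved as “protocol.txt” (a placeholder file name). The 64-character hex digest is a 256-bit number, which is exactly the raw material for a bitcoin private key; converting it to the WIF format that most wallet software expects, and the rest of step (3), is left to your wallet tooling.

library(digest)

# SHA256 digest of the pre-specified protocol document, as a hex string
protocol_digest <- digest("protocol.txt", algo = "sha256", file = TRUE)
print(protocol_digest)

# Later verification is just recomputing the digest from the copy of the
# protocol you were given and checking that it matches the one used to
# generate the key and address whose transaction appears in the blockchain
identical(digest("protocol.txt", algo = "sha256", file = TRUE), protocol_digest)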

Result

The incorporation into the blockchain of the first transaction using the address generated from the SHA256 digest of the document provides an undeniably timestamped record that the research protocol prepared in (1) is at least as old as the transaction in question. Care must be taken not to accidentally modify the protocol after this point, since only an exact copy of the original protocol will generate an identical SHA256 digest. Even the alteration of a single character will make the document fail an authentication test.

To prove a document’s existence at a certain point in time, a researcher need only provide the document in question. Any computer would be able to calculate its SHA256 digest and convert it to a private key with its corresponding public address. Anyone can search for transactions on the blockchain that involve this address, and check the date when the transaction happened, proving that the document must have existed at least as early as that date.

Discussion

This strategy would prevent a researcher from retroactively changing an endpoint or adding / excluding analyses after seeing the results of her study. It is simple, economical, trustless, non-proprietary, independently verifiable, and provides no opportunity for other researchers to steal the methods or goals of a project before its completion.

Unfortunately, this method would not prevent a malicious team of researchers from preparing multiple such documents in advance, in anticipation of a need to defraud the medical research establishment. To be clear, under a system as described above, retroactively changing endpoints would no longer be a question of simply deleting a paragraph in a Word document or in a trial registry. This level of dishonesty would require planning in advance (in some cases months or years), detailed anticipation of multiple contingencies, and in many cases, the cooperation of multiple members of a research team. At that point, it would probably be easier to just fake the numbers than it would be to have a folder full of blockchain-timestamped protocols with different endpoints, ready in case the endpoints need to be changed.

Further, keeping a folder of blockchain-timestamped protocols would be a very risky pursuit—all it would take is a single honest researcher in the lab to find those protocols, and she would have a permanent, undeniable and independently verifiable proof of the scientific fraud.

Conclusion

Fraud in scientific methods erodes confidence in the medical research establishment, which is essential to its function of generating new scientific knowledge, and cases where pre-specified endpoints are retroactively changed cast doubt on the rest of medical research. A method by which anyone can verify the existence of a particular detailed protocol prior to research would lend support to the credibility of medical research, and be one less thing about which researchers have to say, “trust me.”

Why I dumped Gmail

Reason one: I need my email to work, whether I follow the rules on Google Plus or not

Google has linked so many different products with so many different sets of rules to the same account that I feel like I can’t possibly know when I am breaking some of its terms of use. And I’m not even talking about specifically malicious activity, like using software to scrape information from a Google app or a DDoS attack. I mean something as basic as using a pseudonym on Google Plus, or a teenager revealing that she lied about her age when signing up for her Gmail account. (These are both things that have brought about the deletion of a Google account, including Gmail.)

For starters, I think it is a dangerous and insensitive policy to require all users to use their real names on the Internet, but putting that aside, I don’t want to risk having all my emails deleted and being unable to contact anyone because of some Murph / Benjamin confusion on Google Plus.

Reason two: it’s actually not okay for Google to read my email

Google never made it a secret that they read everyone’s email. Do you remember when you first started seeing the targeted ads in your Gmail? I bet you called a friend over to look. “Look at this,” you said, “we were just talking about getting sushi tonight, and now there’s an ad for Montréal Sushi in my mailbox! That’s so creepy,” you said.

And then you both laughed. Maybe you made a joke about 1984. Over time, you got comfortable with the fact that Google wasn’t even hiding the fact that they read your mail. Or maybe you never really made the connexion between the ads and the content of your email. Maybe you thought, “I have nothing to hide,” and shrugged it off, or did some mental calculation that the convenience of your Gmail was worth the invasion of privacy.

I guess over time I changed my mind about being okay with it.

And no, this isn’t because I have some huge terrible secret, or because I’m a criminal or anything like that. I just don’t want to send the message that I’m okay with this sort of invasion of privacy anymore. Google’s unspoken challenge to anyone who questions their targeted ads scheme has always been, This is the price you pay for a free service like Gmail. If you don’t like it, you can leave.

This is me saying, I don’t like it. I’m leaving.

Reason three: Gmail isn’t even that good anymore

When I signed up for Gmail, there were three things that set it apart:

  1. Tag and archive emails—forget folders!
  2. 10 gigabytes of space—never delete an email again!
  3. Web-based interface—access it from anywhere!

I’ll deal with each of these in turn.

1. Tagging was fun, but it only really works in the Gmail web interface, or in an app specifically designed for use with Gmail. Unfortunately, Gmail just doesn’t play nicely with other email apps, like the one in Mac OS X, or Mail on the iPhone or the BlackBerry. You could make it work through IMAP, having it tell your mail client that each tag was a folder, but it was always a bit screwy, and I never figured out how to put something in two tags through a 3rd-party app or mobile device.

The value of being able to organise emails by having them in two categories at once was outweighed by the fact that I couldn’t access this functionality except through the browser.

2. The amount of space that Gmail provides for emails is not very much these days. I have a website (you may have guessed) and it comes with unlimited disc space for web hosting and emails. 10 gigabytes is just not that big a deal anymore.

3. I can do this with my self-hosted email as well, and I don’t have to suffer through an interface change (“upgrade”) just because Google says so.

So what’s the alternative?

Full disclosure: I haven’t shut down my Google account. I’m forwarding my Gmail to my self-hosted email account, so people who had my old Gmail account can still contact me there for the foreseeable future. I am also still using a number of other Google products, like the Calendar and Google Plus, but my life would not go down in flames quite so quickly if those stopped working as compared to a loss of email access.

Basically, I am moving as many “mission critical” aspects of my life away from Google as I can, to keep my technological eggs in a few more baskets. Email, for example, will be handled by my web host, and I make backups of it on a regular basis.

I’m not trying to go cold-turkey on Google. I’m just not going to pretend to be as comfortable as I used to be as a guest on Google’s servers.

Update (2013 Nov 18)

I switched back to the Thunderbird email client a couple weeks ago. It supports tagging and archiving, just like Gmail.

Update (2018)

I switched to Protonmail!

How to automatically back up WordPress or ownCloud using cron jobs

Recently I set up WordPress for my research group in the Medical Ethics Unit. We will be blogging our journal clubs and posting links to our publications and upcoming events. In related news, my research group has been using Dropbox to coordinate papers in progress, raw data, citations, and all manner of other information. This was working pretty well, but we have been bumping up against the upper limit of our capacity on Dropbox for a while, so I installed ownCloud on the web host we got for the research group blog. I’m pretty happy with how nice it is to use and administer.

Of course one of our concerns is making sure that we don’t lose any data in the case of the failure of our web host. This is unlikely, but it does happen, and we don’t want to run into a situation where we try to log in to our cloud-based file storage / sharing service and find that months’ worth of research is gone forever.

For a few weeks, the following was more-or-less my workflow for making backups:

  1. Log in to phpMyAdmin
  2. Make a dump file of the WP database (choose database > Export > Save as file … )
  3. Make a dump file of the ownCloud database
  4. Save to computer and label with appropriate date
  5. Log in to web server using FTP
  6. Copy contents of WP’s /wp-content/ to a date-labelled folder on my computer
  7. Copy contents of ownCloud’s /data/ to a date-labelled folder on my computer

This worked pretty well, except that it was a pain for me to have to do this every day, and I know that if I ever forgot to do it, that would be when something terrible happened. Fortunately for me, my boss mentioned that he had an old but still serviceable iMac sitting in his office that he wanted to put to some good purpose.

I decided to make a fully automatic setup that would make backups of our remotely hosted data and save it locally without any input on my part, so I can just forget about it. I made it with cron jobs.

Server side cron jobs

First, I set up some cron jobs on the server side. The first one waits until midnight every day, then dumps all the MySQL databases into a gzipped file on my web host, then zips up the WordPress /wp-content/ and ownCloud /data/ folders and puts them in the backup folder as well. The second server-side cron job empties the backup folder every day at 23h00.

  • 0 0 * * * PREFIX=`date +%y-%m-%d`; mysqldump -u USERNAME -h HOSTNAME -pPASSWORD --all-databases | gzip > /path/to/backup/folder/${PREFIX}-DBNAME-db.sql.gz; zip -r /path/to/backup/folder/${PREFIX}-wordpress-files.zip /path/to/wordpress/wp-content/; zip -r /path/to/backup/folder/${PREFIX}-owncloud-files.zip /path/to/owncloud/data/;
  • 0 23 * * * rm -r /path/to/backup/folder/*

A few notes for someone trying to copy this set-up

  • Your web host might be in a different time zone, so you might need to keep that in mind when coordinating cron jobs on your web host with ones on a local machine.
  • My web host provided a cron job editor that automatically escapes special characters like %, but you might have to add back-slashes to make yours work if you’re manually editing with crontab -e.
  • You might want to put a .htaccess file in your backup directory with the following in it: “Options -Indexes” (remove the quotes of course). This stops other people from going to your backup directory in a browser and helping themselves to your files. You could also name your backup directory with a random hash of letters and numbers if you wanted to make it difficult for people to steal your backed-up data.

Local cron job

Then on the local machine, the old iMac, I set up the following cron job. It downloads the files and saves them to a folder on an external hard disc every day at 6h00.

  • 0 6 * * * PREFIX=`date +%y-%m-%d`; curl http://www.your-web-site.com/back-up/${PREFIX}-DBNAME-db.sql.gz > "/Volumes/External HD/Back-ups/${PREFIX}-DBNAME-db.sql.gz"; curl http://www.your-web-site.com/back-up/${PREFIX}-wordpress-files.zip > "/Volumes/External HD/Back-ups/${PREFIX}-wordpress-files.zip"; curl http://www.your-web-site.com/back-up/${PREFIX}-owncloud-files.zip > "/Volumes/External HD/Back-ups/${PREFIX}-owncloud-files.zip";

If you were super-paranoid about losing data, you could install this on multiple local machines, or you could change the timing so that the cron jobs run twice a day, or as often as you liked, really. As long as they’re always turned on, connected to the internet and they have access to the folder where the backups will go, they should work fine.

Stoop-n-scoop

This isn’t a super-secure way to back up your files, but then we’re more worried about losing data accidentally than having it stolen maliciously. I don’t think the world of medical ethics is cut-throat enough that our academic rivals would stoop to stealing our data in an effort to scoop our papers before we can publish them. That said, I’m not about to give away the exact URL where our backups are stored, either.

The practical upshot of all this is that now we have at least three copies of any file we’re working on. There’s one on the computer being used to edit the document, there’s one stored remotely on our web host, and there’s a copy of all our files backed up once a day on the old iMac at the Medical Ethics Unit.

Internet vigilante justice against the police in Montréal through social media

I hate Instagram too, but arresting someone for using it is ridiculous

It’s hard to trust the police in Montréal these days. “Officer 728” is a household name, known for her abuse of power, which was caught on video. There was also a famous CCTV video of a prostrate man being brutally kicked repeatedly by the Montréal police. This problem isn’t restricted to Montréal either. Recently a police officer in Vancouver was caught on video punching a cyclist in the face while putting him in handcuffs.

Technology and the abuse of police power

I used to largely dismiss reports of police abuses of power. When I saw graffiti saying, “eff the police” or something to that effect, I used to chalk it up to conspiracy theorists and delinquent youths. Now that it’s all on Youtube, it’s harder to ignore the problem.

(I also used to dismiss those who spray-painted “burn the banks” in a number of parts of Montréal as conspiracy theorists, but since 2008, I can kind of see where they’re coming from.)

We’re entering into an age when abuses of power by police are being caught on tape more and more often. I don’t think that police abusing their power is a new thing, or even that the rates have changed recently. I’m of the position that it might just be more visible because of the recent development that nearly everyone is carrying around a camera in their pocket that can instantly upload video of police brutality to Youtube. The Google Glass project (and the clones that are sure to follow) may make this even more common.

This is unsettling to me, partly because it might mean that a lot of the times I dismissed claims of police abuse, I was in the wrong.

We should all be legitimately outraged by this

More importantly though, this should make us all angry because this is not how justice works in Canada. Even if the robbery suspect was completely guilty of every crime the police suspected, we don’t allow individual police officers to dole out their own personal vengeance in the form of physical beatings. We certainly don’t allow groups of police officers to do so against suspected criminals as they lie helpless in the snow, and most emphatically, there is no place in Canadian justice for criminals to be punished in this way (or any other) without due process or without even having been formally charged with a crime.

A police officer punching a restrained person is much worse than a regular citizen punching another citizen. This is because the police are, so to speak, the final guarantee that the government has power over its citizens and that there is the rule of law in a country. The most basic reason for others not to steal your stuff is that if they do, there’s a good chance that the police will come and take away their freedom in such a way that it’s not worth it for most people to engage in that behaviour. All laws largely work on the same principle. Sure, there are other sanctions that a government can use, like taxation, but even that is underwritten by the threat of police coming and putting you in prison if you break the tax laws.

So, when a police officer physically abuses a citizen, he shakes our faith in the proper functioning of the machinery of government. This makes the issue not just one of bad PR for a particular police department, but one of general faith in our country to work in a just and equitable way. Further, if the police are vigilantes and there is no recourse, it legitimizes vigilante justice by the people against the police.

This means that when a police officer abuses his power, there must be some recourse that is transparent, timely and just. There can’t even be the appearance that the police are above the law, otherwise what you will see is ordinary citizens taking the law into their own hands to bring the police to justice, which is a very scary prospect.

Ordinary citizens are taking the law into their own hands to bring the police to justice

In response to the issues I have described above, as well as a number of much less famous examples of abuse of police power during the protests in Montréal, there has been a movement toward the use of social media to identify the police who are abusing their power. This is being done by citizens who believe that there has been abuse of power by police in Montréal, and that the normal channels of addressing these abuses have been of no avail.

They are collecting photos, videos, identification numbers, names and addresses of police officers, cataloguing their transgressions and calling for retribution.

The police are calling this “intimidation.” They are calling for it to be taken down. They’re (rightly) complaining that there is no way for a police officer who is wrongly accused in this way to clear his name, and that the police, and even some non-police are being put in danger because of this.

What needs to happen

I have not been involved in the student protests in Montréal. I have never been beaten by the police. I generally believe that if I call 911, thanks to my skin colour, it will be the “good guys” who show up at my door. That said, I can understand why someone who was abused by a police officer might be tempted to post this information out of frustration at the ineffectiveness of the official recourse against such abuse.

In some ways, the police have been implicitly training us to use these methods if we want anything to get done: Likely the police officer from Vancouver would have gotten away with punching the cyclist in the face if the cyclist’s friend hadn’t caught it on video and posted it to Youtube.

If the police want us to use official channels to address police abuses, they have to give us reason to think that it’s better to do that than to just appeal to the Internet for justice. Politically-motivated arrests of people for posting “intimidating” things online won’t cut it.

I think we will only see a real change in public attitudes toward police brutality given the following three conditions.

  1. The official channels must be transparent. It must be clear to everyone that something is being done, and we have to see that officers who abuse their power are appropriately punished. Confidence in the relationship between the state and its citizens is what’s at stake, and so the solution must be one that publicly restores confidence.
  2. Official channels must be timely. The old adage, “justice delayed is justice denied” applies here. If citizens get impatient waiting for justice to be applied, they may be tempted to take it into their own hands.
  3. Finally, official recourse against police abuse must be just. This is where an official means of recourse against police brutality could actually outdo Internet vigilantes. Internet vigilante justice will always be faster and more transparent than anything official could ever be, but an official channel can enforce punishments fitting to the crime, and can claim legitimacy in a way that vigilantes never can.

If a police officer publicly faced criminal charges, rather than just a “paid leave of absence” followed by “counselling”, and this happened in short order after an accusation of abuse, this would do a lot to restore faith in official channels. The people of Montréal might even learn that the legitimate checks and balances are preferable to pursuing vigilante justice through social media.

Conventional computing vs the corporate cloud vs the “personal” cloud

Everyone loves cloud computing. Users love it, tech blogs love it, and tech companies are all trying their hand at it—even ones who have no concept of how to provide a half-decent web service. And yes, I’m talking about Apple’s iTools. I mean, dot-Mac. Oh sorry, it’s called iCloud now. Whatever it’s called, it’s still terrible.

More interesting to me than the corporate offerings of cloud-based services (and in some cases withdrawals of those offerings, e.g. Google Reader) is all the new open-source cloud-based software available for anyone to install on their own web host of choice. To clarify, I’m talking about pieces of software that are more like WordPress than Microsoft Word—this is software that you install on a web server, and that you access through a browser, not software that you install on your own home computer. I will refer to this type of software as “personal” cloud software.

Here are a few examples of different categories of software, and rough equivalents for conventional computing, corporate cloud offerings and “personal” cloud alternatives. This is not meant to be a comprehensive list of such services, just a list of examples. Also, the examples given here aren’t meant to be endorsements of the services either.

Category | Conventional computing | Corporate cloud | “Personal” cloud
Document editors | Microsoft Word, OpenOffice, Pages | Google Docs, Microsoft Web Office, OX Documents? | WordPress (sort of?)
Email | Outlook, Thunderbird, Mail.app | Gmail, Hotmail, Yahoo Mail | Squirrelmail, etc.
Note-keeping | Any text editor, really | Evernote, Notes.app, Google Keep | OwnCloud
Photos | iPhoto, Lightroom, Aperture | Flickr, G+ / FB | OpenPhoto
File storage | Hard disc | Dropbox, Google Drive | OwnCloud
Music | iTunes / iPod | Your favourite music streaming service, Youtube | OwnCloud
RSS reader | Newsfire, etc. | Google Reader (hahaha), Feedly | Selfoss

Usually the debate is framed as being between conventional computing and corporate cloud computing. Sometimes a very nuanced look into these different services will compare different corporate cloud-based services, but rarely does anyone compare the pros and cons of conventional vs corporate cloud vs “personal” cloud services. So, as far as I can see, the following are the major issues to consider. Depending on your own level of technical expertise, your priorities, budget and the level of importance that you assign to a particular task that you wish to perform, you may weight these differently. For simplicity, I assigned each category a value of +1 (this is good), -1 (this is bad) or 0 (this isn’t very good or very bad).

Issue | Conventional computing | Corporate cloud | “Personal” cloud
Who has access to your files? | Only you (+1) | You, corporation, NSA (-1) | You, web host (0)
Who owns the software? | You own a licence (0) | Corporation (-1) | Often open source (+1)
When do you pay? | Only once, when you buy the software (0) | Never (+1) | Every month (-1)
Can a company mine your data for advertising info? | No (+1) | Yes (-1) | No (+1)
Are there advertisements? | No (+1) | Often, yes (-1) | No (+1)
Accidentally losing files? | Very possible (-1) | Unlikely (+1) | Unlikely (+1)
Rolling back to previous versions? | Only if you make backups (0) | Often yes (+1) | Often yes (+1)
Open source software? | Sometimes (0) | No (-1) | Almost always (+1)
Level of technical expertise required to install software? | Medium (0) | Low (+1) | High (-1)
Can the whole service be “Google Reader-ed”? | No, but development of your app might be cancelled (0) | Yes (-1) | No (+1)
Whose computer must be working for you to access your files, etc.? | Only yours (+1) | The corporation’s (-1) | Your web host’s (-1)
Can you collaborate with other users? | Not really (unless you count “track changes”) (-1) | Yes (+1) | Yes (+1)
Accessing / syncing content across multiple devices? | No (-1) | Yes (+1) | Yes (+1)
Security depends on whom? | You (+1) | Corporation (-1) | You + web host + software developer (-1)
Is your work available when the internet goes down? | Yes (+1) | No (-1) | No (-1)

If you aren’t scared off by MySQL databases or PHP, the “level of technical expertise” row might be scored differently, or if you doubt your own ability to keep your files secure, you might think that your work’s security depending on Google is a good thing. Haggling over the pros and cons aside, it’s kind of an interesting result of this exercise that unless you’re really scared of losing work, or unless multi-user collaboration is very important to you, you might be better off avoiding cloud services entirely.

Another interesting result: if it comes down to a choice between a corporate cloud service and a “personal” cloud service, it looks like the “personal” cloud is the way to go—it beats the corporate cloud on every category except price and ease of installation. (And also possibly security.)
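For what it’s worth, here is a quick tally of the scores in the table above. This is a rough R sketch; the vectors just transcribe the +1/0/-1 values row by row, so the totals are only as good as my bookkeeping.

# Scores per row, in the order the rows appear in the table above
conventional <- c( 1,  0,  0,  1,  1, -1,  0,  0,  0,  0,  1, -1, -1,  1,  1)
corporate    <- c(-1, -1,  1, -1, -1,  1,  1, -1,  1, -1, -1,  1,  1, -1, -1)
personal     <- c( 0,  1, -1,  1,  1,  1,  1,  1, -1,  1, -1,  1,  1, -1, -1)

c(conventional = sum(conventional),
  corporate    = sum(corporate),
  personal     = sum(personal))
# Totals: conventional 3, corporate -3, personal 4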

Edit (2013 Apr 6): I have added a row for “accessing content across multiple devices.” (Thanks Morty!)

Edit (2013 June 15): In light of recent revelations regarding the NSA’s surveillance, I have added them to the row for “Who has access to your files?”

The Kübler-Ross stages of grief and an open-source solution to the death of Google Reader

Over the past week, I was actually in the middle of writing a blog post about how I sometimes toy with the idea of switching to Ubuntu, just so that my technological life is not entirely beholden to any particular company’s corporate whims. I didn’t quite finish that post before Google very famously killed off its well-loved news aggregator, Google Reader. Most users of Google Reader are going through the classic Kübler-Ross stages of grief:

  1. We all experienced the initial shock and denial. (“What? There is no way they’re shutting Google Reader down.”)
  2. Anger followed.
  3. Then the bargaining.
  4. Next people will get sad about it. They probably won’t blog sad things about Google Reader, though, out of fear of looking pathetic.
  5. As far as acceptance goes, lots of people are now trying to profit from this, by selling their own alternatives to Google Reader. Digg has decided to make building a new aggregator a priority. Users are largely scrambling to find another reader.

My solution to the Google Reader problem

I used to use Newsfire before I switched to Google Reader, but in the time that has elapsed since then, they started charging $5 for it. That’s not a lot, but then I was getting Google Reader for free, so I kept looking. Besides, Newsfire is a newsreader that’s all stored locally on my computer, and my ideal solution would be cloud-based.

I looked around at the currently-available web offerings, and I couldn’t find any that were very appealing. I had nearly despaired when I found an open-source, web-based solution.

This won’t work for everyone, but it will work for anyone who already has access to a web server with the following capabilities:

  • Apache
  • MySQL
  • PHP
  • Cron jobs

I installed a copy of the open-source RSS reader, selfoss on my web server, and I have been using it instead of Google Reader. I’m pretty happy with it. I’ve had to make a few changes already, but it seems like a good solution to the problem. Here are the advantages, as I see it:

  • Web-based, so it will work on all my devices
  • It’s hosted on my own server, so it will work as long as I keep paying my hosting bill
  • The software won’t be “updated” (read: altered arbitrarily) unless I want it to be
  • No one will decide later that there needs to be ads on my news reader

Good luck in finding a solution to your Google Reader problem!

Borrowing e-books from the library

I’m currently reading The Handmaid’s Tale by Margaret Atwood. I borrowed the e-book from the Québec National Library. Just the process of borrowing an e-book has been fascinating. When an e-book is borrowed from the library, it is no longer available for other users to borrow, because the library uses a particular kind of DRM software.

This is interesting to me because traditional borrowing of library books had the “scarcity” of the books (and thus the protection of the author/publisher’s rights) built-in to the “hardware” itself. That is to say, by the nature of the physical book itself, two people could not be borrowing it from the library at the same time.

This is manifestly not true of digital materials. Much to the chagrin of publishers of all types, it’s difficult to stop people from sharing media if it’s digital, and in fact it takes a good deal of effort to stop people from doing so, while still allowing for legitimate uses of the media in question.

I’m 67% of the way through, and I’ve come across a couple typos. Nothing major—nothing that changes the content of the book, or even makes it much more difficult to read. I don’t know why, but I can’t resist keeping a record of when I find typos.

  • “It isn’t the sort ofthing you ask questions about …” p. 29
  • “I press my hands against the sides of my thighs, breath in, set out along the hall …” p. 142

Maybe I’m reading too much between the lines here, but when I saw these typos, I started thinking about maps. Stay with me, here. I don’t know if it’s actually true, but it used to be said that map-makers would put fake streets—small ones that no one would notice—into their maps, so that if someone copied their work, they would know that it was copied.

I’m sure it’s possible to find software that will strip an e-book of its DRM, and so I wonder if these typos are like that—little “fake streets” that the publisher has inserted into the e-book, so that if it’s copied, they’ll know. If they were sophisticated about it, they could probably even make up a way of encoding which library and even which user stripped the DRM by inserting particular “typos” into the borrowed e-book.

So here’s my question for all you Margaret Atwood fans out there: Does anyone have a physical copy of The Handmaid’s Tale? If you do, can you tell me if the typos are there in your copy? Also, does anyone else feel like borrowing the e-book from the library to see if the typos are there (or in the same place)?

Side-note: How long before we drop the hyphen from “e-book” and “e-reader” the way we dropped the hyphen from e-mail?