Building A Twitter Bot | Work In Progress

"Here's a novel idea, what if there was a twitter bot that used some basic machine learning processes to emulate being a real human. No one has ever done this before. Maybe. Well maybe a few times.  Okay, it has been a thing for a while. It may not be 100% an original idea, but it's definitely something that I haven't done before and it seemed like a fun little project.

I know I missed the boat on all the "I forced a bot to watch 1000 hours of _____" memes, which, by the way, were most likely not done without human intervention; Janelle Shane did a wonderful breakdown of those. However, while reading the tweets from @IvePetThatDog, I realized that they follow a pretty standard format, which means they could probably be generated fairly easily.

At first I wanted to use a recurrent neural network to generate the tweets, and I still might. But I had never used TensorFlow, and the machine I was working on for this project is not the most powerful. I realized that I could use Markov chains instead. A Markov chain models the connections between the various words in a vocabulary, which is determined by the source text. So all I would have to do is get the text, load it into my RStudio environment, run the Markov chain generator, and send the results to Twitter. Easy, right? Well, it's a simple process on paper, so I decided to do it in an afternoon. That afternoon hasn't quite ended yet.
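As a toy illustration (with invented words and probabilities, not real numbers from my data), a first-order Markov chain just stores, for each word, the probability of each word that can follow it, and generation is repeated sampling of the next word:

    # Toy example with made-up probabilities: each word maps to the
    # probabilities of the words that can follow it.
    transitions <- list(
      "I"    = c("pet"   = 0.9, "rate" = 0.1),
      "pet"  = c("name1" = 1.0),
      "This" = c("is"    = 1.0),
      "is"   = c("name1" = 0.8, "a"    = 0.2)
    )

    # Generating text is just repeatedly sampling the next word.
    next_word <- function(word) {
      probs <- transitions[[word]]
      sample(names(probs), 1, prob = probs)
    }

    next_word("I")  # "pet" about 90% of the time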


Rushing In


I knew Twitter had an API, so I figured I could get a developer account, set up a few automated calls, and pull the tweets. Except I knew nothing about the API or the rules that came with it. From what I can tell, I can't save or distribute the text from Twitter, just links to the tweets, which is a little disappointing, but you gotta follow the rules. The Twitter API is fairly straightforward; however, there are limits on how many tweets you can pull and how far back you can go. You can only pull up to 3,200 tweets per user, and on top of that there is apparently a limit on how far back you can reach (I'm still not clear on that limit).
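I won't name every tool here, but as one hedged sketch, the rtweet package in R can pull a user's timeline up to that 3,200-tweet cap, assuming a developer app and its tokens are already configured:

    library(rtweet)

    # Pull up to the API's 3,200-tweet-per-user cap for one account.
    # Assumes rtweet authentication is already set up for a developer app.
    dog_tweets <- get_timeline("IvePetThatDog", n = 3200)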

After doing some testing with the API, I found that I've Pet That Dog wasn't going to get me nearly enough tweets to do anything super interesting. Many of their tweets were replies, retweets, or sponsored posts. I was also only able to retrieve tweets back through April of the previous year. I decided to look for more dog Twitter accounts, found @dog_rates, and pulled their tweets as well. I now had 824 training tweets to use.
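A rough first-pass filter for the replies and retweets can work off the raw text itself (the text column name below is an assumption, and retweets conventionally start with "RT @"):

    # Drop retweets ("RT @...") and replies (text starting with "@").
    # Sponsored posts would still need to be weeded out by hand.
    keepers <- dog_tweets[!grepl("^RT @", dog_tweets$text) &
                          !grepl("^@",    dog_tweets$text), ]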

I then took all those tweets, threw them into a simple Markov chain generator, and got a lot of totally incomprehensible text back out. Reading it, I realized that a lot of this was because of the sheer number of names. I was maybe going to have to plan this out a little more than I had. This was a slightly more involved project than I had anticipated, and I hadn't even figured out how to automate tweeting without a cloud service yet.


The Plan

Since rushing in wasn't working, I decided to step back and figure out what my goals were. My primary goal was to generate the text: to emulate a dog Twitter account's tweets. I didn't care as much about pulling all the information automatically, since I could do that myself on a schedule. My smaller goals were to automate the various steps.

So I looked at the garbage I had gotten out of the Markov models. As mentioned earlier, a lot of the issues seemed to stem from the number of different names. Because Markov chains work off the probability of connections between words, and my sample size was limited, I realized I would probably get better results if I removed the names and put in some placeholder text. I figured I could accomplish this easily enough. I also hadn't realized that any tweet with media in it comes back with a URL in the text. These random assortments of letters were causing many issues.
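The URL part of that cleanup is mechanical. Here's a sketch of the idea in R; the t.co pattern covers the links Twitter appends for media, and the name substitution shown is just one made-up example:

    tweet <- "I pet Rufus. He was a very good boy. https://t.co/AbCd123"

    # Strip the t.co links Twitter appends for media...
    tweet <- gsub("https?://t\\.co/\\S+", "", tweet)

    # ...and swap a known name for a placeholder token.
    tweet <- gsub("Rufus", "name1", tweet)

    tweet  # "I pet name1. He was a very good boy. "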

Then, once I had the data cleaned, I could see how the text generation looked. (Spoiler alert: it was a lot better.) Seeing this, I realized I would now need to insert names again. I could have done this in R, but I decided to use Excel instead, since I was storing all the tweets there and posting them myself using TweetDeck to schedule them out.
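For completeness, that reinsertion could also be sketched in R (the name pool below is made up for illustration):

    # Hypothetical name pool; the real sheet draws from a list of random names.
    names_pool <- c("Biscuit", "Maple", "Zeke")

    generated <- "I pet name1. name1 is a very good dog."
    gsub("name1", sample(names_pool, 1), generated)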

I finally had a sort of plan coming together. 
  1. Get the initial set of tweets.
  2. Remove any links from the tweets I was using.
  3. Remove the names, replies, and posts that didn't directly comment on various dogs.
  4. Generate an initial batch of potential tweets and select some to send out.
  5. Reinsert names into the tweets.
  6. Schedule the tweets via TweetDeck.
  7. Pull new tweets and clean that data.
  8. Generate new tweets.
  9. Reinsert names in the new tweets.
  10. Repeat steps 7-9 ad nauseam.

Data Cleaning

This is probably the most important step in any data project. Cleaning and preparing data is crucial to getting results worth sharing. It is also probably the most tedious part of the project. At first I thought I could be smart and just use Excel's Text to Columns to separate the names from the tweets. After all, with I've Pet That Dog, most tweets begin with "I pet name," and We Rate Dogs usually starts with "This is name." However, there was a lot more variation than I initially imagined. This method was generally successful, but not to the point that I could run it without checking. Additionally, some tweets mention multiple dogs, which meant I had to figure out how to detect that. Because automating this wasn't a main goal, I decided to just do it manually. After all, once I had the main set of 824 done, I could handle the new tweets each week (usually fewer than 30-40) fairly easily. I'm not 100% satisfied with this solution, but it has worked for the moment. The biggest pitfall came when I found a new Twitter account to sample, @thedogist, and pulled just under 3,000 new tweets to clean. This is a slow solution, but I haven't gotten around to improving it.
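For what it's worth, that "starts with a stock phrase" heuristic could be sketched in R with a regex. This is a hypothetical sketch, not what I actually run, and it only handles the simplest cases before a manual review pass:

    # Pull the first capitalized word after a known opener.
    extract_name <- function(tweet) {
      m <- regmatches(tweet, regexec("^(I pet|This is) ([A-Z][a-z]+)", tweet))[[1]]
      if (length(m) == 3) m[3] else NA_character_
    }

    extract_name("I pet Rufus today.")   # "Rufus"
    extract_name("Look at this pupper")  # NA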

My process is fairly simple:

  1. Pull the tweets in R and remove any URLs.
  2. Save them in a CSV.
  3. Paste the tweets into a master list in one column (I keep other information in this sheet, like tweet ID, date tweeted, etc.).
  4. In the next column, I put the first name that appears in the tweet.
  5. In the third column over, I put the second name, if there is one.
  6. If there are multiple names beyond that, I generally remove the tweet from the set.
  7. I then use SUBSTITUTE in Excel to replace the names in the tweet with name1 and name2, respectively. (I could probably do the substitution in R, but as I'm doing the name labeling manually in Excel, why not just do it there?)
  8. I then load the cleaned tweets into R for use in the Markov chains.
Easy, right? Not at all time-consuming. This is my next target for improvement.

Tweet Generating

In R, I'm using the markovchain package to generate the Markov chains and the tidytext package to remove punctuation and unlist all the words. I then generate the tweets, making about 1,000 at a time with random sentence lengths of 5 to 42 words, paste the generated tweets into a text file, and select some for tweeting. I was just storing these in an Excel document until recently.
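A condensed sketch of that pipeline, assuming the cleaned tweets are already sitting in a character vector called cleaned_tweets:

    library(markovchain)
    library(tidytext)
    library(dplyr)

    # Tokenize the cleaned tweets into one long word sequence
    # (unnest_tokens also lowercases and strips punctuation).
    word_seq <- tibble(text = cleaned_tweets) %>%
      unnest_tokens(word, text)

    # Fit a first-order Markov chain over word-to-word transitions.
    fit <- markovchainFit(word_seq$word)

    # Generate one candidate tweet with a random length of 5 to 42 words.
    len <- sample(5:42, 1)
    paste(markovchainSequence(n = len, markovchain = fit$estimate),
          collapse = " ")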



Scheduling the Tweets

My initial process for this was very manual. I had an Excel document with all the generated tweets:

  1. I would get a list of random names.
  2. Re-add the names with SUBSTITUTE in Excel.
  3. Go through and adjust punctuation so the tweets looked more normal.
  4. Upload a picture of a dog from picturesforclass.com to TweetDeck and schedule the tweet.
This was exciting at first, but after 125 or so tweets it got tedious. I knew the Twitter API would allow me to tweet, but my issues were scheduling and not having to upload an image by hand every time.
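For context, the posting step itself is simple from R; it's the twice-a-day trigger that needs somewhere to live. A minimal sketch with rtweet (authentication assumed) shows why this alone wasn't enough:

    library(rtweet)

    # Posting is the easy part; this still needs an external scheduler
    # (cron, Task Scheduler, or a cloud service) to fire twice a day.
    # post_tweet() can also attach an image via its media argument
    # (newer rtweet versions ask for alt text too).
    post_tweet(status = "name1 is a very good dog.")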

I found a better way, however. I discovered that you can use Google Apps Script (basically JavaScript) to access the Twitter API and post tweets based on information in Google Sheets. The main issue was that I don't know JavaScript. To fix this, I spent a day learning how Google Apps Script worked and made a rudimentary script that triggers twice a day, at around 11:45 am and 12:45 pm, and tweets from a list in Google Sheets. I made a table with 353 different picture URLs that I scraped from picturesforclass.com by writing a script to look at the HTML and pull the image links. I also added two columns of names to randomly select from. I then randomly assign these using the formula =VLOOKUP(ROUND(RAND()*353), RNG!A:D, 3, FALSE), which automates assigning the pictures and names, and I compile the tweets using SUBSTITUTE to put the random names in. Then I load in all the tweets I want it to send for the next little while and can mostly ignore the sheet until I generate new tweets.
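I won't reproduce the scraping script here, but in R the rvest package could handle that step, assuming the page serves plain <img> tags (the URL below is a hypothetical search page, and the real site structure may need a different selector):

    library(rvest)

    # Hypothetical sketch: grab every image source on a results page.
    page <- read_html("https://www.picturesforclass.com/search/dog")
    img_urls <- page %>%
      html_elements("img") %>%
      html_attr("src")

    head(img_urls)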


Summary

Overall, I think this has been a good ongoing project. As I realize something is not as easy as it could be, I simply find a way to automate that part of the job. My next target is name identification; I'm learning how to do this, potentially with NLTK. If I can get that automated, I can move a lot more of this project to being fully automated, which would be ideal. I also want to try some other text-generating models, like an RNN, to see if I can get better tweets. Until then, feel free to follow my bot @dogsMarkov.



