Tweet Dump part 3

Welcome back to another, and probably the last, instalment of the tweet dump project.

The old tweet dump code was done locally. I wanted to see how things would be affected if I ran the tweet dump code to a remote server. Just to get an idea on how travel would affect latency.

So I changed things up a bit, I created a sql database through my hosting provider and imported the schema into the remote database. At which point I also modified up the tweet dump code to use this new remote database. And after running the python script 5 times, here are my numbers:

Run 1 43.998
Run 2 46.352
Run 3 45.029
Run 4 55.024
Run 5 49.174
Average 47.92

In my tweetdump code I have a lot of going back and forth between the server and client;

  1. def getId(in_id):
  2. sql = "select id from ids where twit_id = '" + in_id + "'"
  3. return runQuery(sql)
  4.  
  5. def addId(in_id):
  6. sql= "insert into ids values (null, '" + in_id + "')"
  7. runQuery(sql)
  8.  
  9. def addTweet(in_id, in_tweet):
  10. sql = "insert into tweets values ('" + in_id + "',\"" + in_tweet + "\")"
  11. runQuery(sql)
  12. ...
  13. site_id = getId(ids[j])
  14. if( not site_id):
  15. addId(ids[j])
  16. site_id = getId(ids[j])

When you see the three functions making up the code block near the bottom, you might realize that there are three seperate sql calls to the server. And while this might be fine for a local database, this is not good for a remote one. So I decided to try my hand at creating a sql function to reduce some of the back and forth between the two entities.

  1. delimiter //
  2.  
  3. DROP FUNCTION IF EXISTS `get_set_ids`//
  4. CREATE FUNCTION get_set_ids( in_tweet_id BIGINT )
  5. RETURNS int(10) UNSIGNED
  6. BEGIN
  7. DECLARE new_id int(10) UNSIGNED;
  8. SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;
  9. IF ISNULL(new_id) THEN
  10. INSERT INTO ids VALUES (NULL, in_tweet_id);
  11. SELECT id INTO new_id FROM ids WHERE twit_id = in_tweet_id;
  12. END IF;
  13. RETURN new_id;
  14. END //
  15.  
  16. delimiter ;

The function above takes the guess work away from the client, and keeps it within the server. By doing this we avoid and entire round of communication between the client & server. So in theory we should see at least some kind of speed-up.

With the function above created I then went and modified the script to take advantage of the function:

  1. def get_set_id(in_id):
  2. sql = "select get_set_ids(" + in_id + ")"
  3. return runQuery(sql)
  4. ...
  5. site_id = get_set_id(ids[j])
  6. addTweet(str(site_id[0][0]), final_texts[j])

After a bit of testing to make sure things worked, I ran five timed tests (on the same hardware and from the same location, to try and reduce any variables that might crop up).

Run 1 25.948
Run 2 24.35
Run 3 26.181
Run 4 24.667
Run 5 25.352
Average 25.3

The difference between the two averages is about 22 seconds. And to be honest, I did not expect these changes to cut down my times by about half. This was a little bit of a shock to me. I guess in this end this is an example of how design can really matter.

TPS/ Status Report

Sorry, today's post will not contain anything technical. The past two weeks I've been busy studying for my last finals, of which I passed. :) As well as packing and moving from Portland to San Francisco. Because of all the craziness that life has bestowed upon me recently I won't be able to post anything geeky this week. But there is a silver lining to this, I'll post something worth while next week.

Also, I've decided to change my format slightly, I will be posting twice a month, instead of weekly, from now on. I'm doing this in the hopes in reducing the timeline of posting, that I can produce technical posts of higher quality, which I'm sure everyone would appreciate.

See you all next week.

Using curl and a user agent string for web scraping pt 2; Now with PHP

First off, Happy Thanksgiving to everyone, US readers especially. I know I should have written this sooner, but the holiday kept me pretty busy this year. And I hope you can accept my late well wishes.

And after you've had your fill of turkey, stuffing, and some pumpkin pie maybe you got the chance to play around with what I posted last week. And now lets build from that. Let's change the situation around a little bit; and say that a command line utility isn't your thing for this project. And you would like to use PHP.

First off, you'll want to make sure that your web server has curl compiled in. And the easiest way is to use phpinfo() to find it, and you'll want to look for this:

If you see something similar to what I have above, your good. Otherwise you'll either need to recompile your php to include libcurl, or complain to your hosting provider. Afterwards we can accomplish what we want with a couple lines of code:

<?php
 $curl_options 
= array(
    
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4) Gecko/20091030 Firefox/3.5.4",
    
CURLOPT_RETURNTRANSFER => 1
 
);
 
 
$url"http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher";

 
$ch curl_init($url);
 
curl_setopt_array($ch,$curl_options);
 
$content curl_exec($ch);
 
curl_close($ch);
?>

And now in your PHP code, you'll have a variable called $content that will contain all of the html code from the website. And how you wanna parse information outside of that is up to you. Of course if anyone has questions I will be happy to help out as best as I can.

Using curl and a user agent string for web scraping

According to Wikipedia web scraping is "is a computer software technique of extracting information from websites." So let's say you wanted to write a script to grab the number of trophies from Sony's PS3 system. Web scraping might be a technique you might want to consider. First move would be to use a web browser to get the URL to eventually scrape:

"http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher" (yes, that is my ps3 user id and yes those are my stats)

Next step one would probably use a text browser to facilitate the web scraping, like wget. So next step would be to put these two together and see what happens:

  1. wget http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher

But what is this:

  1. wget <a href="http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  2. --2009-11-17" title="http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  3. --2009-11-17">http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  4. --2...</a> 19:52:56-- <a href="http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  5. Resolving" title="http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  6. Resolving">http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher
  7. Res...</a> profiles.us.playstation.com... 174.129.213.213
  8. Connecting to profiles.us.playstation.com|174.129.213.213|:80... connected.
  9. HTTP request sent, awaiting response... 403 Forbidden
  10. 2009-11-17 19:52:56 ERROR 403: Forbidden.

Let's see if this is what happens when we use curl (another website grabber):
 curl http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher

Now the response is even different than when using wget, but the most important part is in the middle of the response ( shortened for readability):

  1. <h2 align="center">Forbidden</h2>
  2. <h4 align="center">Access is forbidden to this URL</h4>

So, now we see 3 inconsistencies here. With a "normal" web browser, things work fine. Wget is down right rejected, and curl is reject but a little more politely. At this point it looks like our web scraping project would seem lost, but I don't give up that quickly. One thing these three things have in common is a user agent. So if we can tell the website we are a different user agent than what we really are, things should work. Luckily enough for curl has this ability. And after a little research we can now try new command to see if it works:

curl -A 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4) Gecko/20091030 Gentoo Firefox/3.5.4' http://profiles.us.playstation.com/playstation/psn/profiles/L_Cypher

And VIOLA, we now get access to the site as if we were using firefox. (html code not included for readability. If you wanna see the code you can run it yourself.)

This is a good idea to keep in your mental toolbox, you never know when you might need a trick like this in the future.

Update (4/18/2010): A nice guy named John over at ps3inside.de emailed me and told me that the link above no longer works. I tested it myself and John is correct. It turns out that Sony changed things up a little bit and now the trophies page is all done in javascript. Being that javascript isn't html the code above will not work.

Syndicate content