Copy files from one S3 bucket to another

I have an S3 bucket that contains several hundred files in a folder, and I needed to copy those files into a different folder in another bucket. It sounds simple enough, but I was unable to find a simple way to do this through the AWS Console. I found a number of Stack Overflow answers that talked about using sync, or downloading the files and re-uploading them, none of which sounded particularly appealing.

In the end I just wrote this bash one liner (which I can probably optimise further by not repeating the sourcebucket / sourcefolder three times):

This just uses s3cmd to list all the files in the bucket/folder I wish to copy from. The output of that is piped to awk, which I use to extract the S3 URL of each file. I then use tail to remove the first line, which I don’t need. Finally, I use sed to build up an ‘s3cmd cp’ command which copies each file from its original location to my new location.
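The steps described above can be sketched in JavaScript. This is not the original one-liner, only a rough illustration of the same list, extract, skip and rewrite steps. The function name, the bucket and folder names and the sample listing are all hypothetical, and the sketch assumes the S3 URL is the last whitespace-separated field on each listing line (so keys containing spaces would not survive it).

```javascript
// Sketch: turn the text output of an `s3cmd ls` listing into `s3cmd cp` commands.
// All bucket/folder names here are hypothetical placeholders.
function buildCopyCommands(lsOutput, sourcePrefix, destPrefix) {
  return lsOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    // The awk step: keep only the last field, assumed to be the S3 URL.
    .map((line) => line.split(/\s+/).pop())
    // The tail step: drop the first entry, which is not needed.
    .slice(1)
    // The sed step: rewrite the source prefix into the destination prefix.
    .filter((url) => url.startsWith(sourcePrefix))
    .map((url) => `s3cmd cp ${url} ${destPrefix}${url.slice(sourcePrefix.length)}`);
}

// Hypothetical listing, shaped like date, time, size, URL.
const sampleListing = [
  "2012-05-01 10:00         0   s3://sourcebucket/sourcefolder/",
  "2012-05-01 10:01      1024   s3://sourcebucket/sourcefolder/a.txt",
  "2012-05-01 10:02      2048   s3://sourcebucket/sourcefolder/b.txt",
].join("\n");

const commands = buildCopyCommands(
  sampleListing,
  "s3://sourcebucket/sourcefolder/",
  "s3://destbucket/destfolder/"
);
console.log(commands.join("\n"));
```

The generated lines could then be reviewed before being run in a shell, which is a little safer than piping them straight into execution.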

If anyone can suggest a better way that doesn’t require me to download the source files … I’d love to hear it.
If you can’t see the embedded Gist above then you can view it here.

nodejs + express with jsonp example

I’ve been working more and more with nodejs and have to say I am really loving how easy it’s been to get to grips with. I’ll be posting more about how I’m using node, and the problems I’m using it to solve, over the coming weeks. However, I wanted to illustrate just how simple it is to do something that would be more work in other languages such as PHP.

So here’s a quick example I created that illustrates how to make a simple JSONP call to a nodejs + express server.

Here’s the server side code:

and here’s the client side Html code:

To enable jsonp callback support in an ExpressJS application you just have to include the line:

app.enable("jsonp callback");

Once you’ve done this you can use the .json() method on the Response object to handle everything for you. So in my example above any HTTP GET request to /foo?cb=myfunction would return myfunction("hello world"); with the Content-Type header set to text/javascript.
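To make the mechanics concrete, here is a minimal sketch of the same JSONP behaviour using only Node’s built-in http module. It is not the Express code from this post and not how Express implements it; the helper names jsonpBody and handleRequest, and the callback-name check, are my own assumptions for illustration. It simply wraps the JSON-encoded value in a call to the function named by the cb query parameter.

```javascript
const http = require("http");

// Build a JSONP body: the JSON-encoded data wrapped in a call to the
// callback function named by the client.
function jsonpBody(callbackName, data) {
  // Assumption: only accept identifier-like names, so the query string
  // cannot inject arbitrary script into the response.
  if (!/^[A-Za-z_$][\w$.]*$/.test(callbackName)) {
    throw new Error("invalid callback name");
  }
  return `${callbackName}(${JSON.stringify(data)});`;
}

// Hypothetical handler for GET /foo?cb=<name>
function handleRequest(req, res) {
  const url = new URL(req.url, "http://localhost");
  const callbackName = url.searchParams.get("cb");
  if (url.pathname !== "/foo" || !callbackName) {
    res.writeHead(404);
    res.end();
    return;
  }
  try {
    const body = jsonpBody(callbackName, "hello world");
    res.writeHead(200, { "Content-Type": "text/javascript" });
    res.end(body);
  } catch (err) {
    // Reject callback names that fail the check above.
    res.writeHead(400);
    res.end();
  }
}

// The server is created but not started here; call server.listen(3000)
// (the port is an arbitrary choice) and request /foo?cb=myfunction to try it.
const server = http.createServer(handleRequest);
```

The client side would then load that URL through a script tag and define a global function with the matching name to receive the data.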

To Smarty, or not to Smarty?

At Talis we have been doing more than our fair share of PHP development; in fact we are using the language to implement a number of new prototypes, products and services. One of the questions that I’ve been struggling with recently is whether or not we should be using the Smarty template engine. Whilst we have used it with a measure of success on several projects, I’ve been struggling with what Smarty represents. Smarty is written in PHP and allows you to create templates that are marked up using Smarty’s own language syntax; however, this means that templates have to be compiled at runtime down to pure PHP before pages can be served. What strikes me is that Smarty is a template engine written in a language that was designed primarily for templating. So the question I’m left struggling with is: why would you want to add the overhead of a secondary language to do what your first language already does and is fully capable of doing?

The more I think about it, the more I’m convinced that the answer is that you don’t! I’m going to try to explain why by examining some of the reasons that are often cited for using Smarty.

Separating PHP from HTML

I’ve often heard this mantra, but I would describe it in a slightly different way: what’s actually desired is to keep your “business logic” separate from your “presentation logic”, which is subtly different. If you accept that the two are indeed different then you should realise that all that’s actually required is a way to keep your Views and Controllers separated, which can be achieved using plain PHP in your view scripts (templates), without the overhead of adding a new language into the mix.

Easier to read

I’ve always struggled with this and here’s why … does anyone really think that this

  {section name=aValue loop=$list}
  Value is: {$list[aValue]}
  {/section}

…is really easier to read than its pure PHP equivalent here …

  <? foreach ( $list as $aValue ) { ?>
  Value is: <?=$aValue?>
  <? } ?>

… I honestly don’t think that it is. I know there’s a view in the community that there are front-end-only developers or web designers out there who shouldn’t have to learn PHP, but are they really better off being taught a different language? Firstly, I’d question whether there are actually developers out there who know Smarty and don’t have an understanding of PHP. But for the purposes of this exercise let’s assume that these mythical Smarty-only developers do exist; then I have to confess I agree entirely with Harry Fuecks [1] when he says:

Walk them through a few sections of the PHP Language Reference
and show them how PHP‘s control structures and simple functions like
echo() works and that’s all they need.

and also with Brian Lozier [2], who describes it much more convincingly when he says:

Basically, it just provides an interface to PHP with new syntax. When 
stated like that, it seems sort of silly. Is it actually more simple to 
write {foreach --args} than <? foreach --args ?>? If you do think it's
simpler, consider this. Is it so much simpler that there is value in 
including a huge template library to get that separation? Granted, 
Smarty offers many other great features (caching, for instance), but it 
seems like the same benefits could be gained without the huge 
overhead of including the Smarty class libraries.

Speed

I’ve read several articles online that try to explain that there is a performance overhead when including Smarty. Yes, I know you can cache your Smarty templates to improve performance; in fact it’s not a question of ‘can’, it’s a question of ‘you really have to’, as Harry points out in the same article:

if you hadn’t chosen to write your own programming language
and built an interpreter with PHP to parse it, you wouldn’t have slowed
down your application in the first place!

I performed a little test to try to illustrate what happens when you simply include Smarty, let alone actually use it. First off, create yourself a file called test.php and add the following line of code to it:

  <?php echo "Hello world" ?>

Now let’s benchmark this using Apache Benchmark, thanks to Vivek for providing some useful documentation on how to do this [3]. Open up a terminal window and issue the following command:

  ab -c 10 -n 100 http://localhost/test.php

Which on my system (Mac OS X 10.5.2, 4 GB RAM, PHP 5 + Apache 2) resulted in the following:

nadeemshabir$ ab -c 10 -n 100 http://localhost/test.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        Apache/2.2.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /test.php
Document Length:        11 bytes

Concurrency Level:      10
Time taken for tests:   0.58753 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      19695 bytes
HTML transferred:       1111 bytes
Requests per second:    1702.04 [#/sec] (mean)
Time per request:       5.875 [ms] (mean)
Time per request:       0.588 [ms] (mean, across all concurrent requests)
Transfer rate:          323.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0       2
Processing:     2    4   5.4      4      34
Waiting:        1    4   5.5      3      33
Total:          2    5   5.3      4      34

Percentage of the requests served within a certain time (ms)
  50%      4
  66%      4
  75%      5
  80%      6
  90%      8
  95%     14
  98%     27
  99%     34
 100%     34 (longest request)
Nadeems-Computer:smarty-test nadeemshabir$

So the important figures are Requests per second: 1702.04, Time per request: 5.875 [ms] (mean), Time per request: 0.588 [ms] (mean, across all concurrent requests) and Transfer rate: 323.39 [Kbytes/sec] received. Now let’s create a new test file called test-smarty.php, and add the following to it:

  <? include_once( 'smarty/libs/Smarty.class.php' ); ?>
  <?php echo "Hello world" ?>

and run the same test again …

  ab -c 10 -n 100 http://localhost/test-smarty.php

which results in the following:

nadeemshabir$ ab -c 10 -n 100 http://localhost/test-smarty.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        Apache/2.2.9
Server Hostname:        localhost
Server Port:            80

Document Path:          /test-smarty.php
Document Length:        11 bytes

Concurrency Level:      10
Time taken for tests:   0.310262 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      19500 bytes
HTML transferred:       1100 bytes
Requests per second:    322.31 [#/sec] (mean)
Time per request:       31.026 [ms] (mean)
Time per request:       3.103 [ms] (mean, across all concurrent requests)
Transfer rate:          61.24 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   4.7      0      17
Processing:     9   26  14.3     22      84
Waiting:        8   25  14.1     22      82
Total:         13   28  14.7     24      87

Percentage of the requests served within a certain time (ms)
  50%     24
  66%     28
  75%     35
  80%     37
  90%     46
  95%     70
  98%     84
  99%     87
 100%     87 (longest request)
Nadeems-Computer:smarty-test nadeemshabir$

Now we can see a dramatic difference. Those important figures are now Requests per second: 322.31, Time per request: 31.026 [ms] (mean), Time per request: 3.103 [ms] (mean, across all concurrent requests) and Transfer rate: 61.24 [Kbytes/sec] received. Taking just the requests per second, we are now serving only about a fifth of the requests we originally served (322.31 against 1702.04), and that’s without even using Smarty; that’s from simply including the library!
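For anyone who wants to check the arithmetic, the ratio between the two runs can be worked out directly from the requests-per-second figures reported by ab above. This is just a small sketch; the variable names are my own.

```javascript
// Requests per second reported by ab for each test file.
const plainRps = 1702.04;  // test.php
const smartyRps = 322.31;  // test-smarty.php

// Fraction of the original throughput that remains, and the slowdown factor.
const fraction = smartyRps / plainRps;
const slowdown = plainRps / smartyRps;

console.log(
  `${(fraction * 100).toFixed(1)}% of the baseline throughput, ` +
  `${slowdown.toFixed(1)}x slower`
);
```

The mean time per request tells the same story: 31.026 ms against 5.875 ms is the same factor of a little over five. Bear in mind this is a single run of 100 requests on one machine, so the exact factor should be treated as indicative.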

Conclusions

To my mind those results and the arguments I’ve cited are pretty emphatic: I can’t justify using a template engine that adds so much overhead when you can achieve the same results using a pure PHP implementation. This view seems to be widely held and accepted; it’s part of the reason why many other template engines like Savant, Zend_View and Solar_View all embrace a different ethos to Smarty, i.e. they don’t compile your templates into PHP because they use PHP as the template language.

If you are at all unconvinced by the arguments I’ve presented then, like me, you might want to consider the words of Hasin Hayder, the author of the popular book Smarty PHP Template Programming and Applications, which I own a copy of :). He also created this Smarty Cheat Sheet, which I, along with several of my colleagues, have copies of on our desks. Earlier this year Hasin wrote an article entitled Once upon a time there was Smarty. In this article Hasin touches on many of the points that I and others have made, and whilst the depth of my knowledge of Smarty could easily be challenged, Hasin is without question an expert, so when he said …

I seriously don’t think there is need of Smarty anymore. Its dead! If 
you guys don’t agree with me, you can spend hell lot of time learning 
that {$name} actually does what you could do with “echo $name”. If 
you write a dispatcher which is smart enough to bring the variables 
from a controller scope to the scope of a view, why do u need to learn 
a separate parser like smarty? ... Learning all those functions, loops, 
logics, caching and others in smarty takes so much time that I would 
love to suggest you to learn writing efficient PHP code in template layer 
rather than learning something like smarty ...

… I stopped and listened … and then I found myself agreeing with his closing statement …

Sometime it’s better to realize thats its time to unlearn something.
  1. phpPatterns, http://www.phppatterns.com/docs/design/templates_and_template_engines
  2. Template Engines, http://www.massassi.com/php/articles/template_engines/
  3. How to performance benchmark a web server, http://www.cyberciti.biz/tips/howto-performance-benchmarks-a-web-server.html

WWW2008: Day 2 – LDOW2008 Workshop

We spent all of yesterday in the Linked Data on the Web Workshop. It was quite an intense day with 27 different presentations, most of which were paper presentations, in addition to a few demos. It was an excellent workshop, so full credit to everyone who helped organise the event.

The workshop began with some short introductions by Sir Tim Berners-Lee, Chris Bizer and our very own Tom Heath. Both Chris and Tom did a great job chairing the workshop during the day and deserve credit for their efforts. After the introductions we went straight into presentations. I won’t try to describe every talk because there were so many and all of them were very good. I just want to talk about some of the highlights for me during the workshop.

Linked data is the Semantic Web done as it should be, the Web
done as it should be.
       Sir Tim Berners-Lee

For me this single statement by Tim, as part of his introduction to the workshop, captures the importance of the whole Linked Data movement. The vision of the Semantic Web cannot come to fruition unless we have linked data, as Tim pointed out back in 2006:

The Semantic Web isn't just about putting data on the web. 
It is about making links, so that a person or machine can explore 
the web of data.  With linked data, when you have some of it, you 
can find other, related, data.

Unsurprisingly, every one of the presentations in the workshop aimed to describe technologies, processes, techniques and examples of linking data together semantically, to help make this vision a reality. There are many obstacles to being able to do this; some are technical but others are social and legal, and we need to understand them all. (You can view the workshop schedule here and download all of the papers.)

We Talisians did a couple of presentations during the workshop. I was originally supposed to present the Semantic Marc paper with Rob, but we only finished the slides the night before and decided it would be easier if he presented it without an interruption to change speakers. This proved to be the right decision since he did an excellent job, and we got some great feedback from many of the attendees.

Paul also did a presentation on Open Data Commons. His presentation was, to my mind, far more important, because I don’t believe the Linked Data community has fully understood why there is a need to license data. His presentation led to an interesting discussion, and I was surprised to see that there were some people who did not understand why this was such an important issue. From what I recall the crux of their argument was that we have thousands of mashups out there re-using and re-mixing data at the moment, so why do we need an Open Data Commons? RDF Book Mashup was cited as an example. What amused me was that it’s widely accepted that the RDF Book Mashup violates Amazon’s Terms of Use. Those arguing against the need for Open Data Commons were seemingly confusing that with the fact that so far Amazon hasn’t chosen to do anything about RDF Book Mashup. This misses a fundamental point: Amazon doesn’t necessarily care about what people do with mashups because these are not commercial products. If someone took RDF Book Mashup and used it to deliver a rival service to Amazon, I suspect that Amazon would act, and they would be well within their rights to do so. Open Data Commons provides protection for data providers by giving them a mechanism, like the various open source licenses did for the Open Source community, to state under what terms people can use their data. Reciprocally, it provides protection for those consuming the data, since we know the terms under which the data has been made available to us. The notion that, just because existing data providers haven’t sought remedial action against those who abuse their terms of service, we don’t have to worry about anything and don’t need the protection that Open Data Commons provides is naive at best; at worst it could cause the kind of damage that would make it very difficult to create this web of linked data.
I guess the Linked Data community needs to mature in the same way the Open Source and Creative Commons communities did.

One of the major themes that came across in many of the talks, and which was also a central theme in our paper, is how to handle disambiguation. There were a number of presentations that touched upon this issue, most memorably Alexandre Passant’s presentation on The Meaning of Tags and Afraz Jaffri’s presentation on URI Disambiguation.

I was also impressed by Jun Zhao’s presentation on Provenance and Linked Data in Biological Datawebs. I was fortunate enough to visit HP Labs last year and see Graham Klyne, one of her colleagues, present some of their work, and it’s great to see how well they are doing.

I was also impressed by some of the work that Christian Becker has been doing with Chris Bizer on DBpedia Mobile, a location-centric DBpedia client application that uses a really cool Fresnel-based Linked Data browser. Peter Coetzee’s work on SPARQPlug was also very impressive, and I’ve made a mental note to have a play with it as soon as I get back to the UK.

I could carry on and on but I think it’s sufficient to say it really was a wonderfully useful workshop, and I thoroughly recommend reading all of the papers that were presented.

Oh and I have just realised that Rob has posted up his thoughts about the workshop here.

Firefox Extension: TinyURL Creator

If you haven’t already got it installed I highly recommend the TinyURL Creator Firefox Extension. I’ve been using it more and more recently: simply browse to a page, right click and select ‘Create TinyURL for this Page’, and it generates the URL and places it on the clipboard ready for you to paste.

For those who don’t know what a TinyURL is: TinyURL is a service that takes a long URL as input and gives you a short URL to use in its place. For example, the TinyURL for http://www.virtualchaos.co.uk/blog/ is http://tinyurl.com/39sbbh. This comes in useful if you want to put a link into Twitter or an SMS message, where you have a limited number of characters to use. I’ve been using the service a lot, but it’s so much easier with the Firefox extension.

IE5 more compliant than IE7? on ACID3?

Steve Noonan maintains a page where he collects and publishes results for various browsers tested against the newly released ACID3 test. Unsurprisingly no browser currently scores 100%, and it’s not surprising to me that Internet Explorer is way down at the bottom of the list for compliance … but how the hell did IE5 score more than IE6 and IE7?

Automated Testing Patterns and Smells

Wonderful tech talk by Gerard Meszaros who is a consultant specialising in agile development processes. In this particular presentation Gerard describes a number of common problems encountered when writing and running automated unit and functional tests. He describes these problems as “test smells”, and talks about their root causes. He also suggests possible solutions which he expresses as design patterns for testing. While many of the practices he talks about are directly actionable by developers or testers, it’s important to realise that many also require action from a supportive manager and/or system architect in order to be really achievable.

We use many flavours of xUnit test frameworks in our development group at Talis, and we generally follow a Test First development approach. I found this talk beneficial because many of the issues that Gerard talks about are problems we have encountered, and I don’t doubt every development group out there, including ours, can benefit from the insights he provides.

The material he uses in his talk and many of the examples are from his book xUnit Test Patterns: Refactoring Test Code, which I’m certainly going to order.