I wanted to use a scripting language and decided to give Ruby and Groovy a try.
In Ruby there is the Mechanize library. In Groovy there are different options.
The Ruby Mechanize library seems very intuitive:
    require 'rubygems'
    require 'mechanize'

    a = WWW::Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac Safari'
    }

    a.get('http://google.com/') do |page|
      search_result = page.form_with(:name => 'f') do |search|
        search.q = 'Hello world'
      end.submit

      search_result.links.each do |link|
        puts link.text
      end
    end
I like the DSLish way to both scrape (e.g. search_result.links.each) and manipulate (e.g. search.q = 'Hello world') a web page.
In Groovy scraping is also pretty DSLish:
    def page = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parse('http://groovy.codehaus.org/')
    def data = page.depthFirst().grep{ it.name() == 'A' && it.@href.toString().endsWith('.html') }.'@href'
    data.each { println it }
Manipulating a web page with Groovy, unfortunately, is clumsier:
    import com.gargoylesoftware.htmlunit.WebClient

    def webClient = new WebClient()
    def page = webClient.getPage('http://www.google.com')

    // check page title
    assert 'Google' == page.titleText

    // fill in form and submit it
    def form = page.getFormByName('f')
    def field = form.getInputByName('q')
    field.setValueAttribute('Groovy')
    def button = form.getInputByName('btnG')
    def result = button.click()

    // check groovy home page appears in list (assumes it's on page 1)
    assert result.anchors.any{ a -> a.hrefAttribute == 'http://groovy.codehaus.org/' }
Check out the 'Updating XML with XmlSlurper' article on codehaus.org:
http://groovy.codehaus.org/Updating+XML+with+XmlSlurper
or http://tinyurl.com/bn8fw3
I don't know how current it is, but it seems a little more DSL-like than the example you've shown.
@Kevin Williams
Thanks for the link to 'Updating XML with XmlSlurper'.
But as far as I can see, the scenario is not applicable for web-automation.
XmlSlurper seems to allow me to modify the (in-memory) text-representation of a web-page. But it does not let me manipulate the web-page itself...
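To illustrate the difference, here is a rough sketch in plain Ruby (standard library only, no Mechanize). The `form` hash and the `submission_uri` helper are hypothetical stand-ins for whatever a parser would hand back, and the `/search` action URL is an assumption; only the field name `q` comes from the examples above. Changing the value in memory has no effect on the site until it is turned into a request and sent.

```ruby
require 'uri'

# Hypothetical in-memory picture of a search form, as a parser might expose it.
# The action URL is assumed for illustration.
form = {
  action: 'http://www.google.com/search',
  fields: { 'q' => '' }
}

# Step 1: edit the in-memory copy. Nothing has reached the server yet.
form[:fields]['q'] = 'Hello world'

# Step 2: web automation also has to turn that state into a request.
# This hypothetical helper builds the GET URL a form submit would produce.
def submission_uri(form)
  uri = URI(form[:action])
  uri.query = URI.encode_www_form(form[:fields])
  uri
end

puts submission_uri(form)
```

As far as I can tell, an XML slurper only covers step 1, while a browser-like library such as Mechanize or HtmlUnit also performs step 2 and actually sends the request.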