require 'pupa'
require 'csv'
Download this example’s Ruby code to run locally.
The cat.rb, bill.rb and legislator.rb examples show you how to scrape and import data. This example shows you how to transform scraped data.
We’re going to scrape organizations and output them as CSV, which we can then upload to the Open Knowledge Foundation’s Public Bodies project.
class PublicBodyProcessor < Pupa::Processor
This transformation task will write a CSV row for each scraped organization. You can name transformation tasks whatever you like.
def csv
puts CSV.generate_line %w(
title
abbr
key
category
parent
parent_key
description
url
jurisdiction
jurisdiction_code
source
source_url
address
contact
email
tags
)
organizations is a lazy enumerator of all scraped organizations, so we’ll see a CSV row printed as soon as an organization is scraped.
organizations.each do |organization|
puts CSV.generate_line [
organization.name, # title
nil, # abbr
organization._id, # key
organization.classification, # category
nil, # parent
nil, # parent_key
nil, # description
nil, # url
'New Brunswick', # jurisdiction
'ocd-division/country:ca/province:nb', # jurisdiction_code
organization.sources[0][:note], # source
organization.sources[0][:url], # source_url
organization.contact_details.address, # address
organization.extras[:contact_point], # contact
organization.contact_details.email, # email
nil, # tags
]
end
end
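To illustrate what "lazy" means here, the following is a minimal sketch using only Ruby's standard library. It does not use Pupa.rb itself: fake_organizations is a hypothetical stand-in for the organizations enumerator, and the names it yields are invented sample values.

```ruby
require 'csv'

# Hypothetical stand-in for Pupa's organizations enumerator (an assumption
# for illustration). Each name is produced only when the consumer asks for it.
def fake_organizations(log)
  Enumerator.new do |yielder|
    %w(Finance Health).each do |name|
      log << "scraped #{name}"
      yielder << name
    end
  end.lazy
end

log = []
fake_organizations(log).each do |name|
  # This block runs as soon as one item is available,
  # before the next item is produced.
  log << "printed #{name}"
  puts CSV.generate_line([name])
end

# The log alternates between "scraped" and "printed" entries.
puts log
```

Because production and consumption alternate, a long scrape starts emitting rows right away instead of waiting for every organization to be collected first.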
To keep this example short, we’ll just scrape the departments and agencies of the Government of New Brunswick.
def scrape_organizations
url = 'http://www1.gnb.ca/cnb/DsS/display-e.asp?typyofPublicBodyID=1'
doc = get(url)
doc.xpath('//table[4]//table').each do |table|
organization = Pupa::Organization.new
organization.name = table.at_xpath('.//u').text
address = table.text.strip[/\A#{Regexp.escape(organization.name)}(.+?)(?=Co-ordinator:|Email:|Phone:|Fax:)/m, 1].gsub(/[[:space:]]{2,}/, "\n").strip
email = clean(table.at_xpath('.//a/@href').value).sub(/\Amailto:/, '')
contact_point = table.at_xpath('.//u[text()="Co-ordinator"]').next.text.sub(/\A: /, '')
organization.add_contact_detail('address', address)
organization.add_extra(:contact_point, contact_point)
organization.add_contact_detail('email', email)
organization.add_source(url, note: 'New Brunswick Directory of Public Bodies')
dispatch(organization)
end
end
end
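The address extraction in scrape_organizations packs several steps into one line. As a rough sketch of what that line does, here it is pulled out into a standalone helper and run against invented sample text. The helper name extract_address and the sample values are assumptions for illustration, not part of Pupa.rb or the real page.

```ruby
# Hypothetical helper mirroring the address logic above.
# Returns nil when the text does not match the expected layout.
def extract_address(text, name)
  # Capture everything between the organization name and the first
  # labelled field. The /m flag lets "." match newlines.
  pattern = /\A#{Regexp.escape(name)}(.+?)(?=Co-ordinator:|Email:|Phone:|Fax:)/m
  raw = text.strip[pattern, 1]
  return nil if raw.nil?
  # Collapse each run of two or more whitespace characters into a line break.
  raw.gsub(/[[:space:]]{2,}/, "\n").strip
end

# Sample text standing in for table.text (invented for illustration).
sample = "Example Department\n   123 Main Street\n   Fredericton, NB\n  Co-ordinator: Jane Doe\nEmail: jane@example.com"

puts extract_address(sample, 'Example Department')

# The email cleanup is a simple prefix removal.
puts 'mailto:jane@example.com'.sub(/\Amailto:/, '')
```

Unlike the one-liner, this sketch returns nil when nothing matches. In the original line, a table without one of the expected labels would raise NoMethodError, because gsub would be called on nil.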
PublicBodyProcessor.add_scraping_task(:organizations)
runner = Pupa::Runner.new(PublicBodyProcessor)
Registers the csv action, so that we can run it with:
ruby organization.rb --action csv > output.csv
runner.add_action(name: 'csv', description: 'Output organizations as CSV')
runner.run(ARGV)
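Since each row is built as a positional array, it can be worth checking that every row in the output has as many fields as the header. The following is a minimal sketch using only Ruby's standard csv library. The helper name misaligned_rows and the sample data are assumptions for illustration.

```ruby
require 'csv'

# Hypothetical helper: returns the data rows whose field count
# differs from the header's field count.
def misaligned_rows(csv_text)
  header, *rows = CSV.parse(csv_text)
  rows.reject { |row| row.size == header.size }
end

# Sample CSV standing in for output.csv (values invented for illustration).
sample = <<~CSV_TEXT
  title,abbr,key
  Example Department,,abc123
  Another Department,,def456,unexpected
CSV_TEXT

misaligned_rows(sample).each do |row|
  warn "Row has #{row.size} fields: #{row.inspect}"
end
```

To check a real file, pass File.read('output.csv') instead of the sample string.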
You’ve won at Pupa.rb! Explore the class and method documentation to learn how to do even more with Pupa.rb.