Skyscraper 0.3
Ascending to the next floor
Daniel Janus
clojureD, 2020-02-29
Scraping flow
(def seed
  [{:url "https://conference.news"
    :processor :homepage}])
Scraping flow
[{:title ":clojureD held in Berlin"
  :art-id "1"
  :url "https://conference.news/article/1"
  :processor :article}
 {:title "People enjoying themselves at :clojureD"
  :art-id "2"
  :url "https://conference.news/article/2"
  :processor :article}
 {:title "Awesome talks all around"
  :art-id "3"
  :url "https://conference.news/article/3"
  :processor :article}]
Scraping flow
[{:title ":clojureD held in Berlin"
  :art-id "1"
  :body "This is the text of the first article."
  :num-comments "31"}
 {:title "People enjoying themselves at :clojureD"
  :art-id "2"
  :body "Here comes the second text."
  :num-comments "41"}
 {:title "Awesome talks all around"
  :art-id "3"
  :body "And a third one."
  :num-comments "59"}]
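The flow on the preceding slides (seed, then article links, then merged results) can be sketched in plain Clojure. This is only an illustration of the merging idea, not Skyscraper's implementation: `fake-processors` and `scrape-sketch` are invented names, and the processors return canned data instead of fetching and parsing pages.

```clojure
;; Hypothetical stand-ins for real processors: each takes the current context
;; and returns a seq of maps, much as a :process-fn would after parsing a page.
(def fake-processors
  {:homepage (fn [_ctx]
               [{:title ":clojureD held in Berlin"
                 :art-id "1"
                 :url "https://conference.news/article/1"
                 :processor :article}
                {:title "Awesome talks all around"
                 :art-id "3"
                 :url "https://conference.news/article/3"
                 :processor :article}])
   :article  (fn [ctx]
               [{:body (str "Body of article " (:art-id ctx))}])})

(defn scrape-sketch
  "Depth-first walk: run the processor named by the context, merge each
  returned map onto the parent's fields, and recurse while a :processor remains."
  [ctx]
  (if-let [process (fake-processors (:processor ctx))]
    (mapcat (fn [child]
              ;; :url and :processor describe the page just visited,
              ;; so they are dropped before merging in the child.
              (scrape-sketch (merge (dissoc ctx :url :processor) child)))
            (process ctx))
    [ctx]))

(scrape-sketch {:url "https://conference.news" :processor :homepage})
;; => ({:title ":clojureD held in Berlin", :art-id "1", :body "Body of article 1"}
;;     {:title "Awesome talks all around", :art-id "3", :body "Body of article 3"})
```

Each leaf carries the fields of every page on the path that led to it, which is why the final maps contain both :title (from the homepage) and :body (from the article).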
Processor
;; Assumes select, attr and text are referred from Reaver,
;; and clojure.string is aliased as string.
(defprocessor :homepage
  :cache-template "index"
  :process-fn (fn [doc ctx]
                (for [x (select doc "ul.articles li a")
                      :let [url (attr x :href)]]
                  {:title (text x)
                   :url url
                   :art-id (last (string/split url #"/"))
                   :processor :article})))
New in 0.3
- Asynchronous (core.async FTW)
- Faster, more robust
- Database integration
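To give a flavour of what core.async makes possible, here is a minimal sketch of running a blocking step over many items in parallel while keeping results in input order. It assumes core.async 1.2 or later is on the classpath (for `to-chan!`); `process-all` and `fake-fetch` are hypothetical names, and this is not how Skyscraper itself is structured internally.

```clojure
(require '[clojure.core.async :as async])

(defn process-all
  "Applies the blocking function f to every item using up to n threads
  and returns a vector of results in input order."
  [n f items]
  (let [in  (async/to-chan! items)   ; closes once all items are taken
        out (async/chan n)]
    ;; pipeline-blocking closes `out` after `in` has been drained
    (async/pipeline-blocking n out (map f) in)
    (async/<!! (async/into [] out))))

;; Hypothetical stand-in for a real HTTP download.
(defn fake-fetch [url]
  (Thread/sleep 100)
  (str "<html>" url "</html>"))

(process-all 4 fake-fetch
             ["https://conference.news/article/1"
              "https://conference.news/article/2"
              "https://conference.news/article/3"])
```

With four workers, the three simulated downloads overlap instead of running back to back.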
Processor, again
(defprocessor :homepage
  :cache-template "index"
  ;; context keys to store as columns of the homepage table
  :skyscraper.db/columns [:title :art-id]
  ;; columns that identify a row, so re-scraping does not duplicate it
  :skyscraper.db/key-columns [:art-id]
  :process-fn (fn [doc ctx]
                (for [x (select doc "ul.articles li a")
                      :let [url (attr x :href)]]
                  {:title (text x)
                   :url url
                   :art-id (last (string/split url #"/"))
                   :processor :article})))
Running it, again
(scrape! seed
         :db-file "conf.sqlite")
Lo and behold
sqlite> .schema
CREATE TABLE homepage (id integer primary key,
                       parent integer,
                       title text,
                       art_id text);
CREATE TABLE article (id integer primary key,
                      parent integer,
                      body text,
                      num_comments text);
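The schema shows the kebab-case keys from the processors (:art-id, :num-comments) arriving as snake_case columns. A minimal sketch of such a mapping, assuming a plain hyphen-to-underscore substitution; `sql-name` is a hypothetical helper, not a function from Skyscraper:

```clojure
(require '[clojure.string :as string])

(defn sql-name
  "Turns a keyword such as :num-comments into the identifier \"num_comments\"."
  [k]
  (string/replace (name k) "-" "_"))

(map sql-name [:title :art-id :num-comments])
;; => ("title" "art_id" "num_comments")
```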
Lo and behold
sqlite> select title, num_comments
        from article
        join homepage on article.parent = homepage.id;
       title = Awesome talks all around
num_comments = 59
       title = :clojureD held in Berlin
num_comments = 31
       title = People enjoying themselves at :clojureD
num_comments = 41