Building a Pi News Aggregator¶
Preface¶
In this tutorial, we will use the rampart-curl, rampart-sql, rampart-html, rampart-robots, rampart-url and rampart-server modules. With these we will grab the latest articles from three Raspberry Pi news sites, process them, and build or update a SQL database that will back a landing page and a search.
You can download the complete project from our Tutorial Repository in the pi_news directory.
License¶
The code related to this tutorial is released under the MIT license.
Concepts¶
This module will demonstrate:
- Using rampart-curl to grab articles over http.
- Using rampart-robots to comply with robots.txt exclusions.
- Using rampart-html to find article links, images and the main article content.
- Using rampart-url to normalize links extracted from html pages.
- Using rampart-server to create and serve the web interface.
- Using rampart-sql to store the article text and create a full text search engine.
In order to complete this tutorial, you should have basic knowledge of JavaScript, HTML and SQL.
Getting Started¶
Web Server Layout¶
In the Rampart binary distribution is a sample web server tree. For our purposes here, we assume you have downloaded and unzipped the Rampart binary distribution into a directory named ~/downloads/rampart.
We will use that for this project. The rampart-server HTTP module is configured and loaded from the included web_server/web_server_conf.js script. It defines web_server/html as the default directory for static html and web_server/apps as the default directory for scripts.
To get started, copy the web_server directory to a convenient place for this project. Also, for our purposes, we do not need anything in the web_server/apps/test_modules or web_server/wsapps directories, nor the default web_server/html/index.html page, so you can delete those copies (the leftover web_server/html/websockets_chat/index.html is not used here and can be ignored). The commands below create the project directory, an empty aggregator script and an empty apps/pi_news/search.js module for the web interface, then show the resulting layout:
user@localhost:~$ mkdir pi_news
user@localhost:~$ cd pi_news
user@localhost:~/pi_news$ touch pi_news_aggregator.js
user@localhost:~/pi_news$ cp -a ~/downloads/rampart/web_server ./
user@localhost:~/pi_news$ cd web_server/
user@localhost:~/pi_news/web_server$ rm -rf apps/test_modules wsapps/* html/index.html
user@localhost:~/pi_news/web_server$ mkdir apps/pi_news
user@localhost:~/pi_news/web_server$ touch apps/pi_news/search.js
user@localhost:~/pi_news/web_server$ find .
./wsapps
./start_server.sh
./stop_server.sh
./logs
./apps
./apps/pi_news
./apps/pi_news/search.js
./web_server_conf.js
./html
./html/images
./html/images/inigo-not-fount.jpg
./html/websockets_chat
./html/websockets_chat/index.html
./data
The Aggregator¶
Let’s start by editing the pi_news_aggregator.js file with a skeleton script that loads the needed modules:
#!/usr/local/bin/rampart
rampart.globalize(rampart.utils);
var Sql = require("rampart-sql");
var curl = require("rampart-curl");
var html = require("rampart-html");
var robots = require("rampart-robots");
var urlutil= require("rampart-url");
var dbdir = process.scriptPath + "/web_server/data/pi_news";
var sql = new Sql.init(dbdir, true);
// The date format we are expecting from http servers
var dateFmt = "%a, %d %b %Y %H:%M:%S GMT";
var crawlDelay = 10; // go slow - delay in seconds between fetches from same site.
// Columns for our table
// status, fetch_count, server_date, fetch_date, site, url, img_url, title, text
var schema = "status smallint, fetch_count smallint, server_date date, fetch_date date, " +
"site varchar(8), url varchar(64), img_url varchar(64), title varchar(64), " +
"text varchar(1024)";
// running in two stages - first gets all articles from several index pages
//                       - second gets latest articles from main index page
// First stage is done by running with command line argument '--first-run'
var firstRun = false;
if ( process.argv[2] == '--first-run' )
firstRun=true;
var sites = {}
// the names of our sites.
var sitenames = Object.keys(sites);
function fetch(url) {
}
// process the index page holding links to articles
function procIndex(site, docBody) {
}
// update a row of data in our table
function update_row(data) {
}
// insert a new row into our table
function insert_row(data) {
}
// check for new articles in the main index page (sites[].url)
function check_new_articles() {
}
// on first run, get main index page and a few more index pages
function first_run() {
}
// get relevant text and info from article html
function procpage(site, dbrow, fetch_res) {
}
//fetch, process and update table
function fetch_article(sitename) {
}
// get article pages
function fetch_all_articles(){
}
// get index pages
if(firstRun) {
first_run();
} else {
check_new_articles();
}
// get article pages
fetch_all_articles();
Examining the External Sites¶
In order to parse the html to extract links, images and content, we need to have a peek at each site’s structure. This is the portion of the script that we will need to keep up to date if any of the sites we are scraping change their format. In short, we want to know:
- The URL of the index page with the links to the latest articles.
- The format of the URL for more of these pages, for older content.
- How many of these extra pages to fetch on our first run.
- How to locate the links to articles and, if present, images from within the index page.
- How to locate the relevant content (and possibly an image link) within the article pages themselves.
- How to remove unwanted content inside our extracted “relevant content”.
So, without going into great detail, we will define all of these in our sites
Object. In this case, the relevant HTML tags can be found using CSS class names:
/* details of each scrape:
url - the index page with links to the latest articles
urlNextFmt - more index pages with links to articles
initialPages - how many pages with links to articles to grab on first run
entryClass - the CSS class of the element holding article links, for index pages
entryImgClass - the CSS class of the element on the index pages with a related image
contentClass - on article page, the CSS of element that holds most relevant text
contentImgClass - the CSS class of the element on the content pages with a related image
contentRemoveClass - the CSS class of elements on the content pages inside contentClass that should be removed
*/
var sites = {
"hackaday":
{
name: "hackaday",
url: "https://hackaday.com/category/raspberry-pi-2/",
urlNextFmt: "https://hackaday.com/category/raspberry-pi-2/page/%d/",
initialPages: 8,
entryClass: "entry-featured-image",
contentImgClass: "entry-featured-image",
contentClass: 'post'
},
"raspberrypi":
{
name: "raspberrypi",
url: "https://www.raspberrypi.com/news/",
urlNextFmt: "https://www.raspberrypi.com/news/page/%d/",
initialPages: 5,
entryClass: "c-blog-post-card__link",
entryImgClass: "c-blog-post-card__image",
contentClass: "c-blog-post-content",
contentRemoveClass: ["c-blog-post-content__footer"]
},
"makeuseof":
{
name: "makeuseof",
url: "https://www.makeuseof.com/search/raspberry%20pi/",
urlNextFmt: "https://www.makeuseof.com/search/raspberry%%20pi/%d/",
initialPages: 3,
entryClass: "bc-title-link",
entryImgClass: "bc-img",
contentClass: "article",
contentRemoveClass: ["sidebar", "next-btn", "sharing", "letter-from"]
}
}
Now that we have the basic information for each site, we need to set up a polite way of fetching content that respects robots.txt. We will make a fetch() function that does this for us automatically. Since all HTTP status codes are positive and greater than 100 (e.g. 200), we will use codes less than 100 for errors related to the fetch, and later store the code in our table.
In this function, we will load the robots.txt file for each site only once, and then use it with each new url from that site to check whether crawling is allowed.
// status < 0 -- try or retry
// status > 0 -- don't retry
// status = 0 -- first try
// status > 0 and < 100 - custom codes
var statTexts = {
"1": "Disallowed by robots.txt",
"2": "Cannot parse Url",
"-1": "Error retrieving robots.txt"
}
var robotstxt = {}; //load these once per run
var userAgent = "rampart_tutorial"; //our identification
// fetch with robots.txt check
function fetch(url) {
var comp = urlutil.components(url);
var origin, res;
if(!comp)
return {status: 2, statusText: statTexts[2]};
origin=comp.origin;
if(!robotstxt[origin]) {
var rurl = origin + '/robots.txt'
printf("fetching %s\r", rurl);
fflush(stdout);
// body is a buffer. robots.isAllowed also takes a buffer, so we can dispense with text.
res = curl.fetch(rurl, {"user-agent": userAgent, returnText: false, location:true});
printf("%d - %s\n", res.status, rurl);
if(res.status==200) {
robotstxt[origin]=res.body;
} else if (res.status > 399) {
robotstxt[origin]=-1;
} else {
// there are other possibilities not covered here
return {status: -1, statusText: statTexts[-1]};
}
}
// if there is a robots.txt, and this url is disallowed, return status=1
if(robotstxt[origin]!=-1 && !robots.isAllowed(userAgent, robotstxt[origin], url) ) {
return {status: 1, statusText: statTexts[1]};
}
printf("fetching %s\r", url);
fflush(stdout);
res = curl.fetch (url, {"user-agent": userAgent, returnText: false, location:true});
printf("%d - %s\n", res.status, url);
if(res.status > 499) res.status *= -1; //return negative on server error, so we will retry
return res;
}
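As a quick illustration (this snippet is not part of the final script, and the URL is just an example), fetch() is called exactly like curl.fetch(), but disallowed or unparseable URLs come back with our custom codes:
// illustrative only -- the real calls are made from first_run(), check_new_articles() and fetch_article() below
var res = fetch("https://hackaday.com/category/raspberry-pi-2/");
if (res.status != 200)
    printf("fetch failed (status %d): %s\n", res.status, res.statusText);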
Defining the Table¶
Defining a schema for the table usually requires a bit of foresight to know exactly the data you need saved, and how it will be accessed. If we wanted to be very flexible, we could include a JSON field in our table, but for the purposes of this tutorial, that won’t be necessary.
We know we will want a way to uniquely identify each entry. Since each entry corresponds to a specific URL, we will have that covered just by including the URL of the article (which we would do in any case). We also want to save the site name and the status code returned from our fetch() function. In addition, we will be extracting the article’s title, a related image URL and, of course, the article text.
Other items we may want in the future include the date we fetched the article, the date of the article as returned by the server (we’ll do both) and the actual HTML of the article (which we will not save in this example). Saving the HTML is a wise strategy for anything beyond a demo: if we make a mistake or a site’s format changes later, it lets us fix the problem without re-fetching many pages.
So we will make a schema, plus a couple of functions to create the table and the indexes on it.
// Columns for our table
// status, fetch_count, server_date, fetch_date, site, url, img_url, title, text
var schema = "status smallint, fetch_count smallint, server_date date, fetch_date date, " +
"site varchar(8), url varchar(64), img_url varchar(64), title varchar(64), " +
"text varchar(1024)";
function create_table() {
// create using schema
sql.exec(`create table pipages (${schema});`);
// unique index on url, so we don't have duplicates
sql.exec('create unique index pipages_url_ux on pipages(url);');
}
/*
Regular indexes are updated on each insert/update.
Text indexes, however, need to be manually updated.
- When a new row is inserted, it still is available for search,
but that search is a linear scan of the document, so it is slower.
- A text index can be updated with either:
1) sql.exec("alter index pipages_text_ftx OPTIMIZE");
or
2) issuing the same command that created the index as in make_index() below.
Here we will issue the same command when creating and updating for simplicity.
*/
function make_index(){
// we want to match "Check out the new pi 4." if we search for "pi 4"
// so we add an expression for [\s\,]\d+[\s,\.] (in PERLRE), but only matching the number(s)
sql.exec(
"create fulltext index pipages_text_ftx on pipages(text) " +
"WITH WORDEXPRESSIONS "+
"('[\\alnum\\x80-\\xFF]{2,99}', '[\\space\\,]\\P=\\digit+\\F[\\space,\\.]=') "+
"INDEXMETER 'on'");
}
We separate the creation of the fulltext
index into its own function since it will be run each time we
update the content.
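Once the index exists, the text column can be searched with the LIKEP operator. A minimal illustration (not part of the aggregator script, and assuming the same sql handle) might look like:
// illustrative: show the five best matches for "pi 4"
var res = sql.exec(
    "select title, url from pipages where text likep ?;",
    ["pi 4"],
    {maxRows: 5}
);
console.log(res.rows);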
Parsing The Index Page HTML¶
We now need to use the data stored in sites to find the links to articles and images in the index page of each site. To that end, we will use the following function to find the class of the element (or one of its child elements) that contains the href and/or src of the article URL or image URL we need.
Note that the extraction of the image URLs is perhaps not as tidy as we might like. MakeUseOf stores them in <source> tags, and for Hackaday it is easier to get the image URL from the article itself. So we will have to make accommodations in our function.
// process the index page holding links to articles
function procIndex(site, docBody) {
var doc = html.newDocument(docBody);
var entries = doc.findClass(site.entryClass);
var images;
if(site.entryImgClass) {
images = doc.findClass(site.entryImgClass);
}
var insertRows = [];
for (var i=0; i<entries.length; i++) {
var row={site: site.name};
var entry = entries.eq(i);
var atag;
if(entry.hasTag("a")[0])
atag=entry;
else
atag = entry.findTag("a");
row.url = atag.getAttr("href");
if(row.url.length)
row.url = urlutil.absUrl(site.url, row.url[0]);
if(images && images.length) {
var image = images.eq(i);
if (image.length) {
var imgtag = image.findTag("source"); //makeuseof
if(imgtag.length) {
imgtag = imgtag.eq(0);
var srcset = imgtag.getAttr("data-srcset");
if(srcset[0]=='') {
srcset = imgtag.getAttr("srcset");
}
if(srcset.length)
row.img_url = srcset;
} else { //raspberrypi
if(image.hasTag("img")[0])
imgtag = image;
else
imgtag = image.findTag("img");
row.img_url = imgtag.getAttr("src");
}
}
if(row.img_url && row.img_url.length)
row.img_url = row.img_url[0];
}
insertRows.push(row);
}
return insertRows;
}
What we get in return is an Array of Objects, each with properties that match the column names of our table. Naming the properties the same as the column names will make for easy inserts and updates of records.
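For example, one element of that Array might look like this (the values here are purely illustrative):
// an illustrative entry from the Array returned by procIndex()
var example_entry = {
    site:    "raspberrypi",
    url:     "https://www.raspberrypi.com/news/an-example-article/",
    img_url: "https://www.raspberrypi.com/app/uploads/an-example-image.jpg"
};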
Inserting New URLs into the Table¶
We have some of the columns for each row in the return value of procIndex(), but several are missing. We need to fill those in with blank values that will be updated when we actually fetch the article. To do that, we start with an empty Object, add the default blank values from empty_params, then overwrite them with the values we do have in data, all using the Object.assign() function.
//properties we need in the substitution parameters object when doing an insert
var empty_params = {
status:0,
fetch_count:0,
server_date:0,
fetch_date:0,
img_url:"",
title:"",
text:""
}
// insert a new row into our table
function insert_row(data) {
var dfilled = {}; // start with empty object
Object.assign(dfilled, empty_params); // add default empty params
Object.assign(dfilled, data); // overwrite params with what we have
var res = sql.exec("insert into pipages values " +
"(?status, ?fetch_count, ?server_date, ?fetch_date, ?site, ?url, ?img_url, ?title, ?text);",
dfilled
);
// if duplicate, sql.errMsg will be set with "duplicate value" message and res.rowCount == 0
// Checking res.rowCount first is likely faster
if (!res.rowCount && sql.errMsg.includes("Trying to insert duplicate value") ) {
res.isDup=true;
// remove sql.errMsg
sql.errMsg = '';
}
return res;
}
The First Run - Grabbing the Index Pages¶
The first time we run the script, we will need to create our database and the unique index on url, and then grab the index pages and several subsequent pages to populate our database with a fair number of recent articles (50-60 per site).
// get single key response without \n
function getresp(def, len) {
var l = (len)? len: 1;
var ret = stdin.getchar(l);
if(ret == '\n')
return def;
printf("\n");
return ret.toLowerCase();
}
// drop table if exists, but ask first
function clean_database() {
if( sql.one("select * from SYSTABLES where NAME = 'pipages';") ) {
printf('Table "pipages" exists. Drop it and delete all saved pages?\n [y/N]: ');
fflush(stdout); //flush text after \n
var resp = getresp('n'); //default no
if(resp == 'n') {
process.exit(0);
}
sql.exec("drop table pipages;");
if(sql.errMsg != '') {
printf("Error dropping table:\n%s",sql.errMsg);
process.exit(0);
}
}
}
// on first run, get main index page and a few more index pages
function first_run() {
var i,j,k;
clean_database();
create_table();
for (i=0; i< sitenames.length; i++) {
var site = sites[sitenames[i]];
var res;
printf("Getting new article urls for %s\n", site.url);
res = fetch(site.url);
if(res.status != 200) {
printf("error getting '%s':\n %s\n", site.url, res.statusText);
process.exit(1);
}
// extract urls of articles
var urllist = procIndex(site, res.body);
// insert urls into table
for (j=0; j<urllist.length; j++) {
insert_row(urllist[j]);
}
// go slow
sleep(crawlDelay);
// get second and subsequent pages up to site.initialPages
for (k=2; k<=site.initialPages;k++) {
var purl = sprintf(site.urlNextFmt, k);
res = fetch(purl);
if(res.status != 200) {
printf("error getting '%s':\n %s\n", site.url, res.statusText);
process.exit(1);
}
// extract urls of articles
var urllist = procIndex(site, res.body);
// insert urls into table
for (j=0; j<urllist.length; j++) {
insert_row(urllist[j]);
}
//sleep unless at end
if( k < site.initialPages)
sleep(crawlDelay);
}
}
}
Notice the crawlDelay variable. The sites have allowed us to index their content; however, they may change their minds if they see in their logs that you are grabbing many pages at once. So it pays to be conservative.
Update Runs - Grabbing New Article URLs¶
Here we will check, perhaps after a day or two, whether there are any new articles. Much as above, we will fetch sites[i].url and extract URLs. We’ll also print out some extra information about the links we get.
Also remember that we created a unique index on url, so we do not have to worry about inserting the same URL more than once.
// check for new articles in the main index page (sites[].url)
function check_new_articles() {
// check that our table exists
if( ! sql.one("select * from SYSTABLES where NAME = 'pipages';") ) {
console.log(
'Error: table "pipages" does not exist.\n' +
'If this is the first time running this script, run it with\n' +
`${process.argv[1]} --first-run`
);
process.exit(1);
}
for (var i=0; i< sitenames.length; i++) {
var site = sites[sitenames[i]];
var res;
printf("Getting new article urls for %s\n", site.url);
res = fetch(site.url);
var urllist = procIndex(site, res.body);
for (var j=0; j<urllist.length; j++) {
printf("checking %s - \r", urllist[j].url)
fflush(stdout);
var sqlres = insert_row(urllist[j]);
if(sqlres.isDup)
printf("exists: %s\n", urllist[j].url);
else
printf("NEW ARTICLE: %s\n", urllist[j].url);
}
}
}
Updating Content¶
Once we actually fetch and process the articles, we will need to write the updated content into the table. Since each fetch will give us a varying list of properties in data, we will make a function that builds an appropriate SQL statement for the update.
// update a row of data in our table
function update_row(data) {
var keys = Object.keys(data);
// build our "set" section of sql statement
// based on what we have in *data*
// should look like "text=?text, title=?title, ..."
var sqlstr="update pipages set ";
var j=0;
for (var i = 0; i<keys.length; i++) {
var key = keys[i];
if(key == 'url')
continue;
if(j)
sqlstr += ", "+key + "=?" + key;
else
sqlstr += key + "=?" + key;
j++;
}
// if we only have data.url, there's nothing to update
if(!j)
return;
printf("updating %s\n",data.url);
var res = sql.exec(sqlstr + " where url=?url;", data);
if(sql.errMsg) {
printf("sql update error, cannot continue:\n%s",sql.errMsg);
process.exit(1);
}
return res;
}
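To make that concrete, here is what update_row() would build for a small data object (the URL is hypothetical; if no such row exists, the statement simply updates nothing):
// illustrative: for this *data* object ...
var data = {url: "https://www.example.com/an-article/", status: 200, title: "An Example"};
// ... update_row() builds and executes:
//   update pipages set status=?status, title=?title where url=?url;
// with *data* supplying the substitution parameters.
update_row(data);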
Processing the Articles¶
Once our table has some URLs in it, we can go about the business of fetching and processing them. To fetch, we will continue to use our robots.txt-friendly fetch() function from within a new function called fetch_article(). That function will select articles to be fetched and updated based on their status.
To process the HTML and extract the text, we will create a new function called procpage() that uses the rampart-html module to extract the title and text.
We will put it all together with setInterval() in a function called fetch_all_articles(), so that crawlDelay is respected for each site. That function first checks the status field to see whether each site still has articles to fetch.
// get relevant text and info from article html
function procpage(site, dbrow, fetch_res) {
var row = {
url:dbrow.url,
fetch_date:'now',
fetch_count: dbrow.fetch_count +1
};
var image, imgtag;
// get server date
if(typeof fetch_res.headers.Date == 'string')
row.server_date = scanDate(fetch_res.headers.Date, dateFmt);
else if(typeof fetch_res.headers.date == 'string')
row.server_date = scanDate(fetch_res.headers.date, dateFmt);
else
row.server_date = 'now';
var doc = html.newDocument(fetch_res.body);
// the content is located in an element with CSS class *site.contentClass*
var content = doc.findClass(site.contentClass);
// remove from content items we don't want
if(site.contentRemoveClass) {
for (var i=0;i<site.contentRemoveClass.length;i++) {
content.findClass( site.contentRemoveClass[i] ).delete();
}
}
// makeuseof has an easier to grab image in the article
if(site.contentImgClass) {
image=doc.findClass( site.contentImgClass );
if(image.hasTag("img")[0])
imgtag = image;
else
imgtag = image.findTag("img");
if(imgtag.length)
row.img_url = imgtag.getAttr("src")[0];
}
// extract the text from the content html
row.text = content.toText({concatenate:true, titleText: true});
// find the <title> tag text
var title = doc.findTag('title');
if(title.length)
row.title = title.eq(0).toText()[0];
return row;
}
// interval IDs, needed to cancel setInterval when all pages have been fetched
var iId = {};
// status, fetch_count, server_date, fetch_date, site, url, img_url, title, text
//fetch, process and update table
function fetch_article(sitename) {
var row;
var res = sql.one("select * from pipages where site=? and status<1 and fetch_count<10 order by status DESC",
[sitename] );
if(!res) {
printf("%s is up to date\n", sitename);
// no more pages to fetch, so cancel interval
clearInterval(iId[sitename]);
delete iId[sitename]; // get rid of entry so we know we are done with it
// final action before script exits:
if(Object.keys(iId).length == 0) {
printf("updating fulltext index\n");
make_index();
}
return;
}
var site = sites[res.site];
// fetch the page
var cres = fetch(res.url);
if(cres.status != 200 ) {
// failed
update_row({
url:res.url,
status: cres.status,
fetch_count: res.fetch_count + 1
});
} else {
// success
row=procpage(site, res, cres);
row.status=200;
update_row(row);
}
}
// get article pages asynchronously using setInterval
function fetch_all_articles(){
for (var i = 0; i < sitenames.length; i++) {
var site = sites[sitenames[i]];
var res = sql.one("select * from pipages where site=? and status<1 and fetch_count<10 order by status DESC",
[sitenames[i]] );
if (res) { // only if this site has pages left to fetch; otherwise we'd waste crawlDelay seconds just to exit
// this complicated mess is to make sure *site.name*
// is scoped to the setInterval callback and stays the same
// even as the *site* variable changes on later iterations of the *for* loop.
// if you are unfamiliar with this technique and immediately invoked functions - see:
// https://www.google.com/search?q=iife+settimeout
// scope sitename in an IIFE
(function(sn) {
iId[sn] = setInterval(
function() { fetch_article(sn);},
crawlDelay * 1000
);
})(site.name);
/* or use IIFE to return a function with sitename scoped. Result is the same.
iId[site.name] = setInterval(
(function(sn) {
return function() {
fetch_article(sn);
}
})(site.name),
crawlDelay * 1000
);
*/
}
}
}
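In procpage() above, the scanDate() calls come from rampart.utils (globalized at the top of the script) and parse the server’s Date header using dateFmt. A small illustration:
// illustrative: turn an HTTP Date header into a Date object for the server_date column
var d = scanDate("Mon, 02 Jan 2023 15:04:05 GMT", "%a, %d %b %Y %H:%M:%S GMT");
console.log(d);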
The Final Aggregator Script¶
Putting everything above together, we end up with this script:
#!/usr/local/bin/rampart
rampart.globalize(rampart.utils);
var Sql = require("rampart-sql");
var curl = require("rampart-curl");
var html = require("rampart-html");
var robots = require("rampart-robots");
var urlutil= require("rampart-url");
var dbdir = process.scriptPath + "/web_server/data/pi_news";
var sql = new Sql.init(dbdir, true);
// The date format we are expecting from http servers
var dateFmt = "%a, %d %b %Y %H:%M:%S GMT";
var crawlDelay = 10; // go slow - delay in seconds between fetches from same site.
// Columns for our table
// status, fetch_count, server_date, fetch_date, site, url, img_url, title, text
var schema = "status smallint, fetch_count smallint, server_date date, fetch_date date, " +
"site varchar(8), url varchar(64), img_url varchar(64), title varchar(64), " +
"text varchar(1024)";
// running in two stages - first gets all articles from several index pages
// - second gets latest articles from main index page
// First stage is done by running with command line argument '--first-run'
var firstRun = false;
if ( process.argv[2] == '--first-run' )
firstRun=true;
/* details of each scrape:
url - the index page with links to the latest articles
urlNextFmt - more index pages with links to articles
initialPages - how many pages with links to articles to grab on first run
entryClass - the CSS class of the element holding article links, for index pages
entryImgClass - the CSS class of the element on the index pages with a related image
contentClass - on article page, the CSS of element that holds most relevant text
contentImgClass - the CSS class of the element on the content pages with a related image
contentRemoveClass - the CSS class of elements on the content pages inside contentClass that should be removed
*/
var sites = {
"hackaday":
{
name: "hackaday",
url: "https://hackaday.com/category/raspberry-pi-2/",
urlNextFmt: "https://hackaday.com/category/raspberry-pi-2/page/%d/",
initialPages: 8,
entryClass: "entry-featured-image",
contentImgClass: "entry-featured-image",
contentClass: 'post'
},
"raspberrypi":
{
name: "raspberrypi",
url: "https://www.raspberrypi.com/news/",
urlNextFmt: "https://www.raspberrypi.com/news/page/%d/",
initialPages: 5,
entryClass: "c-blog-post-card__link",
entryImgClass: "c-blog-post-card__image",
contentClass: "c-blog-post-content",
contentRemoveClass: ["c-blog-post-content__footer"]
},
"makeuseof":
{
name: "makeuseof",
url: "https://www.makeuseof.com/search/raspberry%20pi/",
urlNextFmt: "https://www.makeuseof.com/search/raspberry%%20pi/%d/",
initialPages: 3,
entryClass: "bc-title-link",
entryImgClass: "bc-img",
contentClass: "article",
contentRemoveClass: ["sidebar", "next-btn", "sharing", "letter-from"]
}
}
// the names of our sites.
var sitenames = Object.keys(sites);
// status < 0 -- try or retry
// status > 0 -- don't retry
// status = 0 -- first try
// status > 0 and < 100 - custom codes
var statTexts = {
"1": "Disallowed by robots.txt",
"2": "Cannot parse Url",
"-1": "Error retrieving robots.txt"
}
var robotstxt = {}; //load these once per run
var userAgent = "rampart_tutorial"; //our identification
// fetch with robots.txt check
function fetch(url) {
var comp = urlutil.components(url);
var origin, res;
if(!comp)
return {status: 2, statusText: statTexts[2]};
origin=comp.origin;
if(!robotstxt[origin]) {
var rurl = origin + '/robots.txt'
printf("fetching %s\r", rurl);
fflush(stdout);
// body is a buffer. robots.isAllowed also takes a buffer, so we can dispense with text.
res = curl.fetch(rurl, {"user-agent": userAgent, returnText: false, location:true});
printf("%d - %s\n", res.status, rurl);
if(res.status==200) {
robotstxt[origin]=res.body;
} else if (res.status > 399) {
robotstxt[origin]=-1;
} else {
// there are other possibilities not covered here
return {status: -1, statusText: statTexts[-1]};
}
}
// if there is a robots.txt, and this url is disallowed, return status=1
if(robotstxt[origin]!=-1 && !robots.isAllowed(userAgent, robotstxt[origin], url) ) {
return {status: 1, statusText: statTexts[1]};
}
printf("fetching %s\r", url);
fflush(stdout);
res = curl.fetch (url, {"user-agent": userAgent, returnText: false, location:true});
printf("%d - %s\n", res.status, url);
if(res.status > 499) res.status *= -1; //return negative on server error, so we will retry
return res;
}
// process the index page holding links to articles
function procIndex(site, docBody) {
var doc = html.newDocument(docBody);
var entries = doc.findClass(site.entryClass);
var images;
if(site.entryImgClass) {
images = doc.findClass(site.entryImgClass);
}
var insertRows = [];
for (var i=0; i<entries.length; i++) {
var row={site: site.name};
var entry = entries.eq(i);
var atag;
if(entry.hasTag("a")[0])
atag=entry;
else
atag = entry.findTag("a");
row.url = atag.getAttr("href");
if(row.url.length)
row.url = urlutil.absUrl(site.url, row.url[0]);
if(images && images.length) {
var image = images.eq(i);
if (image.length) {
var imgtag = image.findTag("source"); //makeuseof
if(imgtag.length) {
imgtag = imgtag.eq(0);
var srcset = imgtag.getAttr("data-srcset");
if(srcset[0]=='') {
srcset = imgtag.getAttr("srcset");
}
if(srcset.length)
row.img_url = srcset;
} else { //raspberrypi
if(image.hasTag("img")[0])
imgtag = image;
else
imgtag = image.findTag("img");
row.img_url = imgtag.getAttr("src");
}
}
if(row.img_url && row.img_url.length)
row.img_url = row.img_url[0];
}
insertRows.push(row);
}
return insertRows;
}
// get single key response without \n
function getresp(def, len) {
var l = (len)? len: 1;
var ret = stdin.getchar(l);
if(ret == '\n')
return def;
printf("\n");
return ret.toLowerCase();
}
// drop table if exists, but ask first
function clean_database() {
if( sql.one("select * from SYSTABLES where NAME = 'pipages';") ) {
printf('Table "pipages" exists. Drop it and delete all saved pages?\n [y/N]: ');
fflush(stdout); //flush text after \n
var resp = getresp('n'); //default no
if(resp == 'n') {
process.exit(0);
}
sql.exec("drop table pipages;");
if(sql.errMsg != '') {
printf("Error dropping table:\n%s",sql.errMsg);
process.exit(0);
}
}
}
function create_table() {
// create using schema
sql.exec(`create table pipages (${schema});`);
// unique index on url, so we don't have duplicates
sql.exec('create unique index pipages_url_ux on pipages(url);');
}
// update a row of data in our table
function update_row(data) {
var keys = Object.keys(data);
// build our "set" section of sql statement
// based on what we have in *data*
// should look like "text=?text, title=?title, ..."
var sqlstr="update pipages set ";
var j=0;
for (var i = 0; i<keys.length; i++) {
var key = keys[i];
if(key == 'url')
continue;
if(j)
sqlstr += ", "+key + "=?" + key;
else
sqlstr += key + "=?" + key;
j++;
}
// if we only have data.url, there's nothing to update
if(!j)
return;
printf("updating %s\n",data.url);
var res = sql.exec(sqlstr + " where url=?url;", data);
if(sql.errMsg) {
printf("sql update error, cannot continue:\n%s",sql.errMsg);
process.exit(1);
}
return res;
}
//properties we need in the substitution parameters object when doing an insert
var empty_params = {
status:0,
fetch_count:0,
server_date:0,
fetch_date:0,
img_url:"",
title:"",
text:""
}
// insert a new row into our table
function insert_row(data) {
var dfilled = {}; // start with empty object
Object.assign(dfilled, empty_params); // add default empty params
Object.assign(dfilled, data); // overwrite params with what we have
var res = sql.exec("insert into pipages values " +
"(?status, ?fetch_count, ?server_date, ?fetch_date, ?site, ?url, ?img_url, ?title, ?text);",
dfilled
);
// if duplicate, sql.errMsg will be set with "duplicate value" message and res.rowCount == 0
// Checking res.rowCount first is likely faster
if (!res.rowCount && sql.errMsg.includes("Trying to insert duplicate value") ) {
res.isDup=true;
// remove sql.errMsg
sql.errMsg = '';
}
return res;
}
// check for new articles in the main index page (sites[].url)
function check_new_articles() {
// check that our table exists
if( ! sql.one("select * from SYSTABLES where NAME = 'pipages';") ) {
console.log(
'Error: table "pipages" does not exist.\n' +
'If this is the first time running this script, run it with\n' +
`${process.argv[1]} --first-run`
);
process.exit(1);
}
for (var i=0; i< sitenames.length; i++) {
var site = sites[sitenames[i]];
var res;
printf("Getting new article urls for %s\n", site.url);
res = fetch(site.url);
var urllist = procIndex(site, res.body);
for (var j=0; j<urllist.length; j++) {
printf("checking %s - \r", urllist[j].url)
fflush(stdout);
var sqlres = insert_row(urllist[j]);
if(sqlres.isDup)
printf("exists: %s\n", urllist[j].url);
else
printf("NEW ARTICLE: %s\n", urllist[j].url);
}
}
}
// on first run, get main index page and a few more index pages
function first_run() {
var i,j,k;
clean_database();
create_table();
for (i=0; i< sitenames.length; i++) {
var site = sites[sitenames[i]];
var res;
printf("Getting new article urls for %s\n", site.url);
res = fetch(site.url);
if(res.status != 200) {
printf("error getting '%s':\n %s\n", site.url, res.statusText);
process.exit(1);
}
// extract urls of articles
var urllist = procIndex(site, res.body);
// insert urls into table
for (j=0; j<urllist.length; j++) {
insert_row(urllist[j]);
}
// go slow
sleep(crawlDelay);
// get second and subsequent pages up to site.initialPages
for (k=2; k<=site.initialPages;k++) {
var purl = sprintf(site.urlNextFmt, k);
res = fetch(purl);
if(res.status != 200) {
printf("error getting '%s':\n %s\n", site.url, res.statusText);
process.exit(1);
}
// extract urls of articles
var urllist = procIndex(site, res.body);
// insert urls into table
for (j=0; j<urllist.length; j++) {
insert_row(urllist[j]);
}
//sleep unless at end
if( k < site.initialPages)
sleep(crawlDelay);
}
}
}
// get relevant text and info from article html
function procpage(site, dbrow, fetch_res) {
var row = {
url:dbrow.url,
fetch_date:'now',
fetch_count: dbrow.fetch_count +1
};
var image, imgtag;
// get server date
if(typeof fetch_res.headers.Date == 'string')
row.server_date = scanDate(fetch_res.headers.Date, dateFmt);
else if(typeof fetch_res.headers.date == 'string')
row.server_date = scanDate(fetch_res.headers.date, dateFmt);
else
row.server_date = 'now';
var doc = html.newDocument(fetch_res.body);
// the content is located in an element with CSS class *site.contentClass*
var content = doc.findClass(site.contentClass);
// remove from content items we don't want
if(site.contentRemoveClass) {
for (var i=0;i<site.contentRemoveClass.length;i++) {
content.findClass( site.contentRemoveClass[i] ).delete();
}
}
// makeuseof has an easier to grab image in the article
if(site.contentImgClass) {
image=doc.findClass( site.contentImgClass );
if(image.hasTag("img")[0])
imgtag = image;
else
imgtag = image.findTag("img");
if(imgtag.length)
row.img_url = imgtag.getAttr("src")[0];
}
// extract the text from the content html
row.text = content.toText({concatenate:true, titleText: true});
// find the <title> tag text
var title = doc.findTag('title');
if(title.length)
row.title = title.eq(0).toText()[0];
return row;
}
/*
Regular indexes are updated on each insert/update.
Text indexes, however, need to be manually updated.
- When a new row is inserted, it still is available for search,
but that search is a linear scan of the document, so it is slower.
- A text index can be updated with either:
1) sql.exec("alter index pipages_text_ftx OPTIMIZE");
or
2) issuing the same command that created the index as in make_index() below.
Here we will issue the same command when creating and updating for simplicity.
*/
function make_index(){
// we want to match "Check out the new pi 4." if we search for "pi 4"
// so we add an expression for [\s\,]\d+[\s,\.] (in PERLRE), but only matching the number(s)
sql.exec(
"create fulltext index pipages_text_ftx on pipages(text) " +
"WITH WORDEXPRESSIONS "+
"('[\\alnum\\x80-\\xFF]{2,99}', '[\\space\\,]\\P=\\digit+\\F[\\space,\\.]=') "+
"INDEXMETER 'on'");
}
// interval IDs, needed to cancel setInterval when all pages have been fetched
var iId = {};
// status, fetch_count, server_date, fetch_date, site, url, img_url, title, text
//fetch, process and update table
function fetch_article(sitename) {
var row;
var res = sql.one("select * from pipages where site=? and status<1 and fetch_count<10 order by status DESC",
[sitename] );
if(!res) {
printf("%s is up to date\n", sitename);
// no more pages to fetch, so cancel interval
clearInterval(iId[sitename]);
delete iId[sitename]; // get rid of entry so we know we are done with it
// final action before script exits:
if(Object.keys(iId).length == 0) {
printf("updating fulltext index\n");
make_index();
}
return;
}
var site = sites[res.site];
// fetch the page
var cres = fetch(res.url);
if(cres.status != 200 ) {
// failed
update_row({
url:res.url,
status: cres.status,
fetch_count: res.fetch_count + 1
});
} else {
// success
row=procpage(site, res, cres);
row.status=200;
update_row(row);
}
}
// get article pages asynchronously using setInterval
function fetch_all_articles(){
for (var i = 0; i < sitenames.length; i++) {
var site = sites[sitenames[i]];
var res = sql.one("select * from pipages where site=? and status<1 and fetch_count<10 order by status DESC",
[sitenames[i]] );
if (res) { // only if this site has pages left to fetch; otherwise we'd waste crawlDelay seconds just to exit
// this complicated mess is to make sure *site.name*
// is scoped to the setInterval callback and stays the same
// even as the *site* variable changes on later iterations of the *for* loop.
// if you are unfamiliar with this technique and immediately invoked functions - see:
// https://www.google.com/search?q=iife+settimeout
// scope sitename in an IIFE
(function(sn) {
iId[sn] = setInterval(
function() { fetch_article(sn);},
crawlDelay * 1000
);
})(site.name);
/* or use IIFE to return a function with sitename scoped. Result is the same.
iId[site.name] = setInterval(
(function(sn) {
return function() {
fetch_article(sn);
}
})(site.name),
crawlDelay * 1000
);
*/
}
}
}
// get index pages
if(firstRun) {
first_run();
} else {
check_new_articles();
}
// get article pages
fetch_all_articles();
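Assuming the rampart executable is installed at the path given in the shebang line (adjust as needed for your install), the two run modes from the project directory look like this:
user@localhost:~/pi_news$ chmod +x pi_news_aggregator.js
user@localhost:~/pi_news$ ./pi_news_aggregator.js --first-run
user@localhost:~/pi_news$ ./pi_news_aggregator.js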
The Search Script¶
The search script, which rampart-server runs for each request, is fairly short and self-explanatory compared to the aggregator script. Here we edit the apps/pi_news/search.js file and add some styling using Bootstrap:
/* The PI NEWS Demo Search Module
Everything outside the 'search()' function below is run once (per thread) when
the module is loaded. The 'module.exports=search' at the bottom exports the
function which is run once per client request.
Note that this means that variables set outside the 'search()' function are
long lived and span multiple requests by potentially multiple clients. As
such, care should be taken that these variables do not include any information
which should not be shared between clients.
*/
// Load the sql module. This only needs to be done once
var Sql=require("rampart-sql");
// process.scriptPath is the path of the web_server_conf.js, not
// the path of this module script. For the path of this module,
// use 'module.path'.
var db=process.scriptPath + '/data/pi_news';
// Open the db.
var sql=new Sql.init(db);
// make printf = rampart.utils.printf
// See: https://rampart.dev/docs/rampart-main.html#rampart-globalize
rampart.globalize(rampart.utils);
/*
Example of some settings that modify search weights and can be used to
tune the search results. These values are just examples and are not
tuned for this search.
See: https://rampart.dev/docs/sql-set.html#rank-knobs
and https://rampart.dev/docs/sql-set.html#other-ranking-properties
sql.set({
"likepallmatch":false,
"likepleadbias":250,
"likepproximity":750,
"likeprows":2000,
});
NOTE ALSO:
Here the 'sql' variable is set when the module is loaded. Any changes
made in 'search()' below using sql.set() will be carried forward for
the next client (per server thread). If you have settings (as set in
'sql.set()') that change per request or per client, it is highly
advisable that you call 'sql.reset()' at the beginning of the exported
'search()' function below followed by a `sql.set()` call to make the
behavioral changes desired.
If the sql.set() call is intended for every client and every search,
setting it here is not problematic.
*/
// we want "pi 4" query to match "Check out the new pi 4."
// so we set qminwordlen to 1, the shortest word in a query (normally 2).
// We also set wordexpressions to:
// ('[\\alnum\\x80-\\xFF]{2,99}', '[\\space\\,]\\P=\\digit+\\F[\\space,\\.]=')
// in the scraping script so that single digits will be indexed.
// In another context, this might be counterproductive as it pollutes the index, but
// here we have a small table and a greater need to match single digits.
sql.set({ "qminwordlen":1 });
/*
the top section of html only needs to be set once as it remains the same
regardless of the request. Here it is set when the module is first loaded.
This is a printf style format string so that the query text box may be
filled if, e.g., ?q=pi+4 is set.
The sprintf('%w', format_string) call:
This removes leading white space so the markup can be pretty
here in the source but compact when sent.
See https://rampart.dev/docs/rampart-utils.html#printf
*/
var htmltop_format=sprintf('%w',
`<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3" crossorigin="anonymous">
<style>
body {font-family: arial,sans-serif;}
.urlsp {color:#006621;max-width:100%%;overflow: hidden;text-overflow: ellipsis;white-space:nowrap;display:inline-block;font-size:.90em;}
.urla {text-decoration: none;font-size:16px;overflow: hidden;text-overflow: ellipsis;white-space:nowrap;display:inline-block; width: 100%%; }
.res {margin-top: 80px !important;}
.img-cover{object-fit: cover; aspect-ratio:5/3}
</style>
</head>
<body>
<nav style="z-index:1" class="navbar position-fixed top-0 w-100 navbar-expand-lg navbar-light bg-light">
<div class="container-fluid">
<a class="navbar-brand m-auto p-2" href="#">
【Rampart : Pi News】
</a>
<form style="width:100%%" id="mf" action="/apps/pi_news/search.html">
<div class="input-group mb-2">
<input autocomplete="off" type="text" id="fq" name="q" value="%H" placeholder="Search" class="form-control">
<button class="btn btn-outline-secondary" type="submit">Search</button>
</div>
</form>
</div>
</nav>
`
);
function search(req) {
var q=req.query.q ? req.query.q: "";
// req.query.skip (e.g. in "/apps/pi_news/search.html?q=pi&skip=10") is text.
// Make it a JavaScript number.
var skip=parseInt( req.query.skip );
var icount=0; //estimated total number of results, set below
var endhtml; // closing tags, set below
var nres=12; // number of results per page
// add the htmltop text to the server's output buffer.
// See: https://rampart.dev/docs/rampart-server.html#req-printf
// it includes escaped '%%' values and the 'value="%H"' format code for the query
req.printf(htmltop_format, q);
if (!skip)skip=0;
var sqlStatement;
// if there is a query, search for it and format the results.
// if not, just send the latest articles.
if(req.query.q) {
/* The SQL statement:
%mbH in stringformat() means highlight with bold and html escape.
See: https://rampart.dev/docs/rampart-sql.html#metamorph-hit-mark-up
https://rampart.dev/docs/sql-server-funcs.html#stringformat
abstract(text[, maxsize[, style[, query]]]) will create an abstract:
- text is the table field from which to create an abstract.
- 0 (or <0) means use the default maximum size of 230 characters.
- 'querymultiple' is a style which will break up the abstract into multiple sections if necessary
- '?query' is replaced with the JavaScript variable 'q' via the parameter object below
*/
sqlStatement = "select url, img_url, title, stringformat('%mbH',?query,abstract(text, 0,'querymultiple',?query)) Ab from pipages where text likep ?query";
} else {
/* if no query, get latest articles */
sqlStatement = "select url, img_url, title, abstract(text) Ab from pipages order by server_date DESC";
}
// by default, only the first 100 rows are returned for any likep search.
// if we are skipping past that, we need to raise the likeprows setting.
if(skip + nres > 100 )
sql.set({likeprows:skip + nres});
else
sql.set({likeprows:100}); //reset to default in case previously set
// sql.exec(statement, params, settings, callback);
sql.exec(
sqlStatement,
// the parameters for each '?query' in the above statement
{query: q},
// options
{maxRows:nres,skipRows:skip,includeCounts:true},
// callback is executed once per retrieved row.
function(res,i,cols,info) {
/* res = {url: www, img_url:xxx, title:"yyy", Ab:"zzz"}
* i = current row, beginning at skip, ending at or before skip + nres
* cols = ["url", "img_url", "title", "Ab"] - the columns returned from the SQL statement
* includeCounts sets info to an object detailing the number of possible matches to a "likep" query. */
// before the first row
if(i==skip) {
icount=parseInt(info.indexCount);
req.printf('<div class="res m-3">');
if(req.query.q)
req.printf('<div class="m-5 mb-0">Results %d-%d of about %d</div>',
skip+1,(skip+nres>icount)?icount:skip+nres,icount
);
else
req.printf('<div class="m-5 mb-0">Latest Articles</div>');
req.printf('<div class="row row-cols-md-3 row-cols-sm-2 row-cols-1 g-4 m-3 mt-0">');
}
req.printf('<div class="col"><div class="card">'+
'<a target="_blank" href="%s">'+
'<img class="card-img-top img-cover" src = "%s">' +
'</a>' +
'<div class="card-body">'+
'<a class="urla tar" target="_blank" href="%s">%s</a>'+
'<span class="urlsp">%s</span>'+
'<p class="card-text">%s</p>'+
'</div>'+
'</div></div>',
res.url, res.img_url, res.url, res.title, res.url, res.Ab);
}
);
// check if there are more rows. If so, print a 'next' link.
if (icount > nres+skip) {
skip+=nres
// %U is for url encoding. See https://rampart.dev/docs/rampart-utils.html#printf
req.printf('</div><div class="m-3 mt-0">' +
'<a class="m-3" href="/apps/pi_news/search.html?q=%U&skip=%d">Next %d</a>' +
'<div class="m-3"> </div></div>',
req.query.q,skip, (nres > icount - skip ? icount - skip: nres)
);
}
endhtml='</div></div></body></html>';
// send the closing html and set the mime-type to text/html
// This is appended to everything already sent using req.printf()
return({html:endhtml});
// alternatively, it might be sent like this:
// req.put(endhtml);
// return({html:null}); //null means just set the mime-type, but don't append
}
//export the main search function
module.exports=search;
// -fin-
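After starting the server with start_server.sh, the exported search() function is reachable through the /apps/ path with the .html mapping used by the form above. The exact port depends on your web_server_conf.js; 8088 below is only an assumption:
http://localhost:8088/apps/pi_news/search.html
http://localhost:8088/apps/pi_news/search.html?q=pi+4&skip=12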
Improvements¶
We purposely kept these examples very simple in order to clearly demonstrate the concepts we covered, so there is room for improvement:
- Save the HTML of each article in our table.
- Include a more robust treatment of HTTP status codes.
- Error check the results of the parsed properties before insert or update.
- Optimize the update of the fulltext index with a conditional, e.g., ALTER INDEX pipages_text_ftx OPTIMIZE HAVING COUNT(NewRows) > 30; so that there is a balance between the time it takes to optimize the index and the number of rows that will be linearly searched. This would only be an issue after the database grows much larger than anticipated in this tutorial (a sketch follows this list).
- Create a better strategy should the fetch of /robots.txt not return 200.
- Examine sql.errMsg after each call to sql.exec().
- Have a different format for the latest articles and the search results (so that search results look more like the return from a search).
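A minimal sketch of that conditional optimize, assuming the same sql handle as the aggregator script, might look like:
// a sketch: only rebuild the fulltext index when enough new rows have accumulated
function optimize_index() {
    sql.exec("ALTER INDEX pipages_text_ftx OPTIMIZE HAVING COUNT(NewRows) > 30;");
    if (sql.errMsg) {
        printf("index optimize error:\n%s\n", sql.errMsg);
        sql.errMsg = '';
    }
}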
Enjoy!