Quickie: Data Mining on the Web with PERL
This is a simple web mining script written in Perl.
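#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Number of node pages to fetch, taken from the first command line argument.
my $numPages = shift @ARGV or die "Usage: $0 <numPages>\n";
open(my $out, '>', '/home/user/out.html') or die "Cannot open output file: $!";
for my $i (1 .. $numPages) {
    print "$i\n";    # progress indicator
    my $content = get("http://coderswasteland.com/node/$i");
    next unless defined $content;    # get() returns undef when the fetch fails
    print $out $content;
    print $out "******************\n";    # separator between pages
}
close $out;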
This code is useful for pulling information from a Drupal website that serves its articles at sequential /node/N URLs. It takes the number of pages to crawl as a command line argument and uses that to increment through the site, grabbing articles. The content is then all saved to a single output file, each page separated by a line of asterisks, from which you may do whatever regex or parsing is necessary to achieve your desired result.
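As one way to work with the combined file afterwards, the sketch below splits it back into one file per page, using the line of asterisks as the boundary. The function name split_pages and the page_N.html naming are made up for illustration, and it assumes that a line consisting only of asterisks never occurs inside a page's own HTML.

```shell
#!/bin/sh
# Hypothetical helper: split the crawler's combined output into
# page_1.html, page_2.html, ... inside <output-dir>.
# Usage: split_pages <combined-file> <output-dir>
split_pages() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        BEGIN { n = 1 }
        # A line made up only of asterisks ends the current page.
        /^\*+$/ { close(out); n++; next }
        # Every other line is appended to the current page file.
        { out = dir "/page_" n ".html"; print > out }
    ' "$1"
}
```

For example, split_pages /home/user/out.html /home/user/pages would leave one file per fetched node in /home/user/pages.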
I have a much longer version of this which does parsing and builds feeds. You may also wish to forgo writing to an output file then later reading from it and just process the information from within $content, split it into logical pieces, etc. This script is merely to get you started. If you're looking into web mining, I assume you already know about parsing and other topics, and this is just a quick way to grab web content.
A spiffier version would follow links on a page and pop info onto a tree rather than auto-incrementing a URL. This is left as an exercise for the reader :)
Quickie: Batch Renaming Files in Linux
For those wanting to be able to quickly and efficiently rename a bunch of files all at once, below is some example code to help you achieve that:
ls *.xml | sed 's/\(testing\)\(.*\)/mv \1\2 production\2/' | sh
The first piece ls *.xml simply lists the file types we want to change. You can, of course, alter this to anything you'd like.
The second part uses sed to search for a piece of text and replace it with another. There are numerous regex tutorials around the net.
Within the sed expression, the pattern is split into two groups so that they can be reused in the replacement: group 1 is the word "testing" and group 2 is everything after it. The replacement builds an mv command whose first argument is the original file name (\1\2) and whose second argument is the word "production" followed by whatever was in group 2 (production\2).
Part 3 simply executes this.
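Before executing anything, it can help to look at the commands sed generates. The sketch below runs the same sed expression without the final "| sh", so nothing is renamed; the function name preview_renames is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn file names on stdin into the mv commands
# the pipeline would run, without executing them.
preview_renames() {
    sed 's/\(testing\)\(.*\)/mv \1\2 production\2/'
}

# Prints: mv testing_1.xml production_1.xml
printf 'testing_1.xml\n' | preview_renames
```

In a real directory, ls *.xml | preview_renames shows the full list of commands to check before adding | sh back on.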
The end result renames a sample file from:
testing_1.xml
to
production_1.xml
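One caveat with the pipeline: it builds commands as text and hands them to sh, so file names containing spaces can break it, and any .xml file that does not contain "testing" passes through sed unchanged and gets handed to sh as if it were a command. A plain shell loop avoids both problems. This is a minimal sketch, not the original method; the function name rename_prefix is hypothetical, and it only handles names that start with the old prefix.

```shell
#!/bin/sh
# Hypothetical helper: rename <old-prefix>*.xml to <new-prefix>*.xml in <dir>.
# Usage: rename_prefix <dir> <old-prefix> <new-prefix>
# The body runs in a subshell, so the cd does not affect the caller.
rename_prefix() (
    cd "$1" || exit 1
    for f in "$2"*.xml; do
        [ -e "$f" ] || continue      # the glob matched nothing
        rest=${f#"$2"}               # file name with the old prefix removed
        mv -- "$f" "$3$rest"
    done
)
```

Calling rename_prefix . testing production in the directory from the example turns testing_1.xml into production_1.xml and leaves unrelated files alone.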
The Software Development Process as Compared to Traditional Manufacturing - Scribd
Software Process Model and Metrics Adoption for Small Software Organizations - Scribd Version
Quickie: Burning Video DVDs Command Line Linux
Make sure you have dvdbackup and growisofs/mkisofs. (Your usual sudo apt-get install in Debian/Ubuntu).
Simply:
dvdbackup -i /dev/dvd -M -o </your/output/path>
growisofs -speed=1 -dvd-compat -Z /dev/dvdrw -dvd-video </your/output/path/videodir>
Example:
dvdbackup -i /dev/dvd -M -o /home/steven/Documents/VID/
growisofs -speed=1 -dvd-compat -Z /dev/dvdrw -dvd-video /home/steven/Documents/VID/MY_BACKUP_VIDEO/
The second path is where the first command wrote to. Basically it will be one folder deeper than that first path. It is the path that contains the VIDEO_TS directory.
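If you would rather not type that second path by hand, a small helper can look it up. The sketch below prints the first directory directly under the output path that contains a VIDEO_TS directory; the function name find_video_dir is made up for illustration, and it assumes the backup sits exactly one level below the output path, as described above.

```shell
#!/bin/sh
# Hypothetical helper: print the directory under <output-path> that
# contains VIDEO_TS, or return non-zero if there is none.
# Usage: find_video_dir <output-path>
find_video_dir() {
    for d in "$1"/*/VIDEO_TS; do
        [ -d "$d" ] || continue      # skip an unmatched glob or a non-directory
        dirname "$d"
        return 0
    done
    return 1
}
```

The printed path is what goes at the end of the growisofs command, for example "$(find_video_dir /home/steven/Documents/VID)".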
Referenced from:
http://ubuntuforums.org/showthread.php?t=133642
Software Process Model and Metrics Adoption for Small Software Organizations
PDF Available here.
TITLE
Software Process Model and Metrics Adoption for Small Software Organizations
ABSTRACT
This paper discusses reasons why small organizations often do not adopt the software process models and metrics that are more prevalent and expected in larger organizations, and which models they may adopt or alter in order to obtain the manageability that a full-scale software process model and metrics solution offers. It addresses why smaller software organizations may be apprehensive about adopting the models, why many existing models do not work, what models may be applied and how, why they should adopt a model, and ultimately successes found after software process model adoption.
Quickie: Multiple File Search and Replace
If you've ever needed to change the same data across multiple files, you know how hard it is to go through each and every file and update it by hand or, at best, run an individual regex on each. Here's how you can make the changes all in one fell swoop.
perl -pi -w -e 's/searchregex/replaceregex/g;' *.fileextension
-w display any warnings
-i in place edit
-p loop over every line of each file and print it
-e execute this line of code
Here's an example:
perl -pi -w -e 's/\x0D//g;' *.txt
This looks for those pesky ^M characters (carriage returns, hex 0D) and replaces them with nothing.
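Since -i edits files in place, it may be worth checking first which files would actually change. The sketch below lists the files that contain a carriage return; the function name files_with_cr is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given files that contain
# at least one carriage return.
# Usage: files_with_cr file...
files_with_cr() {
    cr=$(printf '\r')        # a literal carriage return (hex 0D)
    grep -l -- "$cr" "$@"
}
```

Run files_with_cr *.txt before the one-liner above. For a safety net, Perl's -i switch also accepts a suffix: perl -pi.bak -w -e 's/\x0D//g;' *.txt keeps each original as a .bak copy next to the edited file.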
The Software Development Process as Compared to Traditional Manufacturing
In the STSC CrossTalk article What Engineering Has in Common With Manufacturing and Why It Matters, Dr. Alistair Cockburn explains that software engineering has common threads with traditional manufacturing. He states that decisions in the software development cycle are analogs of parts in a manufacturing line in that “both flow through a network, wait in queues at bottlenecks, [and] have throughput delays”.
Quickie: Strip ^M (Control-M) Characters from Input File with PERL
For anyone who does file I/O and sometimes has to work with Windows-generated files in Linux, I feel your pain. Windows has these little nuances that sometimes make our GNU/Linux world a fun place to live. Luckily, Perl has a simple little system in place that allows us to remove control characters: regular expressions. Those not familiar will find great references at http://www.regular-expressions.info/ (a fantastic place to begin) and http://www.regextester.com/ (where you can test your brilliant work).
People who just want a quick piece of code, look no further:
If you want to do it all in one line from the CLI (of course replace *.txt with whatever extension):
perl -pi -w -e 's/\x0D//g;' *.txt
If you'd rather do it inline in a Perl script:
#Good Code
$yourLine =~ s/\x0D//g; #strips ^M characters
Simply trying to strip ^M characters with
#Bad Code
$yourLine =~ s/\^M//g; #tries to match a literal caret followed by the letter M
unfortunately does not work, because ^M is only how editors display the carriage return: it is a single control character (hex 0D), not a caret followed by the letter M. The previous hex value works great for me. I've run into this problem many times while taking third-party data feeds, which are sometimes generated in Windows, and trying to preprocess them in my GNU/Linux environment. The ^M gets interpreted as a new line and wreaks havoc on feeds where you expect all the information to be on one line in a fixed number of columns.
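To see the character for yourself, you can dump a Windows-style line byte by byte. The sketch below does that with od, then strips the carriage return with tr, which is an alternative to the Perl substitution when the data is coming through a pipe. The function name strip_cr is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout with every carriage return removed.
strip_cr() {
    tr -d '\r'
}

# A Windows-style line ends in CR LF; od -c shows the CR as \r,
# a single byte rather than the two characters "^" and "M".
printf 'a,b\r\n' | od -c

# After stripping, only the LF is left at the end of the line.
printf 'a,b\r\n' | strip_cr | od -c
```

In a feed pipeline this would look like strip_cr < feed.txt > feed.clean.txt, where both file names are placeholders.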
For more information on control characters, please go to: http://www.cs.tut.fi/~jkorpela/chars.html where Jukka Korpela explains in-depth what control characters are, what issues you may have, and how you may go about resolving them.