Tools for Website Content Cleanup? - resource-cleanup

I am working with a client to migrate a web site from the existing production hardware into a new hardware environment. Now seems like an excellent time to perform an audit and remove any old or obsolete content rather than just blindly copy it again.
Are there any good free tools or scripts I can use to compare the web accessible content on a server to the actual files on a server to see what content is actually being linked to and used?
Thanks in advance for any help!

Well, for starters you can use a tool like Xenu's Link Sleuth to spider all of your pages to find broken links and the like. We used this tool on our intranet to find and fix our broken links. It's free and gets the job done.
Another tool that we have used for migrations between systems is a search engine. A good search engine will spider all of your pages and show the two-way relationship between links. This can help you find what content is being linked to the most and what is possibly orphaned. Unfortunately, these kinds of tools are not free.
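If you already have a list of reachable URLs from a spider, the comparison against the files on disk can be sketched in a few lines of Python. This is only a minimal sketch: the function name `find_orphans` is made up for illustration, and it assumes URL paths map one-to-one onto files under the document root (no rewrite rules, no dynamically generated pages), which may not hold for your site.

```python
import os
from urllib.parse import urlsplit, unquote


def find_orphans(doc_root, crawled_urls, index_names=("index.html",)):
    """Return files under doc_root that no crawled URL maps to.

    Assumption: URL paths correspond directly to files under doc_root.
    """
    linked = set()
    for url in crawled_urls:
        path = unquote(urlsplit(url).path).lstrip("/")
        if path == "" or path.endswith("/"):
            # A directory URL is normally served by an index file.
            for name in index_names:
                linked.add(os.path.normpath(os.path.join(path, name)))
        else:
            linked.add(os.path.normpath(path))

    orphans = []
    for folder, _dirs, files in os.walk(doc_root):
        for name in files:
            rel = os.path.relpath(os.path.join(folder, name), doc_root)
            if os.path.normpath(rel) not in linked:
                # Report paths with forward slashes on every platform.
                orphans.append(rel.replace(os.sep, "/"))
    return sorted(orphans)
```

Anything it reports is only a candidate for removal; files reached through forms, scripts, or external links will not show up in a crawl, so review the list by hand before deleting.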

I'm sure there are tools, but I doubt any of them could do a better job than you could yourself. How big is this site, and did you code it yourself?


Need to retrieve content from specific news sources / blogs etc. Third party software, or build my own?

Looking for some guidance. In a nutshell, I've got a requirement to get article content from specific sources that will be used for data analysis. So we've got to get the latest articles and store them in our database for processing later on.
I'm not really sure of the best approach. Our code for current news retrieval (from a newsfeed provider) runs from C on UNIX. Basically it uses cURL and parses the XML for storage in a database.
But the solution I need now is different. Every website is different obviously. Basically I just want to be able to have a cron job that will call something that will get the latest articles from the relevant website as required.
Any ideas appreciated. I'm also currently looking at AutomationAnywhere perhaps as a quick solution if it works for us.
iMacros is a good solution for web scraping.
You can run iMacros for Firefox (free/open-source) on Linux and control it via the command line.
On Windows you can also use the paid Scripting Edition, which gives you extracting wizards and support for Flash automation etc.
Take a look at the IRobotSoft visual web scraper. It will give you a quick start.
Since each website is different, it would take a lot of effort to set up a robust scraping solution. A simple alternative is to find the RSS / Atom feed for each website so you can extract article content in a consistent format. If no feed is available for a website, you could skip it or fall back to scraping.
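To illustrate the feed approach, here is a minimal Python sketch that pulls title, link, and publication date out of an RSS 2.0 document using only the standard library. The function name `parse_rss_items` and the dictionary keys are my own choices, and it handles plain RSS 2.0 only; Atom feeds use namespaced elements and would need separate handling.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return a list of {title, link, published} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the element is missing.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items
```

A cron job could fetch each feed (for example with `urllib.request.urlopen(feed_url).read()`), pass the bytes to this function, and insert any links it has not seen before into the database.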

Embed Google/ Yahoo search into a web site or build your own

I am looking for an opinion on whether to use Google Custom Search, Yahoo Search Builder, or build my own for web projects (no more than 100 pages of content). If I should build my own, do you have any quick-start kits you could recommend?
Many thanks
I have had success using OpenSearch for my personal blog.
While working at BigCorp we used dedicated search appliances in yellow boxes, but in your case (around 100 pages) it does not make sense to take such a route.
I would suggest going with either Google Custom Search, or Yahoo Search Builder (as long as they both index your site sufficiently to provide good results).
More often than not, you'll get better results and you don't have to worry about building your own custom engine (or implementing an off the shelf/open source piece of software to do the job for you).
I've used IBM OmniFind Yahoo Edition and had fantastic results with it. You are limited to a single index per implementation, but it's very fast, easy to integrate with, and extensible in terms of search customization. I've used it with an ASP.NET site without issue. One caveat is that it needs to be installed on the server and run as a service, so it is out of the question for most shared hosting. It has the indexing capabilities of general search engines (pdf/html/etc), which is very nice.
I forgot to mention that some of the reasons I liked it vs other options is that it is free and doesn't require any additional hardware, just FYI.
The main situation I see Google/Yahoo as being sub-optimal is when your site relies on up-to-the-minute results. You're at the mercy of their crawling policies/speed/etc. If that's okay (and I suspect it will be for most 100ish page sites), use them - the results will be great. If realtime results are important, you may have to bite the bullet and install something locally.
Yahoo BOSS is cheaper and is recommended by many people.
I am going to integrate it soon.

Recommended Web Based Time/Task Management Solution For Personal Use?

I know this is less programming related and more time management related, but I value the feedback of the users on this site. I'm finding myself particularly busy this semester, managing various tasks and timelines between work and school. Further, I find myself running around between labs, work, home, libraries, etc. For these reasons I think a web based solution is ideal. Which leads to the question, do you have a recommended web based solution for task/project management for personal use? Ideal would be free (or nearly), and one which I could install on my server. Accounting is not a requirement, just management of time and tasks (gantt charts would be great). However, svn integration would be really good as I keep my school work in there.
I put a bounty on this, as a good solution would be very valuable to me (more so than the rep). The answers so far have been great, but none fit just right for personal use. Ideal would be something that manages files and time and that I could self host. Support for both Mac and PC would be a plus, as we use Mac OS X in the lab on campus. Currently SVN and a web based time manager seems to be the way to go.
First, thank you for the great responses! FogBugz, Trac, and Request Tracker were either strongly recommended or suggested more than once. Trac and Request Tracker are also free and self hosted; however, their strengths seem to lie in team development. I’m going to give FogBugz on Demand a shot and see how that works out. I am also going to start using the Drop Box as suggested, great idea! Further, SVN will be used for the longevity of ‘everything’. This should work around the 2 GB limit and satisfy my desire for self hosting. I am considering SVNNotifier for keeping machines current. Thanks again!
I have decided to use FogBugz as suggested by Zabbala. As he said, it really does everything I want from task tracking to time management. It’s an amazing and generous free offering from FogCreek. Thank you everyone, I really appreciate all of the feedback.
Just to follow up on this item. I have been using FogBugz for two weeks now; couldn’t be happier. I have started using LiveScribe’s SmartPen in conjunction with FogBugz. I am keeping longer term items within FogBugz, and copy the current week’s items in a ‘journal’ for the road. The SmartPen makes digitizing the 'journal' painless.
For managing files I am using a mixture between FogBugz, DropBox, SVN, and Unison. I use DropBox to share across the net, say between home and the lab, and Unison to synch up the DropBox folder within the larger SVN working copy (not everything is in DropBox).
I've been using FogBugz much less, in favor of the LiveScribe journal. It seems that traditional pen and paper has an efficiency to it that's hard to beat. Regarding LiveScribe, I have had a few problems with it, which have left me uncomfortable with their file format. Their files are obscured in both the naming convention and the format. If they had an open format I would feel much better about entrusting it with my data.
I've switched to using MediaWiki to document my work/time/research. For task management I have been using the Google app Insightly. The combination feels more natural and has 'stuck'. The wiki route is really useful...
I've switched to using Trello boards for task management. I've drifted away from using MediaWiki after a mistake on my part caused the database to be lost.
I would use FogBugz OnDemand. It's free for 2 users and does everything you want from task tracking to time management. I use it myself for various pet projects and it meets all my requirements, plus it's extremely easy to set up.
Try Remember The Milk. It's a good tool and has a number of useful interfaces.
When it comes to working across many computers, I love Dropbox. A free account gives you an ample 2 GB of space synchronized across your computers (Windows, Mac, Linux). This won't solve your time management problems, but it could be the cornerstone of another solution. So if you find a desktop application that you like, you might be able to synchronize the files across your machines using Dropbox and make it a "web" solution.
For example, KeePass works really well with Dropbox. You can synchronize your encrypted password database across computers so your passwords are always up-to-date.
This whole scheme was introduced to me by Lifehacker, by the way.
Backpack is a good one from 37signals. They have free accounts, multiple users and an API. I am not sure about SVN access though.
You also might think about Trac. It plugs in well with SVN, and although it is made more for development, it would work well for your needs I think.
Update: You mentioned that Trac is geared towards teams, and while true, I don't think that is really a bad thing. I don't think there are any features in it that really require multiple users or that would slow you down from using it on your own. And if you ever need to collaborate with someone else it will already be set up to do it.
I'm currently evaluating TargetProcess and it seems really nice! It's an Agile project management application so it might do more than what you want.
I was in almost the same situation you are about 6 months ago. I was overwhelmed with keeping track of my projects and tasks and needed something that would enable me to track projects, the subtasks involved, and my progress on them. I also needed something that would let me collaborate with others as necessary and that was customizable.
I'm a developer, so I knew that SVN was a must. I wanted a PM system that integrated with SVN and would have preferred it to be self-hosted. I started out with Fogbugz on Demand just to give it a try, but it was overkill for my needs and I never felt like I was using it as I should have. Don't get me wrong, the system is beautifully constructed and is better than most PM tools out there, but it wasn't for me.
After trying a bunch of other options, I finally decided on Redmine. It is a PM system built on Ruby on Rails and it is flexible, decent looking, and reasonably fast. It will auto-create SVN repositories for each project you create (if you set it up appropriately) and does Gantt charting for you. Redmine as a PM system for tracking projects and tasks is amazing. The only thing I didn't like was its lack of a timing system. There is manual time entry, but I wanted a widget to click like a stopwatch to track my time.
I decided to use Harvest as my time tracking solution. They have widgets available for Windows Vista and OS X that make it easy to stay on top of tracking your time. You'll have to set up your projects and clients (sounds like your client is yourself, so you won't have many) in Harvest, but after that you should be good to go. They have a phenomenal set of reports that you can view at any time to see where you're spending your time.
So, that's pretty much it. I use Redmine + Harvest pretty much every day and I haven't been happier.
Multiple projects support
Flexible role based access control
Flexible issue tracking system
Gantt chart and calendar
News, documents & files management
Feeds & email notifications
Per project wiki
Per project forums
Time tracking
Custom fields for issues, time-entries, projects and users
SCM integration (SVN, CVS, Git, Mercurial, Bazaar and Darcs)
Issue creation via email
Multiple LDAP authentication support
User self-registration support
Multilanguage support
Multiple databases support
I am loving redmine
Have you heard about this: Tiddly Wiki?
We are using Request Tracker. It is free, and has an API.
If you can deal with not hosting it yourself --
I used in the past and loved them. But, they aren't free anymore. I have moved my project over to which has the same task management features as well as SVN.
Use the Tasks in GMail. They are useful and pretty lightweight, and you can have a hierarchy of tasks. Good if you are already using GMail.
There is also a Remember The Milk plugin for GMail. Here you can't have sub-tasks, but it's pretty good too, all in all.
Tracks is a Ruby based time tracker that follows the Getting Things Done™ methodology. You can either host it on your own web server, or, if you have Ruby installed on all the computers you plan to use it from, you can run it from a flash drive. It lets you set due dates so it will show you your most pressing task. It has several different methods to organize things, which gives you a lot of flexibility.
Axosoft provides a free personal license for their OnTime 2009 Pro application. It has a Visual Studio add-on and both Windows and web UIs. I use it myself.
Fogbugz has a few plugins like timesprite that let you work on the cases in the system but track the time independently if you want to.
I highly recommend Request Tracker, as did J.J. It can be hosted yourself, and I believe it runs on Windows. (Since it runs on UNIX, it should run on Mac OS.) I don't know of any Gantt chart functionality for it, but I'll bet there are reports for it that could do that.
I'm just answering to plug Request Tracker, not for the bounty. If for some reason you decide to go with RT, make sure you give J.J. the bounty, as he recommended it first!
Agilo for Scrum seems like a good Agile Trac plugin to try.
PositiveWare does a lot of those things: Time management, project management, to do lists, budgeting, simple invoice creation and reporting.
No SVN integration, but it is web based (with an AIR app for power users) so you actually wouldn't have any software to install.
It's generally geared towards PR / Marketing firms, but I use it for my software development group.
I like Google Calendar - you can put in all your deadlines, meetings, appointments etc. It's web based and free. You can have multiple accounts on the same calendar, like work and home, and it will even send reminders by SMS.
There is a Remember the Milk plugin of course for your todo list.
Get yourself a free DropBox account with 2GB storage space (PC or Mac).
Then copy (or create) a free TiddlyWiki in your My DropBox folder.
A TiddlyWiki is a single self contained/updateable html file that you can store just about anything in (supports searching too), excellent for time management, task tracking, knowledge bases etc.
Also, being plain html it is supported in Firefox, IE, Safari etc.
Then, on any new computer, simply install DropBox and you will now have fully synchronized access to the same TiddlyWiki file. eg. Changes/updates you make at school or work are waiting for you on your home PC immediately or once synchronised (if the PC was turned off).
Major advantages:
Foolproof synchronisation across multiple computers.
Cross platform.
Lightweight, only a web browser and Dropbox are required.
Information is stored in non-proprietary html.
Very simple to
No web server required, internet access only.
Nobody has mentioned SlimTimer yet. This is a slick little web tool that is very flexible and easy to use.
The best part IMHO is that each task has a display name in your task list as well as 0..n tags that you can use for reporting. This way, my to-do list has simple names that I can relate to, and each task is tagged with the corresponding project identifier that I must report my time on.
My SlimTimer keeps track of my time spent each day or week, and when I feel like it, I pull up a report and fill the data into my company billing system.
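The tag-based reporting idea is easy to picture in code. The sketch below is not SlimTimer's API or data format; the entry layout (`name`, `tags`, `minutes`) and the function name `minutes_per_tag` are assumptions made up purely to show how friendly task names and project tags can coexist.

```python
from collections import defaultdict


def minutes_per_tag(entries):
    """Sum the minutes logged against each tag.

    Each entry is a dict with hypothetical keys: 'name', 'tags', 'minutes'.
    A task with several tags counts toward every one of them.
    """
    totals = defaultdict(int)
    for entry in entries:
        for tag in entry["tags"]:
            totals[tag] += entry["minutes"]
    return dict(totals)
```

The task list keeps its readable names, while the report groups time by whatever project identifier the billing system expects.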
Microsoft Project within Drop Box also provided an interesting solution if web-access is not needed. This provided excellent timeline management, particularly with task dependencies.
Project Path has also worked great for individual projects.
Maybe it's too late, but for the record, this could help you too.
Track your work, your private projects, calculate costs, send reports by email and more. Follow 3 easy steps to start time tracking:
Swift To-Do List Standard is one of my preferred applications for managing and tracking my tasks. It organizes them in a tree structure, has a friendly interface, and is ideal for my needs.

Create Help and Manual for MVC Application

I have a small application that needs to have a professional looking Help/Manual section. The help would consist of:
I am wondering if there is a free (Easy to learn) tool that can help me produce these documents in HTML format? Any suggestions?
Thanks for the help
For a new MVC application we are designing right now, we plan to use an external help site in wiki form. There are wiki engines like MediaWiki and others. The idea is to have context-sensitive help (a different help page opened from each application page) and also to allow users to add content like formulas and examples afterwards.
The cool thing is that a wiki tracks changes and does the versioning for us for free, so the help can grow while being fully decoupled from our application source code, and users can see who has added what if they want.
In our case, it's only an intranet application, so in fact we have no security issues in the internal network.

A good web data extraction/screen scraper program?

I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program? I've trialed Mozenda, but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.
You can try if you know python.
I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.
I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there are also IE versions), but there are also more full-featured application and "server" versions that have more features and the ability to run unattended.
Here are some other alternatives to consider:
License the data from the provider. Call em up and ask 'em.
Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.
For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.
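If the HTML is simple and you are comfortable with a little scripting, the extraction itself does not need much. Here is a minimal Python sketch using only the standard library that collects the text of every element with a given class. The class names in the usage note (`name`, `price`) are hypothetical; you would have to check the real page's markup, and the script breaks as soon as the site changes its structure.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.results = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)
```

To use it, create `ClassTextExtractor("price")`, call `.feed(page_html)` with the downloaded page, and read the `.results` list; run it once per class you care about (for instance `name` and `price`) and pair the two lists up.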
You can use irobot from IRobotSoft, which is totally free and provides more functionality than other paid software. Watch the demos to see how simple it is.
Questions on their forum were answered very quickly.
Hire a programmer to do it so that there is only a one-off cost. I often see similar projects on freelancing websites like Elance and oDesk.
is free and open source, available on GitHub.
You can try UiPath Studio to address all your scraping issues. The product is built on top of a very powerful SDK dedicated to scraping and UI automation. It comes with a Web Scraping wizard perfect for extracting structured data from web pages. If the data you need to scrape is not structured, then I recommend you use the Screen Scraping wizard. This extraction can be done even in background or in a hidden IE browser.
You can easily develop workflows in the IDE and afterwards execute them separately or integrate them in your application.
You can try my software, FMiner. I've developed it over 5 years. It can record macros and simulate human actions (click, fill, ...) on pages, and there are tutorial videos that show how to use it. You're welcome to evaluate it!
Visual Web Ripper is one of the best scraping tools. I have been using it for the past 5 years to scrape online data.
I would definitely suggest looking at YQL from Yahoo.
It uses markup to define the structure of the webpage, then lets you run queries against it to extract data. It's a pretty neat idea, with lots of actively maintained markup structures for scraping popular sites.
lets you web scrape sites by writing a simple URL.
for example to scrape all the questions from stackoverflow you would write the following into your browser address bar.{}{Printing the data and placement of tree elements}*
What the url does:
Go to
Get all the links like the example provided ("Printing the data...")
Extract the question text into 'ask' column and asker's username into 'username'
Download extracted data .csv file from
Have a look at Visual Web Ripper. It costs some money, but I think it's worth it.
Have you tried Kimono Labs? It's free and pretty quick to set up with an intuitive UI. Kimono basically lets you scrape sites by training an API with CSS selectors created through a point and click interface. It does allow for batch url crawling, pagination, attribute selection, scheduled crawls, etc. and has a bunch of built in integrations.
Try Data Scraping Studio - Freeware tool.
You can create a scraping agent using the point-and-click scraper Chrome extension and then export those agents to a file (*.scraping) for use in the multi-threaded desktop app for batch crawling and more advanced features.
is a web-based web scraper. It currently has limited features, but it's good for scraping a list of data. (example: scrape the list of questions and their authors of
I would like to add features like pagination, a scheduler, regex support, scraping by HTML class or id ...