gusl | bleg: web mirroring tool

gusl's journal

My current project is to annotate web pages. Since these pages could change, go down, etc., I need to make a static mirror.

I have used WebSuck+WebGet, which mirror the HTML files they find. I imagine this worked great 10 years ago, before the era of dynamically generated web content.

This setup has a few problems:
* If it visits a URL that ends in "/" (i.e. one served from index.html or similar), it doesn't know to save the file as index.html.
* If it visits a dynamically generated page, it doesn't save the content with an .html extension. Keeping PHP pages as .php and serving them would mean setting up a server, etc., which is a bad idea. The ideal solution is to rename the saved .php file (its content is static once saved) and fix the links to it.
* It doesn't fix the links to point to content in the mirror. This shouldn't be too hard to do with a search-and-replace script.
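
For concreteness, here is a rough Python sketch of what fixing all three problems might look like. It is not how WebSuck or WebGet work. The function names, the list of "dynamic" extensions, and the way query strings are folded into filenames are all assumptions made for illustration, and the regex-based link rewriting is the crude search-and-replace approach, not a real HTML parser.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Assumed list of extensions that really deliver generated HTML.
DYNAMIC_EXTENSIONS = {".php", ".asp", ".aspx", ".jsp", ".cgi"}

def url_to_local_path(url):
    """Map a URL to a relative file path inside the mirror (hypothetical scheme)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"          # directory URLs become index.html
    root, ext = posixpath.splitext(path)
    if parts.query:
        # Fold the query string into the name so page.php?id=3 and
        # page.php?id=4 do not overwrite each other.
        root += "_" + re.sub(r"[^A-Za-z0-9]+", "_", parts.query).strip("_")
    if ext.lower() in DYNAMIC_EXTENSIONS:
        ext = ".html"                 # saved output is plain HTML
    return parts.netloc + root + ext

LINK_RE = re.compile(r'''(href|src)=(["'])(.*?)\2''', re.IGNORECASE)

def rewrite_links(html, page_url, mirrored_urls):
    """Point links at mirrored copies; leave all other links untouched."""
    page_dir = posixpath.dirname(url_to_local_path(page_url))

    def replace(match):
        attr, quote, link = match.groups()
        absolute = urljoin(page_url, link)
        target, _, fragment = absolute.partition("#")
        if target not in mirrored_urls:
            return match.group(0)
        relative = posixpath.relpath(url_to_local_path(target), page_dir)
        if fragment:
            relative += "#" + fragment
        return f"{attr}={quote}{relative}{quote}"

    return LINK_RE.sub(replace, html)
```

With this scheme, `http://example.com/docs/` would be saved as `example.com/docs/index.html`, and a link on that page to `view.php?id=3` would be rewritten to `view_id_3.html` as long as that URL is in the set of mirrored pages.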

Any ideas?

From:

dachte.livejournal.com

On the PHP matter - while it may save pages served by PHP under that extension, they're not the original PHP files but rather the generated HTML. You should be able to view them locally as HTML, despite their extension not being .html.

You might be underestimating the difficulty of properly renaming the PHP files in a directory that may hold a mix of PHP and HTML files and may be updated over many separate runs.
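
One way to picture the difficulty: view.php and view.html in the same directory both want to become view.html, and a later run has to pick the same names as an earlier one or the rewritten links break. A minimal Python sketch of one possible workaround is a URL-to-filename table saved between runs. The `Manifest` class, its JSON file, and the `-2` suffix convention are all hypothetical, not part of any existing tool.

```python
import json
import os

class Manifest:
    """URL -> local path table, persisted as JSON between mirroring runs."""

    def __init__(self, filename):
        self.filename = filename
        self.paths = {}
        if os.path.exists(filename):
            with open(filename, encoding="utf-8") as f:
                self.paths = json.load(f)

    def assign(self, url, preferred_path):
        """Return the stable local path for url, choosing one if it is new."""
        if url in self.paths:
            return self.paths[url]    # reuse the name picked on an earlier run
        taken = set(self.paths.values())
        root, ext = os.path.splitext(preferred_path)
        candidate, n = preferred_path, 1
        while candidate in taken:     # e.g. view.php and view.html collide
            n += 1
            candidate = f"{root}-{n}{ext}"
        self.paths[url] = candidate
        return candidate

    def save(self):
        with open(self.filename, "w", encoding="utf-8") as f:
            json.dump(self.paths, f, indent=2, sort_keys=True)
```

The renaming step would call `assign` for each downloaded URL and `save` at the end of the run, so the next run loads the same table and reuses the same names.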

From: (Anonymous)

So, if your script visits a .php page and saves it as .php, that's fine.
Just keep in mind that, as Pat mentioned, what you're getting is HTML and not the PHP script (since PHP is interpreted server-side). To download the script itself, you'd need to get your hands on its .phps (source) version, if the server exposes one.