Extracting posts from a Cnblogs (博客园) backup

Overview
Program code

Overview

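I have written a number of posts on cnblogs and wanted to back them up to GitHub. Fortunately most of the posts are in Markdown, and cnblogs supports exporting a backup, but the export is a single XML file.
To split each post out into its own file, I wrote a small program.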
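
As a rough illustration of what pulling posts out of one XML file involves, here is a minimal sketch. It assumes the export is an RSS-style document with an rss > channel > item nesting, which is the layout the struct tags in the program below imply. The `sample` document is hand-written for this sketch and is not real export data; the names `item`, `rss` and `parseBackup` are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// item holds only the fields this sketch needs from each <item>.
type item struct {
	Title       string `xml:"title"`
	PubDate     string `xml:"pubDate"`
	Description string `xml:"description"`
}

// rss mirrors the assumed <rss><channel><item> nesting.
type rss struct {
	Items []item `xml:"channel>item"`
}

// sample is a hand-written stand-in for a backup file, not real export data.
const sample = `<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>demo blog</title>
    <item>
      <title>first post</title>
      <pubDate>Tue, 25 Jun 2019 01:27:00 GMT</pubDate>
      <description><![CDATA[# heading]]></description>
    </item>
    <item>
      <title>second post</title>
      <pubDate>Wed, 26 Jun 2019 08:00:00 GMT</pubDate>
      <description><![CDATA[body text]]></description>
    </item>
  </channel>
</rss>`

// parseBackup decodes the XML document into a slice of items.
func parseBackup(data []byte) ([]item, error) {
	var doc rss
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	return doc.Items, nil
}

func main() {
	items, err := parseBackup([]byte(sample))
	if err != nil {
		fmt.Println(err)
		return
	}
	// Each item becomes one candidate output file.
	for _, it := range items {
		fmt.Printf("%s | %s | %s\n", it.Title, it.PubDate, it.Description)
	}
}
```

The CDATA wrapper around the post body needs no special handling here: the decoder hands its content back as the element's character data.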

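GitHub Pages expects posts in a particular format in order to categorize them correctly:

The file name must start with the date, and the file content must begin with a block of metadata describing the post.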
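
To make the two requirements concrete, here is a small sketch that builds a date-prefixed file name and a metadata block. The keys (`layout`, `title`, `date`, `categories`, `tags`) follow the ones the program below writes; the helper names `postFileName` and `frontMatter`, the sample date and the sample title are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// postFileName turns an RFC 1123 publication date and an already
// sanitized title into a "YYYY-MM-DD-title.md" file name.
func postFileName(pubDate, title string) (string, error) {
	t, err := time.Parse(time.RFC1123, pubDate)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%s.md", t.Format("2006-01-02"), title), nil
}

// frontMatter renders the metadata block that opens each post file.
// %q wraps the title in double quotes and escapes any quotes inside it.
func frontMatter(title, date, category string) string {
	return fmt.Sprintf("---\nlayout: post\ntitle: %q\ndate: %s\ncategories: %s\ntags: %s\n---\n",
		title, date, category, category)
}

func main() {
	// Hypothetical input values, chosen only for the demonstration.
	name, err := postFileName("Tue, 25 Jun 2019 01:27:00 GMT", "hello-world")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(name)
	fmt.Print(frontMatter("hello world", "2019-06-25", "linux"))
}
```

Quoting the title with `%q` is a design choice of this sketch, intended to keep titles that contain a double quote from breaking the metadata block.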

Program code

The program is written in Go (Golang); the code is as follows:

// cnblogs2githubpages project main.go
package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
    "io/ioutil"
    "os"
    "strings"
    "time"
)

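// For encoding/xml to decode into a struct, its field names must be
// exported, i.e. start with an uppercase letter.
// Post is a single blog post.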
type Post struct {
    XMLName     xml.Name `xml:"item"`
    Title       string   `xml:"title"`
    Link        string   `xml:"link"`
    Creator     string   `xml:"creator"` // matches <dc:creator>; tags match on the local name, without the prefix
    Author      string   `xml:"author"`
    PubDate     string   `xml:"pubDate"`
    Guid        string   `xml:"guid"`
    Description string   `xml:"description"` // CDATA content is decoded as ordinary character data
}

type Blogs struct {
    XMLName       xml.Name `xml:"channel"`
    Title         string   `xml:"title"`
    Link          string   `xml:"link"`
    Description   string   `xml:"description"`
    Language      string   `xml:"language"`
    LastBuildDate string   `xml:"lastBuildDate"`
    PubDate       string   `xml:"pubDate"`
    Ttl           string   `xml:"ttl"`
    Items         []Post   `xml:"item"`
}
type RSS struct {
    XMLName xml.Name `xml:"rss"`
    Blogs   Blogs    `xml:"channel"`
}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("usage: cnblogs2githubpages <backup.xml>")
        return
    }
    backupxml, err := ioutil.ReadFile(os.Args[1])
    if err != nil {
        fmt.Println(err.Error())
        return
    }
    fmt.Println(len(backupxml))

    b := RSS{}

    err = xml.Unmarshal(backupxml, &b)
    if err != nil {
        fmt.Println(err.Error())
        return
    }
    fmt.Println(len(b.Blogs.Items))

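    // Export the posts one by one
    for i := range b.Blogs.Items {
        var item = &(b.Blogs.Items[i])
        t, parseErr := time.Parse(time.RFC1123, item.PubDate)
        if parseErr != nil {
            // Skip the post instead of writing it under the zero date
            fmt.Println(parseErr.Error())
            continue
        }
        postdate := t.Format("2006-01-02")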
        // fmt.Printf("%s\n\t%s\n\t%s\n\t%s\n\t%s\n", date, item.Title, item.Link, item.Author, item.Description[0:64])
        postTitle := strings.ReplaceAll(item.Title, " ", "-")
        postTitle = strings.ReplaceAll(postTitle, "*", "")
        postTitle = strings.ReplaceAll(postTitle, "/", ".")
        postTitle = strings.ReplaceAll(postTitle, "\\", "")
        postTitle = strings.ReplaceAll(postTitle, "$", "")
        postTitle = strings.ReplaceAll(postTitle, "?", "")
        postTitle = strings.ReplaceAll(postTitle, ":", "-")
        postTitle = strings.ReplaceAll(postTitle, "。", "")
        filename := fmt.Sprintf("./%s-%s.md", postdate, postTitle)
        fmt.Println(filename)

        // Pick a simple category from the post title (only suits this particular blog)
        var categories string = "其它"
        {
            title2 := strings.ToLower(item.Title)
            if strings.Contains(title2, "live555") {
                categories = "live555"
            } else if strings.Contains(title2, "linux") || strings.Contains(title2, "ubuntu") {
                categories = "linux"
            } else if strings.Contains(title2, "gcc") || strings.Contains(title2, "git") ||
                strings.Contains(title2, "编程") || strings.Contains(title2, "编译") ||
                strings.Contains(title2, "vc") || strings.Contains(title2, "c++") ||
                strings.Contains(title2, "visual") || strings.Contains(title2, "程序") {
                categories = "编程"
            } else if strings.Contains(title2, "gdal") || strings.Contains(title2, "proj") ||
                strings.Contains(title2, "gis") || strings.Contains(title2, "地理") {
                categories = "地理信息"
            }
        }
        var desc bytes.Buffer

        desc.WriteString("---\r\n")
        desc.WriteString("layout:  post\r\n")
        desc.WriteString("title:  \"")
        desc.WriteString(item.Title)
        desc.WriteString("\"\r\ndate:  ")
        desc.WriteString(postdate)
        desc.WriteString("\r\ncategories:  ")
        desc.WriteString(categories)
        desc.WriteString("\r\ntags:  ")
        desc.WriteString(categories)
        desc.WriteString("\r\ncomments: 1\r\n")
        desc.WriteString("---\r\n")
        // Look for an existing [TOC] marker so the link can be placed right after it
        tocIndex := strings.Index(item.Description, "[TOC]")
        if tocIndex != -1 {
            tocIndex += len("[TOC]")
            desc.WriteString(item.Description[0:tocIndex])
            desc.WriteString("\r\n[博客园原文地址 ")
            desc.WriteString(item.Link)
            desc.WriteString("](")
            desc.WriteString(item.Link)
            desc.WriteString(")\r\n\r\n")
            desc.WriteString(item.Description[tocIndex:])
        } else {
            desc.WriteString("\r\n[TOC]\r\n[博客园文章地址 ")
            desc.WriteString(item.Link)
            desc.WriteString("](")
            desc.WriteString(item.Link)
            desc.WriteString(")\r\n")
            desc.WriteString(item.Description)
        }
        // 0644: a Markdown file does not need the execute bits that os.ModePerm (0777) sets
        err := ioutil.WriteFile(filename, desc.Bytes(), 0644)
        if err != nil {
            fmt.Println(err.Error())
        }
    }
}
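
The title clean-up and the category lookup in the program above can also be written in a table-driven way. The following is only a sketch of that alternative, not the code the program uses: `sanitizeTitle`, `categorize`, `rule` and `rules` are hypothetical names, while the replacement pairs and keywords are copied from the listing above.

```go
package main

import (
	"fmt"
	"strings"
)

// titleReplacer applies the same substitutions as the chain of
// strings.ReplaceAll calls, in a single pass over the title.
var titleReplacer = strings.NewReplacer(
	" ", "-",
	"*", "",
	"/", ".",
	"\\", "",
	"$", "",
	"?", "",
	":", "-",
	"。", "",
)

// sanitizeTitle makes a post title usable as part of a file name.
func sanitizeTitle(title string) string {
	return titleReplacer.Replace(title)
}

// rule maps a category to the keywords that select it.
type rule struct {
	category string
	keywords []string
}

// rules are checked in order, so earlier entries win, like the
// if / else if chain in the original program.
var rules = []rule{
	{"live555", []string{"live555"}},
	{"linux", []string{"linux", "ubuntu"}},
	{"编程", []string{"gcc", "git", "编程", "编译", "vc", "c++", "visual", "程序"}},
	{"地理信息", []string{"gdal", "proj", "gis", "地理"}},
}

// categorize returns the first category whose keyword occurs in the
// lowercased title, or "其它" when nothing matches.
func categorize(title string) string {
	lower := strings.ToLower(title)
	for _, r := range rules {
		for _, kw := range r.keywords {
			if strings.Contains(lower, kw) {
				return r.category
			}
		}
	}
	return "其它"
}

func main() {
	fmt.Println(sanitizeTitle("C++: a/b?"))
	fmt.Println(categorize("Ubuntu 安装 GDAL"))
}
```

A single replacer pass is equivalent to the sequential calls here because none of the replacement outputs ("-" and ".") is itself a character that a later rule rewrites. Adding a category then means adding one row to `rules` rather than another branch.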

Original post: https://www.cnblogs.com/oloroso/p/11079838.html

Time: 2024-10-01 02:22:51
