这段代码适用于虚拟mbox,但不适用于gmail外卖mbox

2024-10-01 22:36:08 发布

您现在位置:Python中文网/ 问答频道 /正文

我有这个代码,可以将mbox转换成JSON。目标是将生成的JSON传输到Mongodb数据库。然而,代码是在一个虚拟的mbox上测试的示例.mbox“而且效果很好。尽管如此,在实际的mbox上测试它时,我有一个意外的输出,它确实生成了JSON文件,但是“跳过JSONification(multipart)中的MIME内容”。。我不想跳过任何东西!你知道吗

import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'antonita.mbox'
OUT_FILE = MBOX + '.json'

def cleanContent(msg):

    # Decode message from "quoted printable" format, but first
    # re-encode, since decodestring will try to do a decode of its own
    msg = quopri.decodestring(msg.encode('utf-8'))

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a 
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object 
# serialization.

class Encoder(json.JSONEncoder):
    def default(self, o): return  list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break

        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}

        if part.get_content_maintype() != 'text':
            print >> sys.stderr, "Skipping MIME content in JSONification ({0})".format(part.get_content_maintype())
            continue

        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport

f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg != None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()

print "All done"

输出结果:

Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (image)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
Skipping MIME content in JSONification (multipart)
All done

注意:正如一些人指出的,短语“我不想跳过任何东西”指的是我可以jSONify大多数mbox,但不能是multipart或images。因此,代码中的部分{表示味精行走():…}被标记为skipping,以证明此代码确实跳过了多部分和图像,因为没有它,我得到的JSON文件中没有图像的二进制文件等。。不过,当我考虑如何将图像和多部分转换为JSON时,它不会出现在最终的代码中。你知道吗


Tags: toinimportjsonforifmsgcontent

热门问题