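A Simple Method for Parsing HTML Tables in Python
This article shows a simple way to parse HTML tables in Python, shared here for reference. The details are as follows:
The code depends on libxml2dom, so make sure it is installed first. Import the module into your script and call the parse_tables() function with the following arguments: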
1. source = a string containing the source code. You can pass in just the table or the entire page code.
2. headers = a list of ints OR a list of strings.
If the headers are ints, this is for tables with no header row; just list the 0-based indexes of the cells in each row (that is, the columns) from which you want to extract data.
If the headers are strings, this is for tables with header columns (with the <th> tags); it will pull the information from the specified columns.
3. table_index = the 0-based index of the table in the source code. If there are multiple tables and the table you want to parse is the third table in the code, then pass in the number 2 here.
It will return a list of lists. Each inner list will contain the parsed information from one row.
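The two accepted shapes of the headers argument can be checked before any parsing starts. The following is only a sketch of such a check: classify_headers is a hypothetical helper name that does not appear in the original code, and rejecting empty or mixed lists with ValueError is an assumption about the desired behavior.

```python
def classify_headers(headers):
    """Return 'index' for a list of ints and 'name' for a list of strings.

    Hypothetical helper (not part of the original code). An empty list or
    a list that mixes types raises ValueError.
    """
    if not headers:
        raise ValueError("headers must be a non-empty list")
    # bool is a subclass of int, so exclude it explicitly
    if all(isinstance(h, int) and not isinstance(h, bool) for h in headers):
        return "index"
    if all(isinstance(h, str) for h in headers):
        return "name"
    raise ValueError("headers must be all ints or all strings")
```

A caller could use the returned label to decide whether to look columns up by position or by header text.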
The full code is as follows:
# The goal of the table parser is to get specific information from specific
# columns in a table.
# Input: source code from a typical website
# Arguments: a list of headers the user wants to return
# Output: a list of lists of the data in each row
import libxml2dom


def parse_tables(source, headers, table_index):
    """parse_tables(string source, list headers, table_index)

    headers may be a list of strings if the table has headers defined, or
    headers may be a list of ints if no headers are defined; in that case
    the data is taken from the cells at those indexes in each row.

    This method returns a list of lists.
    """
    print('Printing headers: %s' % (headers,))
    # an empty list gives nothing to route on
    if not headers:
        return None
    # determine if the headers list is strings or ints and make sure they
    # are all the same type, then route to the correct function
    # if the header type is int
    if all(isinstance(h, int) for h in headers):
        # run the no_header function
        return no_header(source, headers, table_index)
    # if the header type is string
    elif all(isinstance(h, str) for h in headers):
        # run the header_given function
        return header_given(source, headers, table_index)
    else:
        # return None if the headers aren't correct
        return None


# This function takes in the source code of the whole page, a string list of
# headers and the index number of the table on the page. It returns a list of
# lists with the scraped information.
def header_given(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # initiate a list to hold the index numbers of the data in the rows
    header_index = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except (IndexError, TypeError):
        # if the table doesn't exist then return an error
        return ['The table index was not found']
    # get a list of headers in the table
    table_headers = main_table.getElementsByTagName('th')
    # loop through each header looking for matches
    for position, header in enumerate(table_headers):
        # if the header is in the desired headers list
        if header.textContent in headers:
            # add its position to the header_index
            header_index.append(position)
    # get the rows from the table
    rows = main_table.getElementsByTagName('tr')
    # loop through the rows in the table, skipping the first (header) row
    for row_number, row in enumerate(rows):
        if row_number == 0:
            continue
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # iterate through all of the header indexes
        for i in header_index:
            try:
                # append the cell's text content to the cell_list
                cell_list.append(cells[i].textContent)
            except IndexError:
                # the row is shorter than the header row, so skip this cell
                continue
        # append the cell_list to the return_list
        return_list.append(cell_list)
    # return the return_list
    return return_list


# This function takes in the source code of the whole page, an int list of
# headers indicating the index number of the needed item and the index number
# of the table on the page. It returns a list of lists with the scraped info.
def no_header(source, headers, table_index):
    # initiate a list to hold the return list
    return_list = []
    # get a document object out of the source code
    doc = libxml2dom.parseString(source, html=1)
    # get the tables from the document
    tables = doc.getElementsByTagName('table')
    try:
        # try to get focus on the desired table
        main_table = tables[table_index]
    except (IndexError, TypeError):
        # if the table doesn't exist then return an error
        return ['The table index was not found']
    # get all of the rows out of the main_table
    rows = main_table.getElementsByTagName('tr')
    # loop through each row
    for row in rows:
        # get all cells from the current row
        cells = row.getElementsByTagName('td')
        # initiate a list to append into the return_list
        cell_list = []
        # loop through the list of desired headers
        for i in headers:
            try:
                # try to add text from the cell into the cell_list
                cell_list.append(cells[i].textContent)
            except IndexError:
                # the row has no cell at this index, so just continue
                continue
        # append the data scraped into the return_list
        return_list.append(cell_list)
    # return the return list
    return return_list
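For readers who cannot install libxml2dom, roughly the same interface can be sketched with the standard library's html.parser module. This is not the code from this article: it assumes Python 3, and TableCollector and parse_tables_stdlib are hypothetical names. It also assumes that every cell is explicitly closed and that tables are not nested, takes header names only from the first row, and returns None when the table index is out of range.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows of (tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # row currently being filled, or None
        self._cell = None  # [tag, text fragments] being filled, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self.tables[-1].append(self._row)
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # Keep text only while inside a cell
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell[1]).strip()
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr":
            self._row = None


def parse_tables_stdlib(source, headers, table_index):
    """Return one list per row, holding the text of the requested columns."""
    parser = TableCollector()
    parser.feed(source)
    parser.close()
    if not 0 <= table_index < len(parser.tables):
        return None
    rows = parser.tables[table_index]
    if headers and all(isinstance(h, str) for h in headers):
        if not rows:
            return []
        # Map the requested names to column positions via the first row
        names = [text for tag, text in rows[0] if tag == "th"]
        columns = [i for i, name in enumerate(names) if name in headers]
        data_rows = rows[1:]
    else:
        columns = list(headers)
        data_rows = rows
    result = []
    for row in data_rows:
        cells = [text for tag, text in row if tag == "td"]
        result.append([cells[i] for i in columns if 0 <= i < len(cells)])
    return result


html = """
<table>
  <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
  <tr><td>Apple</td><td>1.20</td><td>30</td></tr>
  <tr><td>Pear</td><td>0.80</td><td>12</td></tr>
</table>
"""
print(parse_tables_stdlib(html, ["Name", "Stock"], 0))
# [['Apple', '30'], ['Pear', '12']]
print(parse_tables_stdlib(html, [0, 1], 0))
# [[], ['Apple', '1.20'], ['Pear', '0.80']]
```

In integer mode the header row produces an empty list because it contains no td cells, which matches how no_header treats such a row in the listing above.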
I hope this article is helpful for your Python programming.