[darcs-users] [patch260] Implement darcs optimize --http (and 2 more)

Petr Rockai me at mornfall.net
Thu Jun 3 20:12:19 UTC 2010


Hi,

Alexey Levan <bugs at darcs.net> writes:
> In this version files get decompressed, packed and then compressed
> back, resulting in ~2x smaller pack sizes.  Also, fixed crashing on
> access to unoptimized repositories, and implemented getting patches
> pack.

Thanks for your patch(es)! I have read through them and I have a few
things to discuss -- you should find (and reply to) the comments inline.
Ta.

Implement darcs optimize --http
--------------------------------

> hunk ./src/Darcs/Commands/Get.lhs 161
>      putInfo opts $ text "Warning: 'old-fashioned-inventory' is ignored with a darcs-2 repository\n"
>    let opts' = if formatHas Darcs2 rfsource
>                then UseFormat2:opts
> -              else if not (UseOldFashionedInventory `elem` opts)
> +              else if not (UseOldFashionedInventory `elem` opts) &&
> +                      not (Partial `elem` opts)
>                     then UseHashedInventory:filter (/= UseFormat2) opts
>                     else UseOldFashionedInventory:filter (/= UseFormat2) opts
>    createRepository opts'
I am not sure you want this hunk included in this patch. Could you
explain what was the intention of this change? (There was a discussion
in #darcs recently about --partial behaviour, but this certainly needs
to be done as a separate patch, if at all.)

> hunk ./src/Darcs/Commands/Optimize.lhs 36
>  import Darcs.Arguments ( DarcsFlag( UpgradeFormat, UseHashedInventory,
>                                      Compress, UnCompress,
>                                      NoCompress, Reorder,
> -                                    Relink, RelinkPristine, OptimizePristine ),
> +                                    Relink, RelinkPristine, OptimizePristine,
> +                                    OptimizeHTTP ),
>                          reorderPatches,
>                          uncompressNocompress,
>                          relink, relinkPristine, sibling,
> hunk ./src/Darcs/Commands/Optimize.lhs 43
>                          flagsToSiblings,
>                          upgradeFormat,
> -                        workingRepoDir, umaskOption, optimizePristine
> +                        workingRepoDir, umaskOption, optimizePristine,
> +                        optimizeHTTP
>                        )
>  import Darcs.Repository.Prefs ( getPreflist )
>  import Darcs.Repository ( Repository,
OK

> hunk ./src/Darcs/Commands/Optimize.lhs 89
>  import Storage.Hashed.Plain( readPlainTree )
>  import Storage.Hashed.Darcs( writeDarcsHashed )
>  
> +import Codec.Archive.Tar ( create )
> +
>  #include "gadts.h"
>  
>  optimizeDescription :: String
OK

> hunk ./src/Darcs/Commands/Optimize.lhs 134
>                                                   sibling, relink,
>                                                   relinkPristine,
>                                                    upgradeFormat,
> -                                                 optimizePristine]}
> +                                                 optimizePristine,
> +                                                 optimizeHTTP]}
>  
>  optimizeCmd :: [DarcsFlag] -> [String] -> IO ()
>  optimizeCmd origopts _ = do
OK

> hunk ./src/Darcs/Commands/Optimize.lhs 140
>      when (UpgradeFormat `elem` origopts) optimizeUpgradeFormat
> +    when (OptimizeHTTP `elem` origopts) doOptimizeHTTP
>      withRepoLock opts $- \repository -> do
>      if (OptimizePristine `elem` opts)
>         then doOptimizePristine repository
This should maybe run within a repository lock?

> hunk ./src/Darcs/Commands/Optimize.lhs 364
>      withCurrentDirectory dir $ do
>        gzs <- filter ((== ".gz") . takeExtension) `fmap` getDirectoryContents "."
>        mapM_ removeFile gzs
> +
> +doOptimizeHTTP :: IO ()
> +doOptimizeHTTP = withRepoLock [] $- \_ -> do
> +  createDirectoryIfMissing False packsDir
> +  create (packsDir </> "invprist.tar") "" $ map (darcsdir </>)
> +    ["hashed_inventory", "inventories", "pristine.hashed"]
> +  create (packsDir </> "patches.tar") "" $ [darcsdir </> "patches"]
> + where
> +  packsDir = darcsdir </> "packs"
>  \end{code}

I am not very happy with the "invprist" filename. At first sight, it
sounds like "inverted pristine" to me, although I understand it's meant
to say inventories and pristine. So one (probably too verbose) option
would be to name it inventories_and_pristine.tar, or possibly
inventories+pristine (but needs escaping in URLs) or maybe something
like "basic.tar" or "minimal.tar" that would convey that this is a
minimal working repository.

> hunk ./src/Darcs/Repository.hs 94
>  import Darcs.Hopefully ( PatchInfoAnd, info, extractHash )
>  import Darcs.Repository.Checkpoint ( identifyCheckpoint, writeCheckpointPatch, getCheckpoint )
>  import Darcs.Repository.ApplyPatches ( applyPatches )
> -import Darcs.Repository.HashedRepo ( applyToTentativePristine, pris2inv )
> +import Darcs.Repository.HashedRepo ( applyToTentativePristine, pris2inv, revertTentativeChanges )
>  import Darcs.Repository.InternalTypes ( Pristine(..), extractOptions )
>  import Darcs.Patch ( RepoPatch, Named, Prim, Patch, patch2patchinfo, apply )
>  import Darcs.Witnesses.Ordered ( FL(..), RL(..), bunchFL, mapFL, mapRL
> hunk ./src/Darcs/Repository.hs 98
> -                               , reverseRL ,lengthRL, (+>+) )
> +                               , reverseRL ,lengthRL, (+>+), (:\/:)(..) )
>  import Darcs.Patch.Info ( PatchInfo )
>  import Darcs.Repository.Format ( RepoProperty ( HashedInventory ),
>                                   createRepoFormat, formatHas, writeRepoFormat )
> hunk ./src/Darcs/Repository.hs 104
>  import Darcs.Repository.Prefs ( writeDefaultPrefs )
>  import Darcs.Repository.Pristine ( createPristine, flagsToPristine, createPristineFromWorking )
> -import Darcs.Patch.Depends ( getPatchesBeyondTag, areUnrelatedRepos )
> +import Darcs.Patch.Depends ( getPatchesBeyondTag, areUnrelatedRepos, findCommonAndUncommon )
>  import Darcs.Utils ( withCurrentDirectory, catchall, promptYorn, prettyError )
> hunk ./src/Darcs/Repository.hs 106
> -import Darcs.External ( copyFileOrUrl, Cachable(..) )
> +import Darcs.External ( copyFileOrUrl, Cachable(..), fetchFilePS )
> +import Darcs.Lock ( withTemp )
>  import Progress ( debugMessage, tediousSize, beginTedious, endTedious )
>  import Darcs.ProgressPatches (progressRLShowTags, progressFL)
>  import Darcs.Lock ( writeBinFile, writeDocBinFile, rmRecursive )
> hunk ./src/Darcs/Repository.hs 125
>  import ByteStringUtils( gzReadFilePS )
>  
>  import System.FilePath( (</>) )
> +import Codec.Archive.Tar ( extract )
>  
>  import qualified Data.ByteString.Char8 as BS
>  
OK (just imports)

> hunk ./src/Darcs/Repository.hs 148

>  copyRepository :: RepoPatch p => Repository p C(r u t) -> IO ()
> -copyRepository fromrepository@(Repo _ opts rf _)
> +copyRepository fromRepo@(Repo fromDir opts rf _)
>      | Partial `elem` opts && not (formatHas HashedInventory rf) =
> hunk ./src/Darcs/Repository.hs 150
> -        do isPartial <- copyPartialRepository fromrepository
> -           unless (isPartial == IsPartial) $ copyFullRepository fromrepository
> -    | otherwise = copyFullRepository fromrepository
> +        do isPartial <- copyPartialRepository fromRepo
> +           unless (isPartial == IsPartial) $ copyFullRepository fromRepo
> +    | otherwise = do
> +        isPacked <- doesRemoteFileExist $ fromDir </> darcsdir </>
> +          "packs" </> "invprist.tar"
> +        if isPacked
> +          then copyPackedRepository fromRepo
> +          else copyFullRepository fromRepo
> + where
> +  doesRemoteFileExist x = fetchFilePS x Cachable >> return True `catch`
> +    (\_ -> return False)
It is not very clear that this does what it seems to want to do. I am
assuming that your intention was to get the download going in this
place, and have the second request in copyPackedRepository below
(copyFileOrUrl) to be satisfied from the already fetched file.

However, the Cachable bit only seems to affect proxies or generally the
Pragma/Cache-Control HTTP headers and nothing much else. So it seems to
me that this code as it is will actually download the code twice. For
all I can tell, fetchFilePS is completely strict, so it will wait till
the download is complete, and it will erase its temporary file after
reading it into memory.

I think the way to fix this is to just try copyPackedRepository right
away and if it fails, fall back to copyFullRepository -- you will have
to reorder copyPackedRepository a bit to *first* grab the tarball and
only then set up prefs and finally unpack the tarball.

>  data PorNP = NotPartial | IsPartial
>               deriving ( Eq )
> hunk ./src/Darcs/Repository.hs 261
>                     debugMessage "Writing the pristine"
>                     pristineFromWorking torepository
>  
> +copyPackedRepository :: forall p C(r u t). RepoPatch p => Repository p C(r u t) -> IO ()
> +copyPackedRepository fromRepo@(Repo fromDir opts _ (DarcsRepository _ fromCache)) = do
> +  debugMessage "Copying prefs"
> +  copyFileOrUrl opts (fromDir ++ "/" ++ darcsdir ++ "/prefs/prefs")
> +    (darcsdir ++ "/prefs/prefs") (MaxAge 600) `catchall` return ()
> +  debugMessage "Grabbing lock in new repository..."
> +  Repo toDir _ toFormat (DarcsRepository toPristine toCache) <-
> +    identifyRepositoryFor fromRepo "."
> +  toCache2 <- unionRemoteCaches toCache fromCache fromDir
> +  let toRepo :: Repository p C(r u t)
> +      toRepo = Repo toDir opts toFormat $ DarcsRepository toPristine toCache2
> +  -- unpack inventory & pristine cache
> +  withTemp $ \tmp -> do
> +    let fromPacksDir = fromDir ++ "/" ++ darcsdir ++ "/packs/"
> +    copyFileOrUrl [] (fromPacksDir ++ "invprist.tar") tmp Uncachable
> +    extract toDir tmp
> +    createPristineDirectoryTree toRepo "."
> +  -- pull new patches
> +  us <- readRepo toRepo
> +  them <- readRepo fromRepo
> +  comm :\/: unc <- return $ findCommonAndUncommon us them
> +  revertTentativeChanges
> +  Sealed pw <- tentativelyMergePatches toRepo "get" opts comm unc
> +  withGutsOf toRepo $ do
> +    finalizeRepositoryChanges toRepo
> +    applyToWorking toRepo opts pw
> +    return ()
> +  -- get old patches
> +  fetchPatchesIfNecessary opts toRepo
Other than the above observation, I think this is basically OK. It may
be useful to factor the "pull new patches" bit into a helper function
(or look if there is one already, although there probably isn't). Or
maybe not. I defer that decision to you.

Recompress repository packs
---------------------------

> hunk ./src/Darcs/Commands/Optimize.lhs 24
>  {-# LANGUAGE CPP #-}
>  
>  module Darcs.Commands.Optimize ( optimize ) where
> -import Control.Monad ( when, unless )
> +import Control.Applicative ( (<$>) )
> +import Control.Monad ( when, unless, (<=<) )
>  import Data.Maybe ( isJust )
>  import System.Directory ( getDirectoryContents, doesDirectoryExist, doesFileExist )
>  import qualified Data.ByteString.Char8 as BS
> hunk ./src/Darcs/Commands/Optimize.lhs 29
> +import qualified Data.ByteString.Lazy as BL
>  
>  import Storage.Hashed.Darcs( decodeDarcsSize )
>  
> hunk ./src/Darcs/Commands/Optimize.lhs 91
>  import Storage.Hashed.Plain( readPlainTree )
>  import Storage.Hashed.Darcs( writeDarcsHashed )
>  
> -import Codec.Archive.Tar ( create )
> +import Codec.Archive.Tar ( write )
> +import Codec.Archive.Tar.Entry ( fileEntry, toTarPath )
> +import Codec.Compression.GZip ( compress )
>  
>  #include "gadts.h"
>  
OK (imports again)

> hunk ./src/Darcs/Commands/Optimize.lhs 371
>  
>  doOptimizeHTTP :: IO ()
>  doOptimizeHTTP = withRepoLock [] $- \_ -> do
> +  rf <- either fail return =<< identifyRepoFormat "."
> +  unless (formatHas HashedInventory rf) $ fail
> +    "Unsupported repository format"
>    createDirectoryIfMissing False packsDir
Good point. Does this maybe belong to a separate patch, or maybe to the
previous one? Does not seem to be directly related to re-compression.

> hunk ./src/Darcs/Commands/Optimize.lhs 375
> -  create (packsDir </> "invprist.tar") "" $ map (darcsdir </>)
> -    ["hashed_inventory", "inventories", "pristine.hashed"]
> -  create (packsDir </> "patches.tar") "" $ [darcsdir </> "patches"]
> +  i <- fileEntry' $ darcsdir </> "hashed_inventory"
> +  is <- tarDarcsDir "inventories"
> +  pr <- tarDarcsDir "pristine.hashed"
> +  BL.writeFile (packsDir </> "invprist.tar.gz") . compress $ write (i : (is ++ pr))
> +  ps <- tarDarcsDir "patches"
> +  BL.writeFile (packsDir </> "patches.tar.gz") . compress $ write ps
>   where
>    packsDir = darcsdir </> "packs"
> hunk ./src/Darcs/Commands/Optimize.lhs 383
> +  fileEntry' x = do
> +    content <- BL.fromChunks . return <$> gzReadFilePS x
> +    tp <- either fail return (toTarPath False x)
> +    return $ fileEntry tp content
> +  dirContents d = map (d </>) . filter ((/= '.') . head) <$> getDirectoryContents d
> +  tarDarcsDir = mapM fileEntry' <=< dirContents . (darcsdir </>)
OK

> hunk ./src/Darcs/Repository.hs 48
>      ) where
>  
>  import System.Exit ( ExitCode(..), exitWith )
> +import Data.List ( isSuffixOf )
>  import Data.Maybe( catMaybes )
>  
>  import Darcs.Repository.State( readRecorded, readUnrecorded, readWorking, unrecordedChanges
> hunk ./src/Darcs/Repository.hs 86
>  import URL ( maxPipelineLength )
>  
>  import Control.Monad ( unless, when )
> -import System.Directory ( createDirectory, renameDirectory )
> +import System.Directory ( createDirectory, renameDirectory, createDirectoryIfMissing )
>  import System.IO.Error ( isAlreadyExistsError )
>  
>  import qualified Darcs.Repository.DarcsRepo as DarcsRepo
> hunk ./src/Darcs/Repository.hs 108
>  import Darcs.Patch.Depends ( getPatchesBeyondTag, areUnrelatedRepos, findCommonAndUncommon )
>  import Darcs.Utils ( withCurrentDirectory, catchall, promptYorn, prettyError )
>  import Darcs.External ( copyFileOrUrl, Cachable(..), fetchFilePS )
> -import Darcs.Lock ( withTemp )
>  import Progress ( debugMessage, tediousSize, beginTedious, endTedious )
>  import Darcs.ProgressPatches (progressRLShowTags, progressFL)
>  import Darcs.Lock ( writeBinFile, writeDocBinFile, rmRecursive )
> hunk ./src/Darcs/Repository.hs 125
>  import ByteStringUtils( gzReadFilePS )
>  
>  import System.FilePath( (</>) )
> -import Codec.Archive.Tar ( extract )
> -
> +import qualified Codec.Archive.Tar as Tar
> +import Codec.Compression.GZip ( compress, decompress )
>  import qualified Data.ByteString.Char8 as BS
> hunk ./src/Darcs/Repository.hs 128
> +import qualified Data.ByteString.Lazy as BL
>  
>  #include "impossible.h"
>  
OK, more imports

> hunk ./src/Darcs/Repository.hs 155
>             unless (isPartial == IsPartial) $ copyFullRepository fromRepo
>      | otherwise = do
>          isPacked <- doesRemoteFileExist $ fromDir </> darcsdir </>
> -          "packs" </> "invprist.tar"
> +          "packs" </> "invprist.tar.gz"
>          if isPacked
>            then copyPackedRepository fromRepo
>            else copyFullRepository fromRepo
OK

> hunk ./src/Darcs/Repository.hs 160
>   where
> -  doesRemoteFileExist x = fetchFilePS x Cachable >> return True `catch`
> -    (\_ -> return False)
> +  doesRemoteFileExist x = (fetchFilePS x Cachable >> return True) `catchall`
> +    return False
Changes catch to catchall, probably quite OK. Maybe should be amend of
previous though, instead of part of this one.

>  data PorNP = NotPartial | IsPartial
>               deriving ( Eq )
> hunk ./src/Darcs/Repository.hs 273
>    toCache2 <- unionRemoteCaches toCache fromCache fromDir
>    let toRepo :: Repository p C(r u t)
>        toRepo = Repo toDir opts toFormat $ DarcsRepository toPristine toCache2
> +      fromPacksDir = fromDir ++ "/" ++ darcsdir ++ "/packs/"
> +  createDirectoryIfMissing False $ toDir </> darcsdir </> "inventories"
> +  createDirectoryIfMissing False $ toDir </> darcsdir </> "pristine.hashed"
> +  createDirectoryIfMissing False $ toDir </> darcsdir </> "patches"
>    -- unpack inventory & pristine cache
> hunk ./src/Darcs/Repository.hs 278
> -  withTemp $ \tmp -> do
> -    let fromPacksDir = fromDir ++ "/" ++ darcsdir ++ "/packs/"
> -    copyFileOrUrl [] (fromPacksDir ++ "invprist.tar") tmp Uncachable
> -    extract toDir tmp
> -    createPristineDirectoryTree toRepo "."
> +  writeCompressed . Tar.read . decompress . BL.fromChunks . return =<<
> +    fetchFilePS (fromPacksDir ++ "invprist.tar.gz") Uncachable
This is a little dangerous, since it means having the complete tarball
in memory at once. The version with temporary directory is probably
better. You can presumably decompress and unpack the tarball in a
streamed fashion from disk somehow? This may need some benchmarking and
maybe some creative way to measure memory usage during decompression.

> +  createPristineDirectoryTree toRepo "."
>    -- pull new patches
>    us <- readRepo toRepo
>    them <- readRepo fromRepo
> hunk ./src/Darcs/Repository.hs 292
>      applyToWorking toRepo opts pw
>      return ()
>    -- get old patches
> -  fetchPatchesIfNecessary opts toRepo
> +  writeCompressed . Tar.read . decompress . BL.fromChunks .
> +    return =<< fetchFilePS (fromPacksDir ++ "patches.tar.gz") Uncachable
> + where
> +  writeCompressed Tar.Done = return ()
> +  writeCompressed (Tar.Next x xs) = case Tar.entryContent x of
> +    Tar.NormalFile x' _ -> do
> +      let p = Tar.entryPath x
> +      BL.writeFile p $ if "hashed_inventory" `isSuffixOf` p
> +        then x'
> +        else compress x'
> +      writeCompressed xs
> +    _ -> fail ""
> +  writeCompressed (Tar.Fail e) = fail e
OK

Yours,
    Petr.


More information about the darcs-users mailing list